sergiocorreia / panflute

A Pythonic alternative to John MacFarlane's pandocfilters, with extra helper functions
http://scorreia.com/software/panflute/
BSD 3-Clause "New" or "Revised" License

Misc. ideas #12

Closed sergiocorreia closed 7 years ago

sergiocorreia commented 7 years ago
  1. The .toJSONFilter() and .toJSONFilters() method names are not Pythonic at all (and hard to understand unless you were a previous user of pandocfilters or know the internals of Pandoc). Maybe change it to something like .run_filter() and .run_filters() (but keep the old names as wrappers for compat!)
  2. Running several filters one after another is slow, because each filter has to decode JSON from stdin and then encode it back, and that's where most of the time is spent.
  3. Maybe we can list the filters we want to run in the YAML metadata (eg: panflute-filters: onefilter, another).
  4. Then, we can do pandoc somedoc.md -F panflute, and panflute itself can be used as a filter that calls the filters listed in the metadata. This fixes problem 2.
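
Points 2 and 4 can be illustrated without panflute itself. What follows is only a minimal sketch, not panflute's implementation: it assumes the document arrives as pandoc's JSON object with a "blocks" key, and that each filter is a plain function that takes one AST node (a dict with a "t" key) and returns a node. The names walk, run_all, upper_str and emph_to_strong are hypothetical. The JSON is decoded once, every filter is applied in turn, and the result is encoded once.

```python
import json


def walk(node, action):
    """Apply `action` bottom-up to every dict node that carries a "t" (type) key."""
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        node = {key: walk(value, action) for key, value in node.items()}
        if "t" in node:
            return action(node)
    return node


def run_all(json_text, actions):
    """Decode once, run every action over the blocks, encode once."""
    doc = json.loads(json_text)
    for action in actions:
        doc["blocks"] = walk(doc["blocks"], action)
    return json.dumps(doc)


def upper_str(node):
    """Hypothetical filter: upper-case every Str element."""
    if node["t"] == "Str":
        return {"t": "Str", "c": node["c"].upper()}
    return node


def emph_to_strong(node):
    """Hypothetical filter: turn emphasis into strong emphasis."""
    if node["t"] == "Emph":
        return {"t": "Strong", "c": node["c"]}
    return node
```

Calling run_all(text, [upper_str, emph_to_strong]) pays the JSON decode and encode cost once, however many filters are in the list.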

TLDR:

sergiocorreia commented 7 years ago

A few useful filters should be more easily available:

ickc commented 7 years ago

Allow panflute to be run as a filter, where it calls the list of filters listed in the metadata.

Is there any interest in turning this repo into a centralized panflute filters gallery? I'm building an extended version of panflute's csv2table, based on yours, in ickc/pandoc-table-csv-test/panflute-csv2table.ipynb. I have almost finished it (I still need to think about the exact metadata keys to use, cleanup, etc.) and am thinking about how to distribute it.

On pandoc-discuss we discussed the need for a centralized pandoc filters library that is also easy to install. I'm thinking maybe we can start from panflute? So, say, everyone makes pull requests of their scripts into panflute (with some minimum requirements, say, naming scheme, version numbering, etc.), and then they are bundled with panflute, with the said metadata controlling which filters are used. All people would then need is to add, say, --filter=panflute to the pandoc args.

I am considering porting my pandoc-amsthm to panflute too. And I need a variant of pandoc-includes (the one in panflute seems great). I considered writing Haskell filters, but it is a pain to make sure colleagues can install them. pip is much easier (because Python is almost ubiquitous), but @jgm specifically said his pandocfilters isn't a centralized repository. So you are my last hope to streamline the use of pandoc filters. No pressure though. 😄

kdheepak commented 7 years ago

I agree that having a centralized repository will be the best path forward. Would this fit an organizational structure better?

sergiocorreia commented 7 years ago

This would be a cool thing to have. Now, how would this work exactly?

About the role of panflute: maybe we can list the filters used as metadata, and then have panflute auto-install them from this repo?

ickc commented 7 years ago

Organization

I have been thinking about setting up a GitHub organization for pandoc. It would actually be nice to have pandoc, panflute, etc. all fall under one umbrella organization. I haven't asked @jgm, but I think he probably wouldn't want to do that.

[Sidenote about the organization name: sadly, pandoc has already been taken by someone who has one repo with no active development, whose contents are pirated Chinese fiction in pandoc markdown. I already filed a complaint (that the content violates copyright and hence the GitHub terms and conditions), but GitHub refuses to take it down and requires the copyright owner to file the complaint. However, while I "know" the copyright owners (some of the best Chinese fiction writers), they don't know me.]

Anyway, I suggest that if an organization is set up, its name should be more generic and allow the inclusion of projects other than panflute. This will become the "centralized gallery" I've been talking about. Possible names are

all filters in one repo

I agree: wherever that repo is (say, if a GitHub organization is used), panflute being able to auto-install from it behind the scenes would be excellent. In the latest version of pandoc, that just means putting the filter in data-dir/filters, which seems more secure. But in earlier versions of pandoc, it means panflute needs to either put those filters somewhere on the PATH, or add the directory it installs to onto the PATH. Either way, it is insecure. I guess if this feature is implemented, we should say it is for pandoc >= 1.18 only.

Centralized Gallery

I kind of did the CSV thing in ickc/pandoc-filters/pandoc-filters.csv for currently available filters (but far from finished).

I think the list of all (panflute-)filters should live in the same repo that contains those filters (say, panflute-filters), for easier organization. We can ask whoever makes a pull request to the repo to also add their entry to the list (with a link to the documentation, perhaps).

However, there can be another, separate repo that contains references to filters not in panflute-filters (maybe just transfer mine to the organization).

I think if we could auto-generate a website gallery from it, it would be great for filter discovery. I have some vague ideas about it, but don't know the best way to do it. (gh-pages has more limitations but integrates seamlessly with GitHub. Travis is needed for tests anyway, but requires more setup to customize a website build. And then there is the question of which tool to use: jekyll, yst, makefile+pandoc, etc.)

ickc commented 7 years ago

Just to mention another bonus of having a centralized repo for panflute filters: the naming scheme for filters can be shorter. Currently, people name their filters pandoc-includes, pandoc-csv2tables, pandoc-placetables, pandoc-amsthm, etc. because they are submitted to cabal, pip, etc., and the pandoc prefix identifies them among the sea of packages. If the panflute filters live in one repo, the prefix won't be necessary, which allows cleaner, shorter names.

sergiocorreia commented 7 years ago

I like your proposal, but my main concern is that complexity can explode. I think that there are several interlinked issues that we will benefit from treating separately:

  1. Filter hosting: let's not host all filters in one repo, it's not really needed and would create barriers to adoption. Something that might be useful is to allow for yaml files with the description of each filter (e.g. if you have pandoc-csv.py, then also have pandoc-csv.yaml, that has labels, description, sample usage, etc.)
  2. Instead, let's have a repo that lists the filters. So whenever you add a filter and want it indexed, just submit a PR with a one-line change. This repo can be in the pandoc-extras org (something like pandoc-extras/panflute-filters)
  3. Independently of that, someone can scrape the repos listed in (2) and create a nice gallery, also using the metadata discussed in (1).
  4. Finally, installing can be done from panflute.

About step 4, do you know how to use setup.py to include executable files? (I think they're called entry points.) It would be cool if we allowed panflute to be a filter, so if you do pandoc -F panflute .. then panflute checks the metadata and downloads and runs the required filters.

ickc commented 7 years ago

panflute as filter

I think it is something like

entry_points={
    'console_scripts': [
        'panflute = panflute:main',
    ],
},

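(And you need to provide a main function in the panflute package for the entry point to call; a __main__ module is only needed if you also want python -m panflute to work.) If you want CLI options, getopt would work.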
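
To make the idea concrete, here is a rough sketch of what such a main could do with the panflute-filters metadata key proposed earlier in this thread. It is only an illustration under assumptions: stringify and filter_names are hypothetical helpers (not panflute's API), the document is assumed to be pandoc's JSON object with a "meta" key, and the step that actually locates and runs each filter is left as a comment.

```python
import json
import sys


def stringify(node):
    """Flatten a metadata node into plain text (rough: only Space is special-cased)."""
    if isinstance(node, str):
        return node
    if isinstance(node, list):
        return "".join(stringify(item) for item in node)
    if isinstance(node, dict):
        if node.get("t") == "Space":
            return " "
        return stringify(node.get("c", ""))
    return ""


def filter_names(doc):
    """Read filter names from the `panflute-filters` metadata key, if present."""
    meta = doc.get("meta", {}).get("panflute-filters")
    if meta is None:
        return []
    if meta.get("t") == "MetaList":
        # YAML list form: one entry per filter
        items = [stringify(item) for item in meta["c"]]
    else:
        # Inline form: "onefilter, another"
        items = stringify(meta).split(",")
    return [name.strip() for name in items if name.strip()]


def main(input_stream=None, output_stream=None):
    """Hypothetical entry point: read the AST once, dispatch, write it once."""
    input_stream = input_stream or sys.stdin
    output_stream = output_stream or sys.stdout
    doc = json.load(input_stream)
    names = filter_names(doc)
    # A real implementation would locate each named filter and apply it to `doc` here.
    sys.stderr.write("filters requested: " + ", ".join(names) + "\n")
    json.dump(doc, output_stream)
```

With this shape, pandoc -F panflute would start panflute once, and the list in the metadata decides which filters run inside that single process.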

Centralized repo or not?

One of the complexities we need to balance is security. Let's say panflute chooses the safest approach, which only copies filters to $DATA-DIR/filters and supports this feature (auto-downloading filters) for pandoc >= 1.18 only. Even in this case, there might be security implications, since a user might have previously added $DATA-DIR/filters to their PATH (when they were working with an earlier version of pandoc). Anything copied to that folder would then be on the PATH and executable without sudo (probably, depending on the user's setup). So panflute would open a point of attack for installing arbitrary code.

And even if $DATA-DIR/filters is not in the PATH, panflute running the filters automatically still means it's an opening for attack.
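
One cheap safeguard an installer could apply is to check whether the target directory is already on the PATH before copying anything into it. The sketch below is only an illustration; dir_on_path is a hypothetical helper, and it covers only the first concern above (not the risk of panflute running downloaded filters itself).

```python
import os


def dir_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def normalize(path):
        # Expand ~, make absolute, and drop trailing separators before comparing.
        return os.path.normcase(os.path.abspath(os.path.expanduser(path)))

    target = normalize(directory)
    for entry in path_value.split(os.pathsep):
        if not entry:
            # Empty entries are skipped in this sketch.
            continue
        if normalize(entry) == target:
            return True
    return False
```

An installer could refuse to auto-install, or at least warn, when dir_on_path returns True for the filters directory.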

[Sidenote: I'm considering writing a filter that can execute code in the markdown source, say, through exec or ! in IPython. This also has security implications. Hypothetically, if such a filter were submitted as a pull request to the said centralized repository, I'm not sure whether it should be accepted, for security reasons.]

That's the reason behind having the filters hosted in the same centralized repository. This way, the core-developers can verify the code is not malicious, and any change to the code requires a separate pull request for sanitization.

[another sidenote: I think the closest thing to our idea is \usepackage in LaTeX. Arbitrary \usepackage can be specified in the document, so the packages are centralized in CTAN for sanitization and distribution.]

If we really do not want centralized hosting, then we might need to learn from how, say, brew handles it. For each additional unknown repository, you need to brew tap it manually. brew also computes the SHA-256 sum to check that the source hasn't been modified (meaning that if the source is modified, a separate pull request is required to update the SHA-256 sum, so in principle the change is reviewed). This approach, however, takes away the "seamless" part of our (at least my) dream.
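
The checksum idea is easy to sketch with the standard library. This is only an illustration of pinning a digest, not how brew implements it; sha256_hex and verify_filter are hypothetical names, and the pinned digest is assumed to come from a reviewed index entry.

```python
import hashlib


def sha256_hex(data):
    """Hex-encoded SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def verify_filter(source_bytes, pinned_digest):
    """Return True only if the downloaded source matches the pinned digest."""
    return sha256_hex(source_bytes) == pinned_digest.strip().lower()


# Usage: the index pins the digest of the reviewed version of a filter.
reviewed_source = b"print('a filter')\n"
pinned = sha256_hex(reviewed_source)

# Later, a downloaded copy is accepted only if it still matches.
downloaded = reviewed_source + b"# modified after review\n"
accepted = verify_filter(downloaded, pinned)  # False: the source changed
```

Any change to a filter's source then forces an update of the pinned digest, which is the point at which a maintainer can review it.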

But I understand the concern about complexity. For example, we can define rules for submitting a filter (including a clear standard for specifying the author). Every issue submitted has to name the author, who then deals with the bug (this is how Travis CI supports third-party, community-maintained languages). In addition, tests and docs might also be required.

There's potentially a problem of resistance to adoption, and it might consume too much time (who knows how much busier we will become). But I think the security issue is more important. panflute would be given too much power (by downloading arbitrary executable code, either directly or indirectly), and hence we should guard the filters it can download more carefully.

On the other hand, the added barrier might mean higher-quality filters being submitted, and fewer pull requests to deal with. Given that the pandoc community is relatively small and (probably) not many people are writing pandoc filters (although I'm sure one of the goals of panflute is to change this!), we probably won't be too busy. (One data point to back up this argument: after a decade, the list in Pandoc Filters · jgm/pandoc Wiki is not very long. I'm sure only the people who are motivated enough to put a link to their filter in the pandoc wiki will be motivated to try a centralized filter repo.)

By the way, I don't think having their filters submitted to a centralized repo means they can't have their own repo. Just like with CTAN, some of the sources are hosted elsewhere (say, on GitHub). They can even write a script to prepare their code for submission to our centralized repo.

sergiocorreia commented 7 years ago

Closing this as all the ideas are now either in separate issues or have been implemented.

Also see: https://github.com/sergiocorreia/panflute/projects/1