sergiocorreia commented 7 years ago

Filters

First, each filter needs a specific structure, as seen here and here

The key thing is the main() hook:

Every panflute script needs to end with this:

def main(doc=None):
    # pf is panflute (import panflute as pf); action is the filter's own function
    return pf.run_filter(action, doc=doc)

if __name__ == '__main__':
    main()

Or a variant of this, but always i) with a main() function, ii) that receives an optional argument doc, which is sent to pf.run_filter (or any of run_filters, toJSONFilter, toJSONFilters), and iii) that returns the output of the call.

YAML for filters

Optionally, filters have an accompanying YAML file, as here: https://github.com/sergiocorreia/panflute-filters/blob/master/filters/debug.yaml

The metadata shown in the example is currently overkill, but ideally it should be used to construct a gallery of filters, search for specific filters, update them when a new version appears, etc.

Index of filters

It's a simple YAML file that points to the yaml (or .py) files: https://github.com/pandoc-extras/packages/blob/master/filters.yaml

Everything is easy to extend to things besides filters (in this case, just have a separate yaml file)

panflute autofilters

You can have metadata in the form of panflute-filters: somefilter or panflute-filters: [filter1, filter2]. Additionally, you can have panflute-verbose: true and panflute-path: somepath entries.

Panflute will search in the current dir, or datapath, or the path indicated in the metadata, or $PATH, for the filter, and if found, run it.

Note: this is currently not integrated with pandocpm, so no auto-installs will occur
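
The lookup described above can be sketched in a few lines. This is only an illustration of the stated search order, not panflute's actual code: the function names, the `.py` fallback, and the way the data directory is passed in are all assumptions.

```python
import os
from pathlib import Path


def normalize_filters(value):
    """Accept `panflute-filters` as a single name or a list of names."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def search_dirs(metadata, datadir=None, env_path=""):
    """Build the search order: current dir, datadir, panflute-path, then $PATH.

    `datadir` and `env_path` are passed in explicitly (hypothetical parameters)
    so the function has no hidden dependency on the environment.
    """
    dirs = [Path.cwd()]
    if datadir:
        dirs.append(Path(datadir))
    extra = metadata.get("panflute-path")
    if extra:
        dirs.append(Path(extra))
    dirs.extend(Path(p) for p in env_path.split(os.pathsep) if p)
    return dirs


def find_filter(name, dirs):
    """Return the first `<name>` or `<name>.py` found in dirs, else None."""
    for d in dirs:
        for candidate in (d / name, d / f"{name}.py"):
            if candidate.is_file():
                return candidate
    return None
```

A caller would do something like `find_filter(name, search_dirs(meta, env_path=os.environ.get("PATH", "")))` for each name returned by `normalize_filters(meta.get("panflute-filters"))`.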

Downloading with `pandocpm`

After installation, type pandocpm --help

As an example, these are some common patterns:

pandocpm install filter debug
pandocpm install filter debug --verbose
pandocpm install filter debug --replace
pandocpm uninstall filter debug

You can also set specific folders to install, or alternative indexes

Pending work

A way to update/update all filters
Port and add filters to the index
Go beyond filters
A way to list/search filters
Go beyond panflute filters into more general filters (this should already be possible through pip installs)
There are probably quite a few bugs, as this is quite alpha
We need much more documentation

Edit: the checklist bubble is removed and migrated to #3.

ickc commented 7 years ago

I finally have some time to test it. It works great. I probably will spend more time on this later this week.

A few random notes:

Eventually, should this one include panflute and pandocfilters as dependencies? The benefits are:
- the end user only needs to install this (and in the end, packaged installers could even include them)
- the same mechanism seems to work for pandocfilters as well
A trusted "list", so that panflute can install a filter automatically if it doesn't find it and the filter is on the trusted "list". That "list" could be a centralized organization like this one, or a separate YAML file that someone maintains. In the case of a centralized organization, it is optional (they just have to use this package manager or some other way to install it). For example, that's what TextMate did: most bundles live in the TextMate organization. The difference is that theirs is organized by TextMate's author, while ours is third-party. Maybe I should ask for opinions on this on pandoc-discuss.

ickc commented 7 years ago

Another question is whether it should point to a fixed version (or a particular commit) or always to the latest version. The former allows matching a SHA256 sum and pointing to a "stable" version that is guaranteed to work. But then, if a major change in pandoc requires every individual filter to be changed, a lot of manual updating is needed (though this shouldn't happen very often, if at all; e.g. the last change only required pandocfilters itself to be updated, not the filters written with pandocfilters).

sergiocorreia commented 7 years ago

Great to hear that it works on your side. About your points:

We could potentially include both as dependencies, the overhead is almost nil if we do so.
About the auto-installs, that makes sense and is the natural thing to implement in panflute. Maybe we can have a trusted branch, so it is easy to update with respect to master or other branches?
Another question is whether we want a set of packages/filters.yaml files, or if we want to use folders instead: packages/filters/somefilter.yaml (can we do without the yaml in this case?). Having all filters in one file seems simple, but at the same time means that pandocpm would have to download increasingly larger files, and more importantly, that git commits would all involve one file (with increased risks of merge commits).
About the last point, I'm not sure. It makes sense to have SHA256 checksums, but that would make custom filters harder (e.g. what if the filter involves just cabal install xyz or pip install xyz from ...? We can't do a checksum for that).

ickc commented 7 years ago

I revisited homebrew and homebrew-cask. I think a solution to our problem is already there:

Homebrew calls the .yaml we have a formula. In their case, the formulae live in the central Homebrew repository. Each formula holds the info for one piece of software, much like the panflute-filters/debug.yaml at master · sergiocorreia/panflute-filters. But AFAIK there's no index like packages/filters.yaml at master · pandoc-extras/packages.
The filter author continues to host their filter wherever they like. But if they want the package manager to support it, they need to separately submit a "formula" to the centralized repository (which they would need to do anyway if we used an index file instead). So basically there is only one kind of .yaml to handle, the formula, and no index.
Any user can also submit a formula to our repository; this need not be done by the filter author. E.g. brew-cask allows one to install Microsoft Office 2016 on Mac, and there's almost no way Microsoft staff would submit a formula to them. In our case, we then don't have to worry about the adoption problem, because whether a formula is submitted does not depend on the author's interest, but on whoever is going to benefit from the package manager (the users). At the very least, we can do it ourselves.
About the SHA256 checksum: brew-cask has a mechanism to disable it (as long as the maintainer accepts the pull request). For ours, I think anything installed by other package managers should have it disabled, since I believe they have their own way to handle security/malware. So a SHA256 checksum should and must be used for filters installed by pandocpm directly. A corollary is that a formula shouldn't point to the latest version, but only to a specific version/commit.
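
The checksum rule in the last point could look roughly like the sketch below. It is an assumption-laden illustration, not pandocpm code: the `sha256` and `installer` keys and the function names are hypothetical, and the policy of skipping the check whenever another package manager does the install is taken from the paragraph above.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex digest of the downloaded payload."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data: bytes, formula: dict) -> bool:
    """Return True if the payload may be installed.

    - Formulae delegated to another package manager (hypothetical `installer`
      key) skip the check, since that manager handles integrity itself.
    - Files installed directly must match the pinned `sha256` (hypothetical key);
      a formula without a digest is rejected.
    """
    if formula.get("installer"):
        return True
    expected = formula.get("sha256")
    if not expected:
        return False
    return hmac.compare_digest(sha256_hex(data), expected.lower())
```

Because the digest is tied to exact file contents, this only works when the formula points at a fixed version or commit, which is the corollary stated above.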

This is what I learnt from Homebrew (which has become the package manager for macOS). They have extensive manuals and contribution guidelines. I might read them in more detail later to see what we can learn and borrow. (They definitely need to worry about and process a lot more than we do, and they rely strongly on git and GitHub throughout.)

By the way, they have something called a "tap", essentially a git repository hosting formulae. They have a mechanism to "tap" into a repository unknown to brew. For us, it would effectively let the package manager trust formulae that pandocpm doesn't originally trust. I don't know if that will ever be a problem for us though, because probably the only reason someone needs to create a custom tap is that Homebrew won't accept their formula through a pull request (not stable, deprecated, etc.).

ickc commented 7 years ago

I remember you mentioned panzer before. It seems to already have a mechanism to specify the filters used in YAML. How deep do you think the integration between panzer, panflute (which can also specify filters) and pandocpm (which installs filters) can be?

sergiocorreia commented 7 years ago

I think there should be clearly delimited boundaries between the three. Integration has advantages but the huge disadvantage is complexity (we don't have the manpower required to deal with that).

Now, how can the three interact?

pandocpm can host panzer style files in the same way it hosts filters and templates.
users of panzer could use pandocpm to ensure their filters are installed.
pandocpm makes no difference between panflute and pandocfilters et al, which is a plus.

sergiocorreia commented 7 years ago

Having yamls instead of index files sounds interesting, I'll give that a shot.

We can also ask pandocpm to use other repos, which would work equivalently.

I'm not sure about the complexity of using brew as either back or front end. I would have to read more about it, but my guess is that it probably has a lot of Mac-specific stuff and might not even work on some Linux distros and, of course, Windows.

sergiocorreia commented 7 years ago

Note: out of all the package managers (gems, pip, php, node-npm, bower, cpan, brew), the RubyGems spec seems the most useful: http://guides.rubygems.org/specification-reference/

Thus, the packages repo would have the following structure

/packages/filters/myfilter.yaml
/packages/filters/anotherfilter.yaml
/packages/templates/sometemplate.yaml
/packages/csl/somecsl.yaml
/packages/style/somestyle.yaml

And each .yaml file could have these fields:

version: 1.0.0
license: MIT
summary: 'Some filter'
description: 'long description goes here'
author: xyz # or authors as a list
files: [xyz.py, abc.py] # This would just copy the files to the $datadir/filters folder
url: the url where the files are located
installer: pip # or cabal, etc.; this would run "pip install xyz" instead of copying files
homepage: 'https://github/someone/somerepo'

You could also use other yaml fields for whatever reason
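
As a rough illustration of how pandocpm might check such a formula once it has been parsed into a dict, here is a minimal sketch. Which fields are mandatory was not decided in this thread, so the required set below is purely an assumption, as is the rule that a formula needs either `files` plus `url` or an `installer`.

```python
# Assumed minimum; the thread does not settle which fields are mandatory.
REQUIRED_FIELDS = ("version", "license", "summary")


def validate_formula(formula: dict) -> list:
    """Return a list of human-readable problems (empty means the formula is OK)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS
                if not formula.get(key)]

    # `author` may be given as a single value or as an `authors` list.
    if not (formula.get("author") or formula.get("authors")):
        problems.append("missing field: author (or authors)")

    # Either copy files from a url, or delegate to another installer.
    copies_files = bool(formula.get("files")) and bool(formula.get("url"))
    delegates = bool(formula.get("installer"))
    if not (copies_files or delegates):
        problems.append("need either files + url, or installer")

    return problems
```

Unknown extra fields are deliberately ignored, in line with the remark that other YAML fields can be added freely.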

ickc commented 7 years ago

I'm not sure about the complexity of using brew as either back or front end.

I'm not suggesting using brew. I was merely talking about studying what it does and borrowing ideas. Definitely, if I wrote a formula in brew, it would work on Mac and Linux. But I have never heard of anyone porting it to Windows, and since it relies a lot on Unix commands, I think it probably can't be done.

Another thing: at least for Python filters, maybe a script can be written to parse the setup.py and convert it to YAML. (I think brew has something like that, and there are regular commits on formulae by "robots".)

ickc commented 7 years ago

Do we agree to centralize the YAML formula?

i.e. "YAML for filters" will be in "pandoc-extras/packages", and there will be no "Index of filters".

sergiocorreia commented 7 years ago

Would that involve having to host the filters/templates/etc on the packages repo? (I think linking to packages is easier and more likely to work than hosting the packages or even using git submodules)

ickc commented 7 years ago

Earlier on, I suggested using SHA-256 checksum. But now I think there can be a much simpler approach for us:

Centralize all formulae (except when specified explicitly).
- this ensures the integrity of the formula
- side benefit: a filter author who wants to use pandocpm doesn't have to manage the formula in 2 places (one in their own repo, another in the master index in our repo)
The URLs in the formula need to meet a specific requirement: rather than pointing to a generic URL whose content might change over time, they need to point to a "static target".
- The easiest way to accomplish this is to use the URL of a particular commit, e.g. https://github.com/sergiocorreia/panflute-filters/blob/ba9aa0184bc3d1fd27f2ac4922f17943c2bc9b69/filters/debug.py. The commit hash uses SHA-1, which is not as secure as SHA-256 but should still be quite secure, and it can save us a lot of work.
- This applies only to "simple" formulae. For "pip" formulae, we again just "blindly trust PyPI", which we can't do anything about.
- The biggest drawback is that we can't just point to a branch, since a branch is a moving target. This is not a drawback of this approach specifically, however: any kind of security we want to impose will have this problem. To sort of point to a branch, pin its last known commit, e.g.,
  - https://github.com/sergiocorreia/panflute/tree/1826a16d2a6a691b1e00efbf5e8d305ce948ab33 points to the last known commit of panflute/master
  - https://github.com/sergiocorreia/panflute/tree/9cc5148a69aca4cdd38669934dd4086c143f330f points to the last known commit of panflute/python2
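
The "static target" requirement is easy to check mechanically. The sketch below is one possible check, assuming formulae only link to github.com blob/tree URLs; the function name and the decision to reject everything else are assumptions, not an agreed design.

```python
import re
from urllib.parse import urlparse

# A full git commit id: 40 lowercase hex characters (SHA-1).
_COMMIT_ID = re.compile(r"^[0-9a-f]{40}$")


def is_pinned_github_url(url: str) -> bool:
    """True if the URL points at a specific commit rather than a branch or tag."""
    parts = urlparse(url)
    if parts.netloc != "github.com":
        return False
    segments = [s for s in parts.path.split("/") if s]
    # Expected shape: owner / repo / (blob|tree) / <ref> / optional path...
    if len(segments) < 4 or segments[2] not in ("blob", "tree"):
        return False
    return bool(_COMMIT_ID.match(segments[3]))
```

With this check, the commit URLs listed above pass, while a link through `blob/master/...` is rejected as a moving target.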

Reference on SHA-1 Vulnerability

In fact, in 2012 noted security researcher Bruce Schneier reported the calculations of Intel researcher Jesse Walker, who found that the estimated cost of performing a SHA-1 collision attack will be within the range of organized crime by 2018 and for a university project by 2021. Walker’s estimate suggested then that a SHA-1 collision would cost $2 million in 2012, $700,000 in 2015, $173,000 in 2018 and $43,000 in 2021. From Understanding SHA-1 Vulnerabilities — Is SSL No Longer Secure? - Entrust, Inc..

I guess it wouldn't be too worrying for our applications.

ickc commented 7 years ago

Would that involve having to host the filters/templates/etc on the packages repo? (I think linking to packages is easier and more likely to work than hosting the packages or even using git submodules)

No. Only the YAML formulae will be centrally hosted. In this respect it is very similar to how homebrew-cask hosts formulae.

Sidenote: centralized repo for simple packages

However, I'm considering providing repositories for centralized packages (totally optional), because judging from the pandocfilters/pandoc-templates pull requests, there seems to be a need in this area. Sometimes, for very simple filters/templates, a dedicated repository seems overkill, while a single-file "repository" like a gist might not be up to the job. E.g. one wants to have at least 3 files: the package, the markdown source, and the native output generated from them (for tests).

Regarding native files for tests, they are not standard (e.g. they are not in the pandocfilters/panflute example folders). But I think that if we are to offer such centralized repositories for simple packages, I will require authors to write a simple test, to minimize any extra workload on us.
There will be a standard organization structure to make the tests fully automated.
When tests fail on newer versions of pandocfilters/panflute/pandoc, the original author would be called on to fix them. If no one fixes the package, it would be retired (say, into an archive folder, meaning pandocpm install also wouldn't be able to install it, to prevent problems. A warning message can be added to the corresponding YAML formula in such cases).

ickc commented 7 years ago

Auto-filter

By the way, given the proposed security features, I think that once they are implemented, panflute can be allowed to run pandocpm automatically, so that it just works.

Another related question is, do you think it is possible to implement autofilters for pandocfilters? Apart from the need to rewrite filters to have a main function, are there other problems?

An alternative approach to auto-filter would be, rather than having the auto-filter in panflute and panflute calling pandocpm, maybe the auto-filter can live in pandocpm instead, with pandocpm listing panflute (and possibly pandocfilters) as a dependency. Then pandocpm as a filter can do everything under the hood:

pip install pandocpm
add the filter names in the YAML of the markdown
pandoc -F pandocpm ...

Edit: a way to circumvent the main function problem is to embed the name of the main/action function in the yaml formula.
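
One way this could work: the formula names the entry function, and the auto-filter loads it from the installed file. The sketch below uses only the standard library; the `entry` parameter stands in for a hypothetical formula key, and nothing here reflects existing panflute or pandocpm behaviour.

```python
import importlib.util
from pathlib import Path


def load_action(path, entry="main"):
    """Load a filter script from `path` and return the callable named `entry`.

    `entry` would come from the YAML formula (hypothetical key), so filters
    that lack a main() can still be driven by the auto-filter.
    """
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load filter from {path}")
    module = importlib.util.module_from_spec(spec)
    # Runs the module's top-level code; an `if __name__ == '__main__'` guard
    # in the script stays inactive because the module name is not '__main__'.
    spec.loader.exec_module(module)
    func = getattr(module, entry, None)
    if not callable(func):
        raise AttributeError(f"{path.name} has no callable named {entry!r}")
    return func
```

Note that importing a script this way executes its top-level code, so it should only be done for filters that passed whatever trust check is in place.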

sergiocorreia commented 7 years ago

(Sorry for the late reply, it's been quite busy at the office lately)

Hosting the recipes seems interesting, as well as the test thing. One concern I have with the tests is that it's a lot of work, and might require extra dependencies (a lot of filters rely on external sources). This means authors might not want to write them.

Another related question is, do you think it is possible to implement autofilters for pandocfilters? Apart from the need to rewrite filters to have a main function, are there other problems?

autofilters just calls a python function, so it does not even know if it's calling a panflute or pandocfilter-based filter. So AFAIK just adding the main() function (with the correct arguments and return) should work. A problem though is that pandocfilters don't return anything (they just rely on toJSONFilter that writes to stdout).

ickc commented 7 years ago

Ah, I might not have been very clear. The 2 different kinds of centralizing are completely unrelated. Let's forget the centralized simple filters and tests for a moment (that is a separate project and still relies on the architecture below):

As far as the ability to install packages through pandocpm is concerned, the only centralization I'm proposing is that of the formulae. I.e. the formula is the only YAML file one needs to write, and it will not sit beside the package, but in our centralized formula repository. (I.e. either there is no index, or the index is generated from the individual formulae. Either way, the package author does not need to touch the index.)

sergiocorreia commented 7 years ago

What would be the smallest/simplest formula that we could require for now? (to get started)

ickc commented 7 years ago

What would be the smallest/simplest formula that we could require for now? (to get started)

I haven't thought it through yet, and I haven't read through all of your code, but from what I've read so far:

concerning the index:

Currently in the index, the non-simple kind has a complex structure (because if the package is non-simple, there's no obvious place to put a formula in the non-centralized formula "paradigm"). I think the complex structure should move to the individual formula, now that the formulae are centralized.
This means the url-type would move to the individual formula as well, possibly renamed to, or complemented by, a type that denotes whether it is a simple package (standalone, single file) or requires another package manager (e.g. pip).
Since the formulae are centralized, the url is not needed in the index.

After all these, it means the index is just a list of the names of packages. i.e. we might as well get rid of the index, or the index can be auto-generated, and then the index can simply be a plain text list of names, rather than in yaml.
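
Auto-generating the index from the formula files could be as simple as the following sketch. It assumes a `packages/<kind>/<name>.yaml` layout like the one proposed earlier in this thread; the function names and the plain-text output format are made up for illustration.

```python
from pathlib import Path


def build_index(packages_root):
    """Map each package kind (filters, templates, ...) to its sorted formula names."""
    root = Path(packages_root)
    index = {}
    for kind_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.stem for f in kind_dir.glob("*.yaml"))
        if names:
            index[kind_dir.name] = names
    return index


def render_index(index):
    """Render the index as a plain-text list, one `kind/name` per line."""
    return "\n".join(f"{kind}/{name}"
                     for kind, names in index.items()
                     for name in names)
```

Since the index is derived entirely from the directory contents, package authors never have to edit it by hand.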

concerning the individual formula, in addition to the existing

the complex structure from the index
url-type and/or type as said above
more description of the non-simple filters. Currently, the only non-simple kind is pip. It will be a problem for Unix users, since their OSes ship with Python 2 as the default, and it is considered bad practice to override the default Python version. Hence using pip alone means it will use pip2 for them. So information on the Python version of the package is needed: python2, python3, or universal, corresponding to pip2, pip3, and pip.
- This can be achieved by having a separate key to indicate the Python version, or simply by having the types as simple|pip|pip2|pip3.
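
The second option (types as simple|pip|pip2|pip3) maps directly onto a small dispatch table. This is a sketch under that assumption; the `type` key name, the default of `simple`, and the function name are hypothetical.

```python
# Formula type -> executable that performs the install (assumed mapping).
PIP_BY_TYPE = {"pip": "pip", "pip2": "pip2", "pip3": "pip3"}


def install_command(name, formula):
    """Return the argv list to run for a non-simple formula, or None for simple ones.

    Simple packages are installed by copying files, so no command is needed.
    """
    kind = formula.get("type", "simple")
    if kind == "simple":
        return None
    if kind not in PIP_BY_TYPE:
        raise ValueError(f"unknown formula type: {kind!r}")
    return [PIP_BY_TYPE[kind], "install", name]
```

Returning an argv list rather than a shell string keeps the package name from being interpreted by a shell when the command is eventually run.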

ickc commented 7 years ago

Edit: forget what I said. I see that your earlier proposal on the formula spec already included the license key.

ickc commented 7 years ago

This issue has been split into #3, #5, #6, #7, #8, #9, since it had become overly long and touched on very different topics, which made it hard to follow.

pandoc-extras / pandocpm

How this works #2
