Closed sergiocorreia closed 7 years ago
I finally have some time to test it. It works great. I probably will spend more time on this later this week.
A few random notes:
Eventually, should this one include panflute and pandocfilters as dependencies? The benefits are:
A trusted "list", so that panflute can install a filter automatically if it doesn't find it and the filter is on the trusted "list". That "list" could be just a centralized organization like this one, or a separate yaml file that someone maintains. In the case of a centralized organization, it is optional (authors just have to use this package manager or some other way to install it). For example, that's what TextMate did: most bundles live in the TextMate organization. The difference is that theirs is organized by TextMate's author while ours would be 3rd party. Maybe I should ask for people's opinions on this on pandoc-discuss.
Another question is whether it should point to a fixed version (or a particular commit) or always the latest version. The former approach allows matching a SHA256 sum and pointing to a "stable" version that is guaranteed to work. But then if there's any major change in pandoc requiring all individual filters to be changed, a lot of manual updating is needed (but this shouldn't happen very often, if at all; e.g. the last change only required pandocfilters itself to be updated, not the filters written with pandocfilters).
Great to hear that it works on your side. About your points:
trusted
branch, so it is easy to update wrt master or other branches?
packages/filters.yaml files, or if we want to use folders instead: packages/filters/somefilter.yaml (can we do without the yaml in this case?). Having all filters in one file seems simple, but at the same time means that pandocpm would have to download increasingly larger files and, more importantly, that git commits would all involve one file (with increased risk of merge conflicts).
cabal install xyz or pip install xyz from ... . We can't do a checksum for that)
I revisited homebrew and homebrew-cask. I think a solution to our problem is already there:
the .yaml we have, homebrew calls that a formula. In their case, the formulae are in the central homebrew repository. Each formula holds info for 1 piece of software, much like panflute-filters/debug.yaml at master · sergiocorreia/panflute-filters. But AFAIK there's no index like packages/filters.yaml at master · pandoc-extras/packages.
The filter authors continue to host their filters wherever they like. But if they want the package manager to support them, they need to separately submit a "formula" to the centralized repository (which they need to do anyway if we use an index file instead). So basically there can be only 1 kind of .yaml to handle, the formula, and no index.
Any user can also submit a formula to our repository; this need not be done by the filter author. E.g. brew-cask allows one to install Microsoft Office 2016 on Mac, and there's almost no way Microsoft staff would submit a formula to them. In our case, we then don't have to worry about the adoption problem, because whether a formula is submitted does not depend on the author's interest, but on whoever is going to benefit from the package manager (users). Well, at least it can be done by ourselves.
About the SHA256 checksum: brew-cask has a mechanism to disable it (as long as the maintainer accepts the pull request). For ours, I think anything installed by other package managers should have this disabled, since I believe they have their own way to handle security/malware. So a SHA256 checksum should, and must, be used for filters installed by pandocpm directly. A corollary is that a formula shouldn't point to the latest version, but only to a certain version/commit.
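A minimal sketch of the checksum check discussed above, using only Python's standard library. The function name `verify_sha256` and the idea that the expected digest comes from a field in the formula are assumptions for illustration, not part of pandocpm.

```python
import hashlib
import hmac


def verify_sha256(payload: bytes, expected_hex: str) -> bool:
    """Return True if payload hashes to the digest pinned in the formula."""
    actual = hashlib.sha256(payload).hexdigest()
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(actual, expected_hex.strip().lower())


# Stand-in for a downloaded filter file and the digest a formula would pin.
downloaded = b"print('hello from a filter')\n"
pinned = hashlib.sha256(downloaded).hexdigest()

print(verify_sha256(downloaded, pinned))                  # content matches the pin
print(verify_sha256(downloaded + b"# tampered", pinned))  # content changed after pinning
```

A check like this only makes sense when the formula points at a fixed version, since any update to the file changes the digest.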
This is what I learnt from homebrew (which has become the package manager for macOS). They have extensive manuals and contribution guidelines. I might read them in more detail later to see what to learn and borrow. (They definitely need to worry about & process a lot more than we do. And they rely strongly on git and GitHub throughout.)
By the way, they have something called a "tap", essentially a git repository hosting formulae. They have a mechanism to "tap" into a repository unknown to brew. For us, it effectively means letting the package manager trust other formulae that pandocpm originally doesn't trust. I don't know if it will ever be a problem for us though, because probably the only reason someone needs to create a custom tap is that homebrew doesn't accept their formula from a pull request (not stable, deprecated, etc.).
I remember you mentioned panzer before. It seems to have a mechanism to specify the filters used in yaml already. How deep do you think the integration between panzer, panflute (which can also specify filters) and pandocpm (which installs filters) can be?
I think there should be clearly delimited boundaries between the three. Integration has advantages but the huge disadvantage is complexity (we don't have the manpower required to deal with that).
Now, how can the three interact?
Having yamls instead of index files sounds interesting, I'll give that a shot.
We can also ask pandocpm to use other repos, which would work equivalently.
I'm not sure about the complexity of using brew as either back or front end. I would have to read more about it, but my guess is that it probably has a lot of Mac-specific stuff and might not even work on some Linux distros and, of course, Windows.
note: out of all the package managers (gems, pip, php, node-npm, bower, cpan, brew), the spec of ruby gems seems the most useful: http://guides.rubygems.org/specification-reference/
Thus, the packages repo would have the following structure
/packages/filters/myfilter.yaml
/packages/filters/anotherfilter.yaml
/packages/templates/sometemplate.yaml
/packages/csl/somecsl.yaml
/packages/style/somestyle.yaml
And each .yaml file could have these fields:
version: 1.0.0
license: MIT
summary: 'Some filter'
description: 'long description goes here'
author: xyz # or authors as a list
files: [xyz.py, abc.py] # This would just copy the files to the $datadir/filters folder
url: the url where the files are located
installer: pip # or cabal, etc.; this would run "pip install xyz" instead of copying files
homepage: 'https://github.com/someone/somerepo'
You could also use other yaml fields for whatever reason
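To make the field list concrete, here is a rough sketch of how a parsed formula could be sanity-checked. The dict stands for the result of loading the YAML (e.g. with PyYAML's `yaml.safe_load`). Which fields are required, the rule that `files`+`url` and `installer` are mutually exclusive, and the `example.org` URL are all assumptions made for this example.

```python
# Fields assumed to be mandatory for this sketch.
REQUIRED = ("version", "license", "summary")


def validate_formula(formula: dict) -> list:
    """Return a list of problems; an empty list means the formula looks usable."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in formula]
    has_files = bool(formula.get("files")) and bool(formula.get("url"))
    has_installer = bool(formula.get("installer"))
    # Either copy files from a url, or delegate to another installer.
    if has_files == has_installer:
        problems.append("need either files+url or installer, but not both")
    return problems


good = {
    "version": "1.0.0",
    "license": "MIT",
    "summary": "Some filter",
    "files": ["xyz.py", "abc.py"],
    "url": "https://example.org/filters/",  # placeholder location
}

print(validate_formula(good))
print(validate_formula({"version": "1.0.0", "summary": "x", "installer": "pip"}))
```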
I'm not sure about the complexity of using brew as either back or front end.
I'm not suggesting using brew. I was merely talking about studying what it does and borrowing ideas. Certainly, if I wrote a formula in brew it would work on Mac and Linux. But I've never heard of people porting it to Windows, and since it relies a lot on Unix commands, I think it probably can't be done.
Another thing: at least for Python filters, maybe a script could be written to parse setup.py and convert it to yaml. (I think brew has something like that, and there are regular commits on formulae by "robots".)
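One possible way to do the setup.py conversion without executing the file is to read the `setup()` call with the standard `ast` module. This is only a sketch: `formula_from_setup` and the mapping from setup keywords to formula fields are assumptions, and values that are computed rather than written as literals are simply skipped.

```python
import ast

# Assumed mapping from setup() keywords to formula fields.
FIELD_MAP = {
    "version": "version",
    "license": "license",
    "description": "summary",
    "author": "author",
    "url": "homepage",
}


def formula_from_setup(source: str) -> dict:
    """Collect literal keyword arguments of the setup() call without running it."""
    formula = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for kw in node.keywords:
            if kw.arg in FIELD_MAP:
                try:
                    formula[FIELD_MAP[kw.arg]] = ast.literal_eval(kw.value)
                except ValueError:
                    pass  # computed value (a variable, a function call); leave it out
    return formula


sample = '''
from setuptools import setup
setup(
    name="myfilter",
    version="1.0.0",
    license="MIT",
    description="Some filter",
    author="xyz",
)
'''

print(formula_from_setup(sample))
```

The resulting dict could then be dumped to YAML and reviewed by hand before being submitted as a formula.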
Do we agree to centralize the YAML formula?
i.e. "YAML for filters" will be in "pandoc-extras/packages", and there will be no "Index of filters".
Would that involve having to host the filters/templates/etc on the packages repo? (I think linking to packages is easier and more likely to work than hosting the packages or even using git submodules)
Earlier on, I suggested using a SHA-256 checksum. But now I think there can be a much simpler approach for us:
centralizing all formulae (except when specified explicitly).
this ensures the integrity of the formulae
side benefit: any filter author wanting to use pandocpm doesn't have to manage the formula in 2 places (1 in their own repo, another in the master index in our repo)
The urls in the formula need to meet a specific requirement: rather than pointing to a generic url whose content might change over time, one needs to point specifically to a "static target".
The easiest way to accomplish this is to use the url of a particular commit, e.g. https://github.com/sergiocorreia/panflute-filters/blob/ba9aa0184bc3d1fd27f2ac4922f17943c2bc9b69/filters/debug.py. The commit hash uses SHA-1, which, while not as secure as SHA-256, should still be quite secure, and this can save us a lot of work.
This applies only to "simple" formulae. For "pip" formulae, we again just "blindly trust PyPI", which we can't do anything about.
The biggest drawback is that there's no way we can just point to a branch, since a branch is a moving target. It is not a drawback specific to this approach, however: any kind of security we want to impose will have this problem. To sort of point to a branch, one can pin its last known commit, e.g.,
https://github.com/sergiocorreia/panflute/tree/1826a16d2a6a691b1e00efbf5e8d305ce948ab33 points to the last known commit of panflute/master
https://github.com/sergiocorreia/panflute/tree/9cc5148a69aca4cdd38669934dd4086c143f330f points to the last known commit of panflute/python2
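The "static target" requirement could be enforced mechanically when a formula is submitted. The sketch below assumes GitHub-style URLs of the form shown above and treats a url as pinned only if the ref is a full 40-character hex commit hash; `is_pinned` is a hypothetical helper, not an existing pandocpm function.

```python
import re

# Matches .../blob/<ref>/... or .../tree/<ref>, where <ref> is a full commit hash.
PINNED = re.compile(
    r"^https://github\.com/[^/]+/[^/]+/(?:blob|tree)/[0-9a-f]{40}(?:/|$)"
)


def is_pinned(url: str) -> bool:
    """Return True if the url points at a specific commit rather than a branch."""
    return PINNED.match(url) is not None


urls = [
    "https://github.com/sergiocorreia/panflute-filters/blob/ba9aa0184bc3d1fd27f2ac4922f17943c2bc9b69/filters/debug.py",
    "https://github.com/sergiocorreia/panflute/tree/1826a16d2a6a691b1e00efbf5e8d305ce948ab33",
    "https://github.com/sergiocorreia/panflute-filters/blob/master/filters/debug.yaml",
]
for url in urls:
    print(is_pinned(url), url)
```

A branch that happens to be named with 40 hex characters would fool this check, so it is a convenience for reviewers rather than a security guarantee.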
In fact, in 2012 noted security researcher Bruce Schneier reported the calculations of Intel researcher Jesse Walker, who found that the estimated cost of performing a SHA-1 collision attack will be within the range of organized crime by 2018 and for a university project by 2021. Walker’s estimate suggested then that a SHA-1 collision would cost $2 million in 2012, $700,000 in 2015, $173,000 in 2018 and $43,000 in 2021. From Understanding SHA-1 Vulnerabilities — Is SSL No Longer Secure? - Entrust, Inc..
I guess it wouldn't be too worrying for our applications.
Would that involve having to host the filters/templates/etc on the packages repo? (I think linking to packages is easier and more likely to work than hosting the packages or even using git submodules)
No. Only the YAML formulae will be centrally hosted. In this respect it is very similar to how homebrew-cask hosts formulae.
However, I'm considering providing repositories for centralized packages (totally optional), because from the pandocfilters/pandoc-templates pull requests I see that there seems to be a need in this area. Sometimes, for very simple filters/templates, a dedicated repository seems overkill, while a single-file "repository" like a gist might not be up to the job; e.g. one wants to have at least 3 files: the package, the markdown source, and the native output generated from it (for tests).
Regarding native files for tests, they are not standard (e.g. they are not in the pandocfilters/panflute example folders). But I think if we are to offer such centralized repositories for simple packages, I will require authors to write a simple test, to minimize any extra workload on us.
there will be a standard organization structure to make the tests fully automated.
when tests fail on newer versions of pandocfilters/panflute/pandoc, the original author would be called on to fix them. If no one fixes a package, it would be retired (say, moved to an archive folder, meaning pandocpm install also wouldn't be able to install it, to prevent problems. A warning message can be added to the corresponding YAML formula for such cases.).
By the way, because of the proposed security features, I think once they are implemented, panflute can be allowed to run pandocpm automatically, to make it just work.
Another related question: do you think it is possible to implement autofilters for pandocfilters? Apart from the need to rewrite filters to have a main function, are there other problems?
An alternative approach to auto-filter would be: rather than having an auto-filter in panflute and panflute calling pandocpm, maybe the auto-filter can live in pandocpm instead, with pandocpm listing panflute (and possibly pandocfilters) as a dependency. Then pandocpm, as a filter, can do everything under the hood:
pip install pandocpm
pandoc -F pandocpm ...
Edit: a way to circumvent the main function problem is to embed the name of the main/action function in the yaml formula.
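A small sketch of that idea: the caller looks up whatever function the formula names, and falls back to `main` when nothing is specified. The `entry` key and the `resolve_entry` helper are hypothetical names chosen for this example.

```python
import types


def resolve_entry(module, formula: dict):
    """Look up the function named by the formula's (hypothetical) 'entry' key."""
    name = formula.get("entry", "main")
    func = getattr(module, name, None)
    if not callable(func):
        raise AttributeError(f"{module.__name__} has no callable {name!r}")
    return func


# Stand-in for an imported filter module that has no main() function.
legacy = types.ModuleType("legacy_filter")
legacy.caps = lambda doc=None: "ran caps"

print(resolve_entry(legacy, {"entry": "caps"})())
```

With this, an existing filter would not need to be rewritten, as long as it exposes some function with a compatible signature.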
(Sorry for the late reply, it's been quite busy at the office lately)
Hosting the recipes seems interesting, as well as the test thing. One concern I have with the tests is that it's a lot of work, and might require extra dependencies (a lot of filters rely on external sources). This means authors might not want to write them.
Another related question: do you think it is possible to implement autofilters for pandocfilters? Apart from the need to rewrite filters to have a main function, are there other problems?
autofilters just calls a Python function, so it does not even know whether it's calling a panflute-based or a pandocfilters-based filter. So AFAIK just adding the main() function (with the correct arguments and return value) should work. A problem though is that pandocfilters filters don't return anything (they just rely on toJSONFilter, which writes to stdout).
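To illustrate why the caller does not need to know which library a filter uses, here is a rough sketch of loading a filter file and calling its `main(doc)` hook. It is not panflute's actual implementation: `run_filter_file` is a hypothetical helper, and the stand-in filter avoids importing panflute so the example runs with the standard library alone.

```python
import importlib.util
import tempfile
from pathlib import Path


def run_filter_file(path, doc=None):
    """Import a filter from a file path and call its main(doc) hook."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # The caller only relies on main() accepting doc and returning the result.
    return module.main(doc)


# Stand-in filter with the main(doc=None) shape.
FILTER_SOURCE = '''
def main(doc=None):
    # a real panflute filter would return pf.run_filter(action, doc=doc)
    return doc
'''

with tempfile.TemporaryDirectory() as tmp:
    filter_path = Path(tmp) / "somefilter.py"
    filter_path.write_text(FILTER_SOURCE)
    result = run_filter_file(filter_path, doc={"blocks": []})

print(result)
```

A filter whose entry point writes to stdout and returns nothing would yield `None` here, which is the problem noted above for pandocfilters-based filters.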
Ah, I might not have been very clear. The 2 different kinds of centralization are completely unrelated. Let's forget the centralized simple filters and tests for a moment (which is a separate project and still relies on the architecture below):
As far as the ability to install packages through pandocpm is concerned, the only centralization I'm proposing is of the formulae. I.e. the formula is the only yaml one needs to write, and it will live not beside the package but in our centralized formula repository (i.e. either no index, or an index generated from the individual formulae; either way the package author need not touch the index).
What would be the smallest/simplest formula that we could require for now? (to get started)
What would be the smallest/simplest formula that we could require for now? (to get started)
I haven't thought it through yet. I haven't read through all your code yet, but from what I've read so far,
concerning the index:
Currently in the index, the non-simple kind has a complex structure (because if the package is non-simple, there's no obvious place to put a formula in the non-centralized formula "paradigm"). I think the complex structure should move to the individual formula, now that the formulae are centralized.
it means the url-type would be moved to the individual formula as well, possibly renamed to (or supplemented by) a type to denote whether it is a simple package (standalone, single file) or requires another package manager (e.g. pip)
since the formulae are centralized, url is not needed in the index.
After all this, the index is just a list of package names. I.e. we might as well get rid of the index, or the index can be auto-generated, in which case it can simply be a plain-text list of names rather than yaml.
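An auto-generated index could be as simple as listing the formula files. The sketch below assumes the one-folder-per-kind layout proposed earlier in the thread (`filters/`, `csl/`, and so on, each holding one `.yaml` per package); `build_index` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def build_index(root) -> list:
    """List 'kind/name' for every formula file under root, sorted for stable diffs."""
    root = Path(root)
    return sorted(
        f"{path.parent.name}/{path.stem}" for path in root.glob("*/*.yaml")
    )


# Build a throwaway packages tree to demonstrate the output.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("filters/myfilter.yaml", "filters/anotherfilter.yaml", "csl/somecsl.yaml"):
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("version: 1.0.0\n")
    index = build_index(tmp)

print("\n".join(index))
```

Because the index is derived from the folder contents, package authors never edit it by hand.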
concerning the individual formula, in addition to the existing
the complex structure from the index
url-type and/or type as said above
more description on the non-simple filter. Currently, the only non-simple one is pip. It will be a problem for Unix users since the OSes ship with Python 2 as the default, and it is considered bad practice to override the default Python version. Hence using pip alone means it will use pip2 for them. So info on the Python version of the package is needed: python2, python3, or universal, corresponding to pip2, pip3, and pip.
simple|pip|pip2|pip3.
Edit: forget what I said. I see that your earlier proposal on the formula spec already included the license key.
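A sketch of how a `type` value of simple|pip|pip2|pip3 could be turned into an action. The function name, the default of `simple`, and the convention of returning `None` for packages that are just copied are assumptions for illustration.

```python
def install_command(name: str, formula: dict):
    """Return the command to run, or None when the files are simply copied."""
    kind = formula.get("type", "simple")
    if kind == "simple":
        return None  # standalone file(s): copy them into the filters folder instead
    if kind in ("pip", "pip2", "pip3"):
        # The type doubles as the executable name, so python2/python3 packages
        # go through pip2/pip3 and universal ones through plain pip.
        return [kind, "install", name]
    raise ValueError(f"unknown formula type: {kind!r}")


print(install_command("xyz", {"type": "pip3"}))
print(install_command("debug", {"type": "simple"}))
```

The returned list could be passed to `subprocess.run` by the caller, which keeps the decision separate from the side effect.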
This issue is split into #3, #5, #6, #7, #8, #9, since it is overly long and touches on very different topics, which makes it hard to follow.
Filters
First, each filter needs a specific structure, as seen here and here
The key thing is the main() hook:
Every panflute script needs to end up with this:
Or a variant of this, but always i) with a main() function, ii) that receives an optional argument doc, which is sent to pf.run_filter (or any of run_filters, toJSONFilter, toJSONFilters), and iii) returns the output of the call
YAML for filters
Optionally, filters have an accompanying YAML file, as here: https://github.com/sergiocorreia/panflute-filters/blob/master/filters/debug.yaml
The metadata shown in the example is currently overkill, but ideally it should be used to construct a gallery of filters, search for specific filters, update them when a new version appears, etc.
Index of filters
It's a simple YAML file that points to the yaml (or .py) files: https://github.com/pandoc-extras/packages/blob/master/filters.yaml
Everything is easy to extend to things besides filters (in this case, just have a separate yaml file)
panflute autofilters
You can have metadata in the form of panflute-filters: somefilter or panflute-filters: [filter1, filter2]. Additionally, you can have panflute-verbose: true and panflute-path: somepath entries.
Panflute will search in the current dir, or datapath, or the path indicated in the metadata, or $PATH, for the filter, and if found, run it.
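The search order described above could be sketched roughly as follows. This is not panflute's actual code: `find_filter`, its parameters, and the choice to also try the name with a `.py` suffix are assumptions for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_filter(name, datadir=None, metadata_path=None, env_path=None):
    """Return the first matching file: cwd, datadir, metadata path, then $PATH."""
    if env_path is None:
        env_path = os.environ.get("PATH", "")
    dirs = [Path.cwd()]
    dirs += [Path(d) for d in (datadir, metadata_path) if d]
    dirs += [Path(d) for d in env_path.split(os.pathsep) if d]
    for directory in dirs:
        for candidate in (directory / name, directory / f"{name}.py"):
            if candidate.is_file():
                return candidate
    return None


# Demonstrate with a throwaway folder standing in for panflute-path.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "somefilter.py").write_text("def main(doc=None):\n    return doc\n")
    found = find_filter("somefilter", metadata_path=tmp, env_path="")
    found_name = found.name if found else None

print(found_name)
```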
Note: this is currently not integrated with pandocpm, so no auto-installs will occur
Downloading with pandocpm
After installation, type pandocpm --help
As an example, these are some common patterns:
You can also set specific folders to install, or alternative indexes
Pending work
Edit: the checklist bubble is removed and migrated to #3.