Open salim-b opened 4 years ago
Even if pandoc allowed this, I wouldn't recommend doing this sort of thing, because of the security implications. If the upstream filter gets updated in a malicious or unintentionally destructive way, you'd be vulnerable. Remember, unlike a template, a lua filter can do virtually anything -- it could, for example, read your passwords and upload them to an external URL.
If someone wants to do this, they can point it to the URL of a particular commit, which makes it much more secure.
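To illustrate the commit-pinning idea, here is a minimal sketch of a check that a filter URL refers to a full commit hash rather than a branch. It assumes the `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>` URL layout; the function name `is_commit_pinned` is hypothetical and not part of pandoc. A 40-character hex ref is treated as a commit, which is a heuristic only.

```python
import re
from urllib.parse import urlparse

# A full Git commit hash is 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def is_commit_pinned(url: str) -> bool:
    """Return True if `url` looks like an HTTPS raw-file URL pinned to a commit.

    Hypothetical helper: assumes the layout
    https://raw.githubusercontent.com/<owner>/<repo>/<ref>/<path...>
    """
    parts = urlparse(url)
    if parts.scheme != "https" or parts.netloc != "raw.githubusercontent.com":
        return False
    segments = parts.path.strip("/").split("/")
    # Need at least owner, repo, ref, and one path component.
    if len(segments) < 4:
        return False
    return _FULL_SHA.fullmatch(segments[2]) is not None
```

A URL whose ref is a branch name such as `master` is rejected, since its contents can change after it has been audited.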
Yes, that's true, and for that reason we might want to consider allowing URLs for filters.
@tarleb If we did want to allow this, we'd probably want to use `fetchItem` in `applyFilters`, and then deliver the contents of the filter, rather than the file path, to the `apply` functions for JSON and Lua filters respectively. We'd also need to check to see if the file (if it is a file) is executable, since `Filters.JSON.apply` needs to know that. I'd be curious about your thoughts on this.
> I wouldn't recommend doing this sort of thing, because of the security implications.
Again if you allow this for templates there is no reason not to allow it for filters. If you think it's a security issue that should be blocked for filters you should block it for templates as well. The code path to exploit this is a bit more convoluted in the case of templates but it is quite doable.
Personally I suggest allowing both but only with an additional `--unsafe-url-fetch` flag to clue people in that they are responsible for their own actions. Without the flag only allow local files, with the flag allow URLs in both places.
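The opt-in behaviour suggested here could look roughly like the sketch below. Note that `--unsafe-url-fetch` is only a proposal from this thread, not an existing pandoc option, and `resolve_source` is a hypothetical name used for illustration.

```python
from typing import Tuple
from urllib.parse import urlparse


def resolve_source(source: str, unsafe_url_fetch: bool = False) -> Tuple[str, str]:
    """Classify a template/filter argument as a local file or a URL.

    Hypothetical sketch: URLs are refused unless the caller has opted in,
    mirroring the proposed (not existing) --unsafe-url-fetch flag.
    """
    scheme = urlparse(source).scheme.lower()
    if scheme in ("http", "https"):
        if not unsafe_url_fetch:
            raise PermissionError(
                "refusing to fetch " + source + " without --unsafe-url-fetch"
            )
        return ("url", source)
    # Anything else (including Windows drive letters) is treated as a local path.
    return ("file", source)
```

Applying the same check to both templates and filters keeps the two cases consistent, which is the point being argued.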
> Again if you allow this for templates there is no reason not to allow it for filters
But I gave a reason above for distinguishing the two cases. Filters can launch missiles. Templates can't. The mischief you can do with a malicious template is quite limited.
> Filters can launch missiles. Templates can't.
Can't they? I believe templates for some formats can. The limitation is only relevant to some output formats.
I think that's an edge case; LaTeX templates can launch missiles, but only if `--pdf-engine-opt=-shell-escape` is set. But this is possible even from within a plain Markdown document (`\directlua{os.execute 'launch missiles'}`{=latex}). Do templates allow more?
For other formats, the problems start when the output file is opened by another program (macros in a Word file could be a possible example). I'd argue that this is a security problem of those programs, not pandoc.
I gave it more thought but am not enthusiastic about the idea. My gut feeling is that this should be solved via a wrapper or maybe a filter. Some points which led me to that assessment:
- The `tls` package, a Haskell implementation of the TLS protocol: I don't think the library has ever been audited. While memory safety problems can probably be ruled out, I'm slightly worried about timing attacks. (But I'd admittedly be very surprised to learn that this would be exploitable in the case of pandoc.)
- The difference between `http` and `https` suddenly becomes very important, as the former makes man-in-the-middle attacks all but trivial.

It's not that I'm fundamentally opposed to the proposal, and I do think it would add value. It is just that I'd much rather we'd use LuaRocks or another package manager to keep filters up to date.
This also makes me think that maybe the pandoc ecosystem could use a more defined concept of "package". Oftentimes, filters, defaults, and templates are bundled together and intended to be used in combination. So maybe what we really need are pandoc packages?
I completely agree with @tarleb. This is the territory of a pandoc manager and if pandoc does that I think it is trying to be too smart.
I hope we can have more discussion (either here or separately in pandoc-discuss) about a pandoc package manager. We started talking about this, and there was even a prototype, pandocpm, lying somewhere in our pandoc-extras organization. But the problem I found is that the idea of pandocpm is still trying to be too smart, and managing executables can be tricky (in terms of potential security issues).
Even the auto-fetching of templates is a pandoc package manager problem (which is part of what is considered in pandocpm too). In the end, one wants to have reproducibility in authoring pandoc documents, and right now that is very hard for a number of reasons, including not having a pandoc package manager (or an index, like those in TeX).
There are at least 2 options:
Among data scientists conda is very common, and it can be useful here. As a start, it already has pandoc, pandocfilters, and panflute packaged. conda is a real package manager, so it can manage the dependencies (in principle the panflute package can and should require pandoc<2.10; the machinery is there, but the maintainer of the "formula" in conda-forge hasn't done that). And importantly, conda is cross-platform: Windows, macOS, Linux; x86, x64, aarch64, and even other architectures are supported (more than the platforms supported by pandoc), making it a very good candidate for a cross-platform pandoc package manager.
The Lua filter system has taken off and is the most "native" way of implementing pandoc filters, so if we can take advantage of a package manager there, it would be the most natural option and more self-contained than the above solution. Lua experts can say more about this!
On a tangent to this topic, it would be nice to have something like `\usepackage` in pandoc, perhaps in a metadata field, to make the document more self-contained.
> Oftentimes, filters, defaults, and templates are bundled together and intended to be used in combination. So maybe what we really need are pandoc packages?
How far does `--data-dir` get you in that direction? What if we adopted the convention that if the user data directory contains a file `defaults.yaml`, it is automatically used as a defaults file? Then a directory containing an appropriate `defaults.yaml`, templates, and filters could serve as a "package" -- one would only need to point to it using `--data-dir`. Alternatively, we could introduce `--package`, which works like `--data-dir` but triggers the implicit defaults file as described above. (It could, more controversially, be made to search a canonical local package repository, and download the package from a blessed remote repository if it is not found there.)
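As a rough illustration of the convention described above, a wrapper could translate a package directory into existing pandoc options. This is a sketch under assumptions: `--package` does not exist, the helper name `package_args` is hypothetical, and only the existing `--data-dir` and `--defaults` options are emitted.

```python
from pathlib import Path
from typing import List


def package_args(package_dir: str) -> List[str]:
    """Build pandoc arguments for a directory used as a "package".

    Hypothetical wrapper logic: always point --data-dir at the directory,
    and add --defaults only if the directory contains defaults.yaml.
    """
    root = Path(package_dir)
    args = ["--data-dir=" + str(root)]
    defaults = root / "defaults.yaml"
    if defaults.is_file():
        args.append("--defaults=" + str(defaults))
    return args
```

The resulting list could be prepended to a normal pandoc command line, so the implicit defaults file behaves as if the user had passed it explicitly.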
Wow, an option like `--package` sounds great! Would it make sense to also accept zip archives as an alternative to directory paths?
A good resource on package updating is The Update Framework. Haskell's `cabal` program uses it as its security model, as does `conda`. Maybe it would be a good GSoC project to explore this further in the context of pandoc?
I like the idea of using `conda`, but that would leave out (or require extra steps from) R Markdown users. Packaging for both R and conda would probably serve 95% of all pandoc users. That's not bad! Probably good to leave the details to the respective communities?
A (consistent) `--data-dir` or `--package` would be great :)
As of now, I have to use `--resource-path` for certain resources and also `--data-dir` for other resources. So, from a user perspective it would be desirable to have just the one option to specify a folder containing `filters/`, `resources/`, `defaults/`, ... subdirs.
I think supporting zipped directories with a potential `--package` option makes a lot of sense... and is not hard.
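To give a sense of why zip support is not hard, here is a minimal sketch of accepting either a directory or a zip archive as a package. The function name `open_package` and the temporary-directory prefix are hypothetical; nothing here reflects how pandoc itself would implement it.

```python
import tempfile
import zipfile
from pathlib import Path


def open_package(source: str) -> Path:
    """Return a directory for a package given as a directory or a zip archive.

    Hypothetical sketch: directories are used in place, zip archives are
    unpacked into a fresh temporary directory that the caller must clean up.
    """
    path = Path(source)
    if path.is_dir():
        return path
    if zipfile.is_zipfile(path):
        target = Path(tempfile.mkdtemp(prefix="pandoc-package-"))
        with zipfile.ZipFile(path) as archive:
            # extractall() drops absolute paths and ".." components from member names.
            archive.extractall(target)
        return target
    raise ValueError(str(source) + " is neither a directory nor a zip archive")
```

The returned directory can then be treated exactly like an unzipped package directory.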
I have never used conda, but it purports to be multiplatform and language-agnostic, so maybe it would work. However, getting into the business of providing remote packages opens a lot of cans of worms. Do we take responsibility for auditing the packages in the repository so they don't include vectors for malware (remember, filters can do anything)? If so, that's potentially a lot of work and a lot of responsibility. If not, the proposed feature might end up being a vector for bad stuff. Rather than providing a central repository, one might just go one step beyond what we have with filters: you can check out someone's package repository as a directory, audit it yourself, and use it at your own risk.
In any case, one key missing piece is a way to address file paths relative to the package directory in defaults files and perhaps YAML metadata (see #5982, #5977). (E.g., you might include in your package a logo which gets referred to in a template; the template wouldn't need to refer to it directly if it could be set in a variable in a defaults file, but the defaults file would have to be able to specify the path relative to the package directory.)
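The missing piece could be sketched as follows: values for selected keys in a defaults file are interpreted relative to the package directory. This is only an illustration; `resolve_package_paths` and the choice of which keys hold paths are assumptions, and POSIX-style paths are used to keep the example deterministic.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable


def resolve_package_paths(
    settings: Dict[str, str], package_dir: str, path_keys: Iterable[str]
) -> Dict[str, str]:
    """Resolve relative paths for `path_keys` against `package_dir`.

    Hypothetical helper: keys not listed in `path_keys` and values that are
    already absolute are returned unchanged.
    """
    root = PurePosixPath(package_dir)
    wanted = set(path_keys)
    resolved = {}
    for key, value in settings.items():
        if key in wanted and not PurePosixPath(value).is_absolute():
            resolved[key] = str(root / value)
        else:
            resolved[key] = value
    return resolved
```

With this, a defaults file in the package could set a `logo` variable to `img/logo.png`, and the template would receive the path inside the package regardless of the current working directory.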
> I like the idea of using `conda`, but that would leave out (or require extra steps from) R Markdown users.
R packages can be managed using conda: https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/ . But I don't know if a typical R programmer is going to use it that way. Also, R Markdown, while it uses pandoc, is kind of its own thing, so even if pandoc tries to support it I don't know if they would be on board with this "vanilla pandoc" ecosystem (they already have their own package manager, for example).
I wonder if we are overthinking the responsibility of auditing the code. It should be something like CTAN or PyPI. In principle anyone can upload any code there, and malware has existed in the past, at least on PyPI. It would be unreasonable to ask them to audit every piece of code before letting maintainers release it. And neither should the "pandoc packaging index" do that.
In the past we tried to build a 3rd-party filter repo, but it is hard without pandoc's official blessing. But basically this is what is done in pandoc/lua-filters: a repo of pandoc filters with user contributions.
Below are 2 different directions...
There are 2 examples of how we could expand something like the lua-filters repo into something bigger while still relying on voluntary contributions: homebrew and conda-forge. The main difference is that in homebrew all "formulae" (recipes to obtain a package) live in one single monolithic repo, whereas conda-forge keeps each "feedstock" (recipe) in its own separate repo. I think conda-forge's feedstock model makes more sense for us, something like:
Conda, on the one hand, seems overkill. I think the momentum needed to kick-start it would be much larger. It is a hassle for most of the simple things used in pandoc, such as a template or a single-file, self-contained filter.
But there is another kind of filter: complex, multi-module, importing multiple 3rd-party libraries. A pandoc package manager cannot effectively manage this. With 3rd-party dependencies (such as pandas, XlsxWriter, etc.), which might have their own dependencies, it is next to impossible to manage in a homegrown pandoc package manager. But it is exactly the kind of application conda is built for.
Conda can manage any dependencies, not only Python ones. In fact, that is the reason it exists: pip was not good enough, and Guido told them they should build their own. Many code bases in scientific computing have FORTRAN, C, C++, etc. dependencies; conda was built for those cases and can therefore handle those dependencies too. R, Julia, etc. can also be installed with conda. However, it doesn't have a Haskell-related toolchain yet.
After writing this, I think a home-grown pandoc package manager together with an index makes the most sense in most cases. A fallback would be conda for the more complicated situations. It would be great if it had some sort of official blessing (such as letting these external conda packages appear in the pandoc package index, where there's only metadata directing people to install them using conda).
To address the relative path issues, etc., maybe the `--package` option shouldn't take an argument, but require the user to `cd` into the directory first? Similar to how you have to cd into a project dir with most package managers before you can add dependencies to that project.
(Speaking of which, I guess nobody here has used nix extensively..? From what I hear, the learning curve is quite steep, but truly reproducible builds are a wonderful thing..)
So I imagine this somehow like `cd mydir; pandoc --package`, which would then also sandbox pandoc to only access files in that dir or its subdirs (similar to #5045). Not sure there would be any way to invoke LaTeX or filters then though...? Or are we better off leaving that kind of functionality to Docker?
I think conda can probably do what nix does. E.g., conda is reproducible in that builds on conda-forge use the compilers provided there, not the OS's (but again, it doesn't have Haskell).
Nix may have a problem with installing to a user-specified path. I heard that the read-only root in macOS Catalina causes problems for nix because of that.
Conda has an interesting approach to solving that, which guarantees that wherever the prefix is, it will run correctly.
In terms of learning curve, I don't know which is harder, but conda is not easy except for the simplest case (of PyPI-compatible packages).
This is a small feature request: I think it would be very useful if Pandoc's `--lua-filter` argument directly supported URLs pointing to `.lua` filters the same way as `--template` does.
So one could call Pandoc like this
instead of first having to manually (re-)download `include-files.lua` and calling
This would ensure that the latest version of a filter available at a specific URL is always used.