jgm / pandoc

Universal markup converter
https://pandoc.org

Direct URL support for `--lua-filter` #6760

Open salim-b opened 4 years ago

salim-b commented 4 years ago

This is a small feature request: I think it would be very useful if Pandoc's --lua-filter argument directly supported URLs pointing to .lua filters the same way as --template does.

So one could call Pandoc like this

pandoc --lua-filter=https://raw.githubusercontent.com/pandoc/lua-filters/master/include-files/include-files.lua \
       --output result.html \
       main.md

instead of first having to manually (re-)download include-files.lua and calling

pandoc --lua-filter=include-files.lua \
       --output result.html \
       main.md

This would ensure that the latest version of a filter available at a specific URL is always used.

jgm commented 4 years ago

Even if pandoc allowed this, I wouldn't recommend doing this sort of thing, because of the security implications. If the upstream filter gets updated in a malicious or unintentionally destructive way, you'd be vulnerable. Remember, unlike a template, a lua filter can do virtually anything -- it could, for example, read your passwords and upload them to an external URL.

ickc commented 4 years ago

If someone wants to do this, they can point it to the URL of a particular commit, which makes it much more secure.
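The pinning idea can be approximated today with a small wrapper outside pandoc. The following is a minimal sketch, not anything pandoc provides: `fetch_pinned_filter` and `run_pandoc` are hypothetical helper names, and the SHA-256 check is an extra safeguard added on top of the commit-pinned URL. The filter is downloaded once, verified, cached, and then handed to pandoc as an ordinary local file.

```python
import hashlib
import subprocess
import urllib.request
from pathlib import Path


def fetch_pinned_filter(url, expected_sha256, cache_dir, fetch=None):
    """Return a local path to the filter at `url`, verified against a SHA-256 pin.

    Hypothetical helper: raises ValueError if the content does not match the pin.
    """
    if fetch is None:
        # Default downloader; callers may inject their own (e.g. for offline use).
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()

    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cached copy after the pin so a changed pin never reuses old content.
    target = cache_dir / f"{expected_sha256}.lua"

    data = target.read_bytes() if target.exists() else fetch(url)
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {url}")
    if not target.exists():
        target.write_bytes(data)
    return target


def run_pandoc(filter_path, input_file, output_file):
    """Call pandoc with the verified local copy of the filter."""
    subprocess.run(
        ["pandoc", f"--lua-filter={filter_path}",
         "--output", str(output_file), str(input_file)],
        check=True,
    )
```

In use, the URL would contain a commit hash in place of `master`, and the expected checksum would be recorded next to it after auditing the filter once.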

jgm commented 4 years ago

Yes, that's true, and for that reason we might want to consider allowing URLs for filters.

jgm commented 4 years ago

@tarleb If we did want to allow this, we'd probably want to use fetchItem in applyFilters, and then deliver the contents of the filter, rather than the file path, to the apply functions for Json and Lua filters respectively. We'd also need to check to see if the file (if it is a file) is executable, since Filters.JSON.apply needs to know that. I'd be curious about your thoughts on this.
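The decision described here can be sketched as follows. This is a Python illustration, not pandoc's Haskell code: `classify_filter` is a hypothetical name, and detecting Lua filters by file extension is an assumption of the sketch. It shows why a URL source complicates the JSON case: the executable check only makes sense for a local file, so fetched content has to be treated as non-executable.

```python
import os
from urllib.parse import urlparse


def classify_filter(spec):
    """Return (kind, needs_fetch) for a filter given as a local path or a URL.

    Hypothetical sketch: `kind` is "lua", "json-executable", or "json-interpreted".
    """
    parsed = urlparse(spec)
    is_url = parsed.scheme in ("http", "https")
    # For URLs, look only at the path component so query strings do not interfere.
    path = parsed.path if is_url else spec

    if path.lower().endswith(".lua"):
        kind = "lua"
    elif not is_url and os.path.isfile(path) and os.access(path, os.X_OK):
        # Only a local file can carry an executable bit.
        kind = "json-executable"
    else:
        # Fetched content (or a non-executable file) needs an interpreter.
        kind = "json-interpreted"
    return kind, is_url
```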

alerque commented 4 years ago

I wouldn't recommend doing this sort of thing, because of the security implications.

Again if you allow this for templates there is no reason not to allow it for filters. If you think it's a security issue that should be blocked for filters you should block it for templates as well. The code path to exploit this is a bit more convoluted in the case of templates but it is quite doable.

Personally I suggest allowing both but only with an additional --unsafe-url-fetch flag to clue people in that they are responsible for their own actions. Without the flag only allow local files, with the flag allow URLs in both places.

jgm commented 4 years ago

Again if you allow this for templates there is no reason not to allow it for filters

But I gave a reason above for distinguishing the two cases. Filters can launch missiles. Templates can't. The mischief you can do with a malicious template is quite limited.

alerque commented 4 years ago

Filters can launch missiles. Templates can't.

Can't they? I believe templates for some formats can. The limitation is only relevant to some output formats.

tarleb commented 4 years ago

I think that's an edge case; LaTeX templates can launch missiles, but only if --pdf-engine-opt=-shell-escape is set. But this is possible even from within a plain Markdown document (`\directlua{os.execute 'launch missiles'}`{=latex}). Do templates allow more?

For other formats, the problems start when the output file is opened by another program (macros in a Word file could be a possible example). I'd argue that this is a security problem of those programs, not pandoc.

tarleb commented 4 years ago

I gave it more thought but am not enthusiastic about the idea. My gut feeling is that this should be solved via a wrapper or maybe a filter. Some points which led me to that assessment:

It's not that I'm fundamentally opposed to the proposal, and I do think it would add value. It is just that I'd much rather we'd use LuaRocks or another package manager to keep filters up to date.

This also makes me think that maybe the pandoc ecosystem could use a more defined concept of "package". Oftentimes, filters, defaults, and templates are bundled together and intended to be used in combination. So maybe what we really need are pandoc packages?

ickc commented 4 years ago

I completely agree with @tarleb. This is the territory of a pandoc manager and if pandoc does that I think it is trying to be too smart.

I hope we can have more discussion (either here or separately in pandoc-discuss) about a pandoc package manager. We started talking about this, and there was even a prototype pandocpm lying somewhere in our pandoc-extras organization. But the problem I found is that the idea of pandocpm is still trying to be too smart, and managing executables can be tricky (in terms of potential security issues).

Even the auto-fetching of templates is a pandoc package manager problem (which is part of what is considered in pandocpm too). In the end, one wants to be able to have reproducibility in authoring pandoc documents, and right now that is very hard for a number of reasons, including not having a pandoc package manager (and also an index, like those in TeX).

There are at least 2 options:

Among data scientists conda is very common, and it can be useful here. As a start, it already has pandoc, pandocfilters, and panflute packaged. conda is a real package manager, so it can manage dependencies (in principle the panflute package can and should require pandoc<2.10; the machinery is there, but the maintainer of the "formula" in conda-forge hasn't done that). And importantly, conda is cross-platform: Windows, macOS, Linux; x86, x64, aarch64, and even other architectures are supported (more than the platforms pandoc supports), making it a very good candidate for a cross-platform pandoc package manager.

The Lua filter system has taken off and is the most "native" way of implementing pandoc filters, so if we can take advantage of a package manager there, it would be the most natural option and more self-contained than the above solution. Lua experts can say more about this!

On a tangent to this topic, it would be nice to have something like \usepackage in pandoc, perhaps in a metadata field, to make documents more self-contained.

jgm commented 4 years ago

Oftentimes, filters, defaults, and templates are bundled together and intended to be used in combination. So maybe what we really need are pandoc packages?

How far does --data-dir get you in that direction? What if we adopted the convention that if the user data directory contains a file defaults.yaml, it is automatically used as a defaults file? Then a directory containing an appropriate defaults.yaml, templates, and filters could serve as a "package" -- one would only need to point to it using --data-dir. Alternatively, we could introduce --package, which works like --data-dir but triggers the implicit defaults file as described above. (It could, more controversially, be made to search a canonical local package repository, and download the package from a blessed remote repository if it is not found there.)
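The convention can be emulated with a wrapper until something like `--package` exists. The sketch below is only an illustration under assumptions: `package_args` is a hypothetical helper, and it passes the defaults file explicitly with `--defaults`, since pandoc does not pick up a `defaults.yaml` from the data directory on its own.

```python
from pathlib import Path


def package_args(package_dir, extra_args=()):
    """Build a pandoc argument list that treats `package_dir` as a "package".

    Hypothetical helper: the directory becomes the data dir, and a
    defaults.yaml inside it (if present) is passed as the defaults file.
    """
    package_dir = Path(package_dir)
    args = ["pandoc", f"--data-dir={package_dir}"]
    defaults = package_dir / "defaults.yaml"
    if defaults.is_file():
        # Emulates the proposed implicit defaults file by naming it explicitly.
        args.append(f"--defaults={defaults}")
    args.extend(extra_args)
    return args
```

The resulting list can be handed to `subprocess.run(args, check=True)`; remaining arguments such as the input file and `--output` go in `extra_args`.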

tarleb commented 4 years ago

Wow, an option like --package sounds great! Would it make sense to also accept zip archives as an alternative to directory paths?

tarleb commented 4 years ago

A good resource on package updating is The Update Framework. Haskell's cabal program uses it as its security model, as does conda. Maybe it would be a good GSoC project to explore this further in the context of pandoc?

I like the idea of using conda, but that would leave out (or require extra steps from) R Markdown users. Packaging for both R and conda would probably serve 95% of all pandoc users. That's not bad! Probably good to leave the details to the respective communities?

cagix commented 4 years ago

A (consistent) --data-dir or --package would be great :)

As of now, I have to use --resource-path for certain resources and also --data-dir for other resources. So, from a user perspective it would be desirable to have just the one option to specify a folder containing filters/, resources/, defaults/, ... subdirs.

jgm commented 4 years ago

I think supporting zipped directories with a potential --package option makes a lot of sense....and is not hard.
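A minimal sketch of the zipped-directory idea, again as a wrapper rather than anything built into pandoc: `unpack_package` is a hypothetical helper that extracts an archive into a directory which could then be used as the package directory (for example via `--data-dir`). The explicit path check is an assumption about what a careful implementation would want, since archives from third parties may contain entries pointing outside the target directory.

```python
import tempfile
import zipfile
from pathlib import Path


def unpack_package(archive, dest=None):
    """Extract a zipped package and return the directory holding its files.

    Hypothetical helper: `archive` is a path or file object; `dest` defaults
    to a fresh temporary directory. Requires Python 3.9+ (Path.is_relative_to).
    """
    if dest is None:
        dest = tempfile.mkdtemp(prefix="pandoc-package-")
    root = Path(dest)
    root.mkdir(parents=True, exist_ok=True)
    root = root.resolve()

    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            # Refuse entries that would land outside the destination directory.
            if not (root / name).resolve().is_relative_to(root):
                raise ValueError(f"unsafe path in archive: {name}")
        zf.extractall(root)
    return root
```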

I have never used conda, but it purports to be multiplatform and language-agnostic, so maybe it would work. However, getting into the business of providing remote packages opens a lot of cans of worms. Do we take responsibility for auditing the packages in the repository so they don't include vectors for malware (remember, filters can do anything)? If so, that's potentially a lot of work and a lot of responsibility. If not, the proposed feature might end up being a vector for bad stuff. Rather than providing a central repository, one might just go one step beyond what we have with filters: you can check out someone's package repository as a directory, audit it yourself, and use it at your own risk.

In any case, one key missing piece is a way to address file paths relative to the package directory in defaults files and perhaps YAML metadata (see #5982, #5977). (E.g., you might include in your package a logo which gets referred to in a template; the template wouldn't need to refer to it directly if it could be set in a variable in a defaults file, but the defaults file would have to be able to specify the path relative to the package directory.)

ickc commented 4 years ago

I like the idea of using conda, but that would leave out (or require extra steps from) R Markdown users.

R packages can be managed using conda: https://docs.anaconda.com/anaconda/user-guide/tasks/using-r-language/ . But I don't know if a typical R programmer is going to use it that way. Also, R Markdown, while it uses pandoc, is kind of its own thing, so even if pandoc tries to support its users, I don't know if they would be on board with this "vanilla pandoc" ecosystem (they already have their own package manager, for example).

I wonder if we are overthinking the responsibility of auditing the code. It should be something like CTAN or PyPI. In principle anyone can upload any code there, and in the past malware has existed, at least on PyPI. It would be unreasonable to ask them to audit every piece of code before letting maintainers release it. And neither should the "pandoc packaging index" do that.

In the past we tried to build a 3rd party filter repo, but it is hard without pandoc's official blessing. But basically this is what is done in pandoc/lua-filters: a repo of pandoc filters with user contributions.

Below are 2 different directions...

Build our own package manager

There are 2 examples of how we could expand something like the lua-filters repo into something bigger while still relying on voluntary contributions: homebrew and conda-forge. The main difference is that in homebrew, all "formulae" (recipes to obtain a package) live in one single monolithic repo, while conda-forge has each "feedstock" (recipe) in its own separate repo. I think conda-forge's feedstock model makes more sense for us, something like:

  1. create an organization (or just use the GitHub Organization pandoc or pandoc-extras) and announce it as an official pandoc package index
  2. ask everyone to submit their filters there, with certain guidelines and requirements (say naming conventions, perhaps a yml file for metadata describing the filter)
  3. the first time someone is submitting, obviously they need to join the organization, that's the primary level of defense—if the user looks legit, perhaps by providing some example, they are allowed to join.
  4. make it clear that when installing a filter, you are trusting the author, not the index, which is what happens in other indices such as PyPI
  5. in the event of a security problem, the maintainers revoke their access and announce it to the community.

Use an existing tool such as conda

Conda on one hand seems overkill. I think the momentum needed to kick start it would be much larger. It is a hassle for most simple things used in pandoc such as a template or a single file, self-contained filter.

But there is another kind of filter: complex, multi-module, importing multiple 3rd party libraries. A pandoc package manager cannot effectively manage this. With 3rd party dependencies (such as pandas, XlsxWriter, etc.), which might have their own dependencies, it is next to impossible to manage them in a homegrown pandoc package manager. But that is exactly the kind of application conda is built for.

Conda can manage any dependencies, not only Python ones. In fact, that is the reason pip was not good enough and Guido told them they should build their own: many code bases in scientific computing have FORTRAN, C, C++, etc. dependencies, and conda was built for those cases and can therefore handle those dependencies too. R, Julia, etc. can also be installed in conda. However, it doesn't have a Haskell-related toolchain yet.

After writing this, I think a home-grown pandoc package manager together with an index makes the most sense in most cases. A fallback would be conda for the more complicated situations. It would be great if it had some sort of official blessing (such as letting these external conda packages appear in the pandoc package index, where there is only metadata directing people to install them using conda).

mb21 commented 4 years ago

To address the relative path issues, etc., maybe the --package option shouldn't take an argument, but require the user to cd into the directory first? Similar to how you have to cd into a project dir with most package managers before you can add dependencies to that project.

(Speaking of which, I guess nobody here has used nix extensively..? From what I hear, the learning curve is quite steep, but truly reproducible builds are a wonderful thing..)

So I imagine this somehow like cd mydir; pandoc --package, which would then also sandbox pandoc to only access files in that dir or its subdirs (similar to #5045). Not sure there would be any way to invoke LaTeX or filters then though...? Or are we better off leaving that kind of functionality to Docker?

ickc commented 4 years ago

I think conda can probably do what nix does. E.g., conda builds are reproducible: building on conda-forge uses the compilers provided there, not the OS's (but again, it doesn't have Haskell).

Nix may have a problem with installing into a user-specified path. I heard that the read-only root in macOS Catalina causes problems for nix because of that.

Conda has an interesting approach to solving that, which guarantees that it will run correctly wherever the prefix is.

In terms of learning curve, I don't know which is harder, but conda is not easy except for the simplest case (PyPI-compatible packages).