jgm / pandoc

Universal markup converter
https://pandoc.org

Fallback-search "user data directory" for defaults FILE options #5982

Closed iandol closed 3 years ago

iandol commented 4 years ago

In the defaults YAML files, for known file kinds (filters, templates etc.), Pandoc will search in the relevant subdirectory of its data-dir. But for other FILE-based options, e.g. reference-doc, it seems you cannot rely on data-dir to keep these files unless you specify an absolute path:

filters: modifyHeadings.lua # works
reference-doc: custom.docx # fails
csl: csl/apa.csl # fails

In this case, modifyHeadings.lua auto-resolves to $data-dir/filters, whereas custom.docx can only be used if it is in the current working directory, but is not additionally searched for in $data-dir . csl is another example. All this works with absolute paths, but these are fragile. One workaround solution would be to consider #5977 (being able to expand $HOME env variable), but having the pandoc user directory as a fallback search location simplifies the path descriptions even more.

jgm commented 4 years ago

Well, in general you CAN override any data file by putting it in data-dir.

But reference.docx is a bit of a tricky case, because there's not really a data file reference.docx -- pandoc constructs it from a number of files in the docx directory -- essentially an unpacked reference.docx -- each of which can be individually overridden in the data dir. (Try creating docx/word/styles.xml in your data dir and then create a docx; it should use your modified styles.)

It would make sense to check first for a reference.docx (or reference.odt) in the data dir before falling back on the default. But I think this issue only concerns those two.

iandol commented 4 years ago

Thanks John. csl also fails (when e.g. put in $data-dir/csl/apa.csl), and so does bibliography if I put my bib file in $data-dir/myrefs.bib, or citation-abbreviations etc.

With refs.yaml:

filters: 
  - pandoc-citeproc # process citations
metadata:
  bibliography: core.json #JSON faster than BIB, symlinked to Pandoc data dir
  csl: csl/apa.csl
  citation-abbreviations: cite-abbr.json # my journal abbreviations
  cite-method: citeproc

➜ pandoc -d refs -d docx -o test.docx test.md
pandoc-citeproc: PandocResourceNotFound "csl/apa.csl"
Error running filter pandoc-citeproc:
Filter returned error status 1

The error is in pandoc-citeproc, so I realise pandoc is not resolving the file first but just passing it directly... I suppose this is the same behaviour as the commandline. I think I may be spoilt from using pandocomatic which does resolve paths relative to $data-dir for its yaml config by default. You may think this is out-of-scope for Pandoc?

Regarding reference-doc, I use several different reference docs, and so can't just rely on a single reference.docx in the default location, and want different defaults files to be able to specify different reference-doc sources:

filters: modifyHeadings.lua # works
reference-doc: templates/custom-calibri.docx # fails

filters: modifyHeadings.lua # works
reference-doc: templates/custom-sourcesans.docx # fails

OK, so the default docx is not actually a file, but if the reference-doc entry in defaults.yaml is a docx, can't Pandoc assume it is a file?

jgm commented 4 years ago

What I said still holds. You can override pandoc's own data files by putting things in data-dir. What you're asking for goes beyond this documented behavior: a default search path for arguments to various command-line options. (All? Just some?)

See #3212 for a related issue.

Also #5977 for a proposal that would allow you to use environment variables to specify search paths in data-dir. We could, perhaps, have an automatically defined environment variable for data-dir itself. Another possibility would be to have some field in the defaults file that allows you to set the base directory for resolving relative paths within the defaults file.

iandol commented 4 years ago

Ha, I didn't remember I had opened #3212 several years ago on a similar theme! 😳

I think being able to specify data-dir with a pre-defined environment variable like ${PD-DATA}/ would be a workable solution.

jgm commented 4 years ago

There have been a lot of requests to have paths resolved relative to the directory containing defaults.yaml. I think we should consider that approach. I just want to make sure there are no gotchas.

jgm commented 4 years ago

Implementation note: when --defaults is encountered it simply reads the defaults file and updates the Opts structure. So one way to implement this would be by replacing the relative paths with absolute paths (in the directory containing the defaults file).

However, this wouldn't be desirable in many cases, e.g. for css it would screw up the relative URLs currently added for HTML output.

I suppose another approach would be to automatically add the directory containing defaults.yaml to the top of the resource-path, and also make sure the resource-path is always followed when handling files specified on the command line.
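The first approach in the implementation note above (rewriting relative paths against the directory that contains the defaults file) could look roughly like this. It is an illustrative Python sketch, not pandoc's Haskell code: `absolutize_paths`, `PATH_FIELDS` and the plain dict standing in for the Opts structure are all hypothetical.

```python
from pathlib import Path

# Hypothetical subset of fields treated as file paths; not pandoc's actual list.
# "css" is deliberately absent, since rewriting it would change the relative
# URLs emitted in HTML output.
PATH_FIELDS = {"reference-doc", "csl", "bibliography"}


def absolutize_paths(defaults_file, opts):
    """Return a copy of opts in which relative paths in PATH_FIELDS are made
    absolute against the directory containing defaults_file."""
    base = Path(defaults_file).resolve().parent
    resolved = dict(opts)
    for field in PATH_FIELDS:
        if field not in opts:
            continue
        path = Path(opts[field])
        if not path.is_absolute():
            resolved[field] = str(base / path)
    return resolved
```

Fields outside `PATH_FIELDS` pass through unchanged, which is one way to avoid the css problem mentioned above, at the cost of maintaining an explicit list of path-valued fields.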

allefeld commented 4 years ago

Having ${PD-DATA} would cover my use case in #5977, too, and would be even more elegant than using ${HOME}.

allefeld commented 4 years ago

paths resolved relative to the directory containing defaults.yaml

But that would be ${PD-DATA}/defaults, right? Maybe I misunderstand, but that would only make sense for relative paths specified in a file in that directory. Which would generalize to "relative paths specified in a defaults file are relative to its directory". Works too, but referencing ${PD-DATA} feels cleaner to me.

Maybe we can have both? :)

But for my purposes, ${PD-DATA} is completely sufficient.

jgm commented 4 years ago

The suggestion was not to provide environment variables, but instead to default to looking for files (with relative paths) in the same directory as the defaults.yaml.

But you might be right that providing a way to do ${DATADIR} or something is better, since then you could keep defaults.yaml in the working directory but still reference files in the user data directory.

jgm commented 4 years ago

One idea I just had would be to use a custom YAML tag for this:

css: !!datadir mycss.css

In T.P.Readers.Metadata we'd then only have to modify the clause for YAML.SUnknown to check for this tag type and, if it's found, prefix the path to the datadir. This is better than globally substituting something like ${DATADIR}, which one might want to include e.g. in a code block somewhere in the metadata. Possible drawbacks: (a) not sure we could handle this if we ended up switching back from HsYAML to yaml for performance reasons. (b) what if --data-dir is specified on the command line after the defaults file? Then we'll be looking in the old data dir.

Have you all tried just adding your user data dir to the --resource-path?

iandol commented 4 years ago

Adding resource path:

resource-path: ["/Users/ian/.local/share/pandoc/"]

to the first line of the YAML default file doesn't seem to work:

➜ pandoc -drefs
This is a test [@shipp2013] to see what happens
Could not find bibliography file: Core.json
Error running filter pandoc-citeproc:
Filter returned error status 1

resource-path is also not really a solution as we still require an absolute path in the defaults file, so then I'd just use absolute paths in the options themselves.

!!datadir solution sounds great to me if it is easier to implement. I think there is nothing to be done about your (b): that is up to the user to correctly specify IMO; but the performance problems and possible revert may be a blocker to using this solution...

jgm commented 4 years ago

A list of defaults fields that expect file paths and don't already have a search path defined (e.g. template): output-file, input-files, metadata-files, resource-path, filters, data-dir, abbreviations, extract-media, syntax-definitions, epub-metadata, epub-fonts, epub-cover-image.

jgm commented 4 years ago

It looks like Data.Yaml.Parser from the yaml package would allow me to handle custom tags. So, I'll explore that -- I don't want to lock myself into a syntax change that can't be handled if we decide to move back to the yaml package, or provide it as a compile-time option for those who need to handle big yaml files.

dhimmel commented 4 years ago

I'm experiencing a similar issue with 2.9.2 when exporting to HTML. My defaults.yaml file contains:

include-after-body:
- plugins/anchors.html

pandoc \
  --data-dir /path/to/data-dir \
  --resource-path '.:/path/to/data-dir' \
  --defaults=defaults.yaml

/path/to/data-dir contains a file plugins/anchors.html but I get the error:

File plugins/anchors.html not found in resource path

It seems that neither --data-dir nor --resource-path are being searched for plugins/anchors.html. Basically, I have relative paths in the defaults file. How do I tell Pandoc what directories these paths are relative to?

jgm commented 4 years ago

Glancing at the code, it should be searching resource path for include-after-body. I need to investigate this.

jgm commented 4 years ago

I found the problem! The instruction to set the resource path came after the processing of include-after-body (etc.). I moved it to the beginning, and this should affect quite a few things.

jaybe-jekyll commented 4 years ago

Quick note of thanks for discussing and documenting these things ASAP as they are identified. I had been pulling my hair out for $a_long_time and now realize this will be fixed within the next release. Thanks! Until then... e.g.

resource-path: [my_resource_path/include-before-header.html]

brainchild0 commented 4 years ago

A few thoughts from throughout the discussion:

iandol commented 4 years ago

For my use case, and I think that of @allefeld, being able to search the pandoc data directory without specifying it explicitly would be sufficient. I'm trying not to declare paths in YAML, but simply access the existing data directory structure that Pandoc already uses for many files, extended to things like CSL for bibliographies. The Pandoc data directory contains the defaults files in a standard configuration, and already provides a search path for many support files. Though the easiest solution would be just to add it to the search path for all files that Pandoc tries to find (that is what pandocomatic does[1]), it was sort of agreed that requiring an ENV like ${DATADIR} would be more general, or a tag like !!datadir as an alternative way to implement it. Having access to ${DATADIR} or !!datadir would not preclude longer term solutions.


[1] one path to rule them all, and in the darkness bind them ;-)

brainchild0 commented 4 years ago

For my use case, and I think that of @allefeld, being able to search the pandoc data directory without specifying it explicitly would be sufficient.

Yes, I understand this from the discussion. The reason for my notes is that it's important that a solution can generalize to wider uses as well as make space for ongoing expansion of capabilities. As this happens, more users adopt the system, and new use cases emerge, and so on.

I think the suggestions I made, in their strictest form, affect you directly by adding verbosity to your use case. But in totality I think the effect may be beneficial, perhaps at a slight cost to you.

(Also, it's possible to add a field in the defaults file to toggle search behavior. [e.g. search-missing-files: yes]. It might be a bit ugly, but still completely reasonable if the use case is compelling.)

It was sort of agreed that requiring an ENV like ${DATADIR} would be more general, or a tag like !!datadir as an alternative way to implement it. Having access to ${DATADIR} or !!datadir would not preclude longer term solutions.

I think it would be hard to integrate a more general solution once either of these approaches is adopted, without first removing it. Trying to make a coordinated long-term solution coexist with one of these is likely to introduce problems that are difficult to solve but simple to avoid.

I suggest a better approach might be to adopt an extensible and robust strategy, even if the initial implementation is limited.

In the idea I proposed, the general unit of interpolation is an expression of two parts, a namespace and an item in the namespace. The use of the namespace makes it easy to transition incrementally from one category to several categories of items, without any serious migration or coexistence of two paradigms of solution. Use of a common template system captures the features and familiarity of such a system both immediately and in the distant future. In the case of an environment variable, the sequence would be one similar to {{env.DATADIR}}. It is not vastly different from ${DATADIR}, but also leaves substantially more space for expansion. It does not express an assumption that environment variables are the only category of items, nor does it leave any difficult later choice about how to add another category. Someone may complain about the slightly greater verbosity, but I fail to see why such a slight difference should be prohibitive in light of the other considerations.
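To make the namespace-plus-item idea concrete, here is a minimal Python sketch of such an interpolation step. It is only an illustration of the {{env.DATADIR}} form mentioned above: the `interpolate` function, the dict-of-dicts layout and the example path are assumptions, not part of pandoc or of any existing template library.

```python
import re

# Matches {{namespace.item}}, following the {{env.DATADIR}} example.
_EXPR = re.compile(r"\{\{\s*(\w+)\.(\w+)\s*\}\}")


def interpolate(text, namespaces):
    """Replace each {{namespace.item}} using a dict of dicts.

    Unknown namespaces or items raise KeyError instead of passing silently.
    """
    def lookup(match):
        namespace, item = match.group(1), match.group(2)
        return namespaces[namespace][item]

    return _EXPR.sub(lookup, text)


# Hypothetical values, for illustration only.
namespaces = {"env": {"DATADIR": "/home/user/.local/share/pandoc"}}
print(interpolate("{{env.DATADIR}}/csl/apa.csl", namespaces))
```

Adding a second category later would only mean adding another top-level key next to `env`; the expression syntax itself would not change.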

brainchild0 commented 4 years ago

Going through once again, I realize I'm not fully understanding the case.

Would someone clarify this statement?:

The Pandoc data directory contains the defaults files in a standard configuration, and already provides a search path for many support files.

Why is overriding certain application files but finding others a concern? Perhaps more generally, why is the issue of the user providing values being expressed as a problem of finding files? Could the application simply collect all the pieces of itself, then subsequently apply all the pieces of the user project, based on the contents of the files, without worrying about which files get found and which get not found, or overridden, simply on the basis of their paths?

iandol commented 4 years ago

The Pandoc data directory is a folder that Pandoc always knows about, and that already stores many files that are useful during conversion (filters, templates etc.).

I do think that your templating proposal is more extensible and elegant, and I'd be happy to be able to use {{env.DATADIR}} alongside the other templating options. But who is going to implement it? There is lots of core work that fully occupies the main developers, and not too many people with the requisite Haskell skills to add something like this. Pandoc feature development is slow and deliberative, and pragmatically this will take years if ever to make a reality. Allowing access to the pandoc data dir in the meantime with a tag is a pragmatic option. This does not stop a templating system being developed in the future, when and if a developer can take this on.

brainchild0 commented 4 years ago

@iandol: Your observations are sound, and incorporate a broader but no less important range of considerations compared to mine. The only further observation I would add is that, in the special case of environment variables, the syntax I gave could be emulated just as easily as any other could be implemented. If the proposal did generate enthusiasm, then adopting the syntax could be a first small step toward overall implementation, so as to avoid leaving any legacy along the way.

(I could push back a little on the pessimistic time frame, however, unless you're talking about the latency associated with the backlog of other features.)

The usecase: being able to share defaults with people who are not so comfortable with Pandoc by using a single resource folder to simplify install without editing files.

Is there any example or brief explanation that better illustrates why specifically a search path is relevant for function or non-trivial convenience compared to referring directly to each resource file?

iandol commented 4 years ago

I could push back a little on the pessimistic time frame, however, unless you're talking about the latency associated with the backlog of other features

Well, I'm happy if it turns out your optimism overturns my pessimism 👍 — but there are many more important issues that have taken years of deliberation and are still not yet implemented. This is not a criticism of Pandoc or the absolutely amazing work done by the core developers, just the hard cold reality of a powerful tool created for free.

Is there any example or brief explanation that better illustrates why specifically a search path is relevant for function or non-trivial convenience compared to referring directly to each resource file?

As an example I share a bunch of defaults files to help people with their writing workflows here: https://github.com/iandol/dotpandoc/tree/master/defaults — ideally they should just be able to download my pandoc data directory and not have to worry about having to re-edit all the paths manually. But unless their name also happens to be ian or that is what they set their home folder name to, they'll have to find and edit paths in all the defaults files. Not impossible of course, but I already have had many Scrivener users who would really really benefit from using Pandoc in their workflow, but who are already intimidated by the terminal where every additional step makes them feel overwhelmed. TL;DR - it is easier to share files when they are self-contained.

brainchild0 commented 4 years ago

@iandol: Your explanation succinctly captures the very sort of usage that I hoped to facilitate when I created the feature proposal of which John implemented a simplified form and named defaults.

Your case, if I understand it correctly, seems to be a form of what I considered as the standard base case, to put some files in a directory, one of which being what I originally called the project file, to be processed by the application in an entirely accessible and reproducible way.

Would it not be a complete solution if the paths were resolved relative to the file that provides them, or are your requirements more substantial?

iandol commented 4 years ago

For the case I presented (users who are taking advantage of Pandoc triggered from another piece of software[1]), the markdown file is the only "project file" that gets exported, and other files that do not change are not generated by that application. The idea is to keep unchanging CSL files, bibliography files, reference docs, templates, filters, defaults in a centralised place known to Pandoc.

Are you suggesting that the working directory, or the folder in which the defaults file is found, is automatically added?


[1] that itself is not concerned with deliberately supporting Pandoc

brainchild0 commented 4 years ago

@iandol: I am simply asking whether your immediate needs might be satisfied if the application were changed so that relative paths appearing in a defaults file were resolved relative to the location of that file rather than to the current working directory. This question was prompted by your hope, which I fully understand, that the user could download a directory representing an existing workflow and then simply run an operation, that behaves predictably, without editing any defaults files.

(In case the above still remains unclear, then consider another way to put it. If the application had had such a design before you began the project you describe, and also, for the sake of argument, it never had a data-dir field, might you have found an organization for your project that realizes the advantages you are currently seeking? What problems or inconveniences would remain?)

iandol commented 4 years ago

Yes, if there was no data-dir, then it would be possible to set up a structure relative to the defaults-dir. But what advantage does this bring, considering we do have a data-dir and I will nevertheless ask users to install templates, filters and defaults files themselves relative to this folder anyway (as this is the official place for such files)? This doesn't have to be either/or anyway, as Pandoc could have default path resolution, similar to the shell system path:

$WORKING_DIR:$DATA-DIR:$DEFAULTS-DIR

If a file is not in the working directory, then check relative to the data directory, and if it isn't there then relative to the defaults directory. That would satisfy more usecases.
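The fall-through order described above can be sketched in a few lines of Python. This is a hypothetical illustration of the proposed lookup, not pandoc's behaviour; `find_file` and the directory names in the usage note are assumptions.

```python
from pathlib import Path


def find_file(name, search_dirs):
    """Return the first existing match for a relative path, trying each
    directory in order; return None when nothing matches."""
    path = Path(name)
    if path.is_absolute():
        # Absolute paths bypass the search entirely.
        return path if path.exists() else None
    for directory in search_dirs:
        candidate = Path(directory) / path
        if candidate.exists():
            return candidate
    return None
```

A call such as `find_file("csl/apa.csl", [Path.cwd(), data_dir, defaults_dir])` would then check the working directory first, the data directory second and the defaults directory last, mirroring the shell-style path above.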

jaybe-jekyll commented 4 years ago

A typical and simple use case for me is desiring the data directory to first check the current working directory for the presence of data directory type files.

That of course could run into issues however, as typical folders named such as "defaults", etc. may overlap with other systems being employed, like a static site generator, or a script routine, etc.

So there would likely need to be a way to override and specify to include or not include the current working directory as the data directory.

Lastly, an obvious way to state the current working directory is ./ is to simply use the switch, --data-dir . ... but I find myself feeling uncomfortable having to always call it, and for those I share code and frameworks with, they don't remember or don't understand, etc.

brainchild0 commented 4 years ago

@iandol: My inquiry has been primarily directed at improving the use case that prompted your request in the most expedient way that you would consider satisfactory. What you consider satisfactory depends subjectively on you, of course, so any idea proposed by someone else carries a possibility of falling outside the intended scope. Nevertheless, I had considered a handful of reasons why it seemed worthwhile to solicit your response about relying on relative paths:

Further to your immediate concerns, several broader ones point toward a discord between search paths, which are quick to produce but fickle at the boundaries, compared to the more verbose but stable effect of relative or interpolated paths, which might better suit the project-oriented paradigm to which defaults files belong. I understand that you may well regard such discussion as beyond your current scope of interest.

iandol commented 4 years ago

I don't disagree with anything you say. If the search path always includes the path relative to the defaults file specified, then I'll just use ../ and make sure that I specify all defaults files relative to the pandoc data directory — this is a workable solution for me. But as Jon mentioned, there is a usecase for being able to access the pandoc data directory and specify files relative to that. If I wasn't clear before — I do not "move" the data directory using --data-dir, but simply use the default location that Pandoc itself generates when installed. I prefer to use that so I extend upon the default locations that Pandoc already specifies for files. Both scenarios use relative paths as far as I can see, we are simply specifying where the "anchor" for that relative path originates (and it could include both). I would be happy with YAML tags like !!datadir or ENV vars if the Pandoc developers do not want to change the existing search path resolution behaviour. I'd also be happy if someone implemented your template proposal. Many roads lead to Rome 😄 …

brainchild0 commented 4 years ago

@iandol: To be clear, the basic form of the suggestion is to use the location of the file as the single path for resolving files, the same as the CWD is now. Incorporating this location into some search sequence is a further embellishment of the basic case.

A tension seems to emerge, because the request is framed as relating to overriding defaults, but the details rather suggest to me a composite project. You may be using a mechanism intended for the former, attempting to resolve the latter, and finding an incompatibility.

You want users, it seems, to add their own contribution to a project you supply. We have a mechanism for such cases, though slightly clumsy in its present form, the aggregation of options from a sequence of project files. If your project would include a subdirectory serving as a template, and including a separate defaults file, then the user could customize the files in a copy of the template to a preferred location, without needing to edit the defaults file. This new directory would serve as you have intended the data directory.

The clumsier side of this approach is that the user would need to pass two defaults files to a single invocation of the application.

The advantage is flexibility, including your ability to provide multiple templates and the user's ability to maintain multiple versions of the customization, in any location.

More broadly, the advantage is not relying on the adoption of a feature that raises, for many, concerns over introducing adverse effects, and which is even in your case incomplete, as for citations passed to the filter.

iandol commented 4 years ago

You want users, it seems, to add their own contribution to a project you supply

Not exactly, I offer my Pandoc data directory as a starting point for others to expand upon, as simple as that. As I mention, I personally don't use defaults files as pandocomatic still offers significant advantages over and above solving my path problems. I prefer to keep all my files relating to Pandoc in a single place (which is the place Pandoc created when it installed itself).

To be clear, the basic form of the suggestion is to use the location of the file as the single path for resolving files, the same as the CWD is now.

I don't find any compelling reason to move out of the pandoc data directory to deal with a separate "project" directory that itself is separate from the CWD, and perhaps that is because I still don't understand what the advantage is to make defaults-dir the one true path?

brainchild0 commented 4 years ago

You want users, it seems, to add their own contribution to a project you supply

Not exactly, I offer my Pandoc data directory as a starting point for others to expand upon, as simple as that.

The difference may be only in language. I think two defaults files could produce an outcome similar to what you are seeking, putting aside for the moment the question of the simplicity of the command.

I don't find any compelling reason to move out of the pandoc data directory to deal with a separate "project" directory that itself is separate from the CWD,

The advantage of not using the CWD is to achieve consistent operation regardless of working directory. In the case of multiple defaults files, it causes relative paths in each defaults file to be resolved differently, as would be needed for two separate trees in separate locations. Regarding the data directory, it seems to be intended for usage different from what you initially suggested, which was to apply to a list of multiple searched directories the current treatment of CWD. Many may find counter-intuitive the idea of spreading a project over several directories, but omitting indication of any specific directory from the path given for each file. It's not how a project is organized in a general context.

brainchild0 commented 4 years ago

Also #1191 and #3635.

jgm commented 3 years ago

I'm currently inclining towards the following simple solution: in defaults files, and only for those fields that expect file paths as arguments, we expand ${HOME} and ${USERDATA}:

input-files
output-file
defaults
template
include-before-body
include-after-body
include-in-header
resource-path
bibliography
csl
citation-abbreviations
filters
data-dir
log-file
abbreviations
syntax-definitions
reference-doc
html-math-method / url
epub-metadata
epub-fonts
epub-cover-image

I thought about the more general "templating" solution, but there are a few issues with that: for one thing, many of the proposed template variables can depend on the contents of the default file itself.

The simple solution would be easy to implement and should resolve the problems people have had.

brainchild0 commented 3 years ago

I thought about the more general "templating" solution, but there are a few issues with that: for one thing, many of the proposed template variables can depend on the contents of the default file itself.

I suggest trying to find a solution that is extensible toward more general templates, even if any adoption of such an approach would not be immediate. No reason is evident why an extensible approach is not feasible even while seeking a less complicated solution in the foreseeable future.

What is the conflict of needing to evaluate variables through the contents of the file referencing them? A problem of such sort would appear to be soluble in general by multi-pass processing. Do you view the issue as one of implementation complexity, or do you see a more general conceptual difficulty?

iandol commented 3 years ago

@jgm — your proposed solution, adding ${HOME} and ${USERDATA} for those defined fields only would certainly solve the issue I and I think others (@allefeld etc.) raised in a simple and elegant manner!

jgm commented 3 years ago

I think I'll allow all environment variables (not just HOME). This gives a lot of flexibility.
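As a rough illustration of the proposal (expanding ${USERDATA} and environment variables, but only in fields that expect file paths), here is a Python sketch. It is not pandoc's implementation: `expand_paths`, the shortened `PATH_FIELDS` set, and the choice to leave unknown variables untouched are all assumptions made for the example.

```python
import os
import re

# Hypothetical subset of the path-valued fields listed in the proposal.
PATH_FIELDS = {"csl", "bibliography", "reference-doc", "filters"}
_VAR = re.compile(r"\$\{([^}]+)\}")


def expand_paths(opts, user_data_dir, environ=None):
    """Expand ${VAR} in path-valued fields only; other fields are untouched."""
    env = dict(os.environ if environ is None else environ)
    # Assumption: USERDATA takes precedence over an env var of the same name.
    env["USERDATA"] = user_data_dir

    def expand(value):
        if isinstance(value, list):
            return [expand(v) for v in value]
        if not isinstance(value, str):
            return value
        # Assumption: unknown variables are left as written.
        return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), value)

    return {k: expand(v) if k in PATH_FIELDS else v for k, v in opts.items()}
```

With this shape, a defaults entry such as `csl: ${USERDATA}/csl/apa.csl` would resolve against the user data directory, while a literal `${USERDATA}` inside metadata would be left alone, which addresses the earlier concern about global substitution.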