How to accommodate programmatic metadata alterations?

dhimmel commented 4 years ago

expanding https://github.com/manubot/manubot/issues/187#issuecomment-578773875 into an issue

In certain cases, it makes sense for users to enter only a subset of the final metadata that is needed by Pandoc filters and templates, and have a program auto-complete metadata.

For example, the following approaches would be convenient for users and help avoid error-prone data duplication:

assume author is a key-value object. If author.orcid is set, auto-complete missing author fields that can be retrieved from the ORCID API like author.name, author.email, author.affiliations.
assume author affiliations are described via an alphanumeric key or even inline. Add an affiliations object with numbered affiliations for use in front matter.
assume license is a key-value object. If license.spdx is set, detect license details from the SPDX API, such as name, URL, full text.
adding metadata that the user doesn't explicitly provide a seed value for at all. For example, the commit hash of HEAD if executed within a git repository.
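As a rough illustration of the affiliation-numbering idea above, here is a minimal Python sketch. It assumes a hypothetical input shape in which each author carries an inline list of affiliation names; the function name `number_affiliations` and the `number`/`name` keys are made up for this example and are not part of any agreed schema.

```python
def number_affiliations(authors):
    """Return (authors, affiliations) with affiliations numbered by first appearance.

    Assumed (hypothetical) input shape: each author is a dict whose
    "affiliations" value is a list of free-text affiliation names.
    """
    index = {}  # affiliation name -> assigned number (dicts keep insertion order)
    numbered_authors = []
    for author in authors:
        numbers = []
        for name in author.get("affiliations", []):
            if name not in index:
                index[name] = len(index) + 1
            numbers.append(index[name])
        # Copy the author so the user's original input is left untouched.
        numbered_authors.append({**author, "affiliations": numbers})
    affiliations = [{"number": n, "name": name} for name, n in index.items()]
    return numbered_authors, affiliations


authors = [
    {"name": "A. Author", "affiliations": ["University X", "Institute Y"]},
    {"name": "B. Author", "affiliations": ["Institute Y"]},
]
numbered, affiliations = number_affiliations(authors)
```

The user writes each affiliation once per author, and the numbered list that a front-matter template needs is derived rather than maintained by hand.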

Do we need to make our schema aware of auto-completion / auto-population? Do we need multiple schemas, like a user-schema that describes what the user should input rather than the final output-schema? Should the output-schema be a superset of the input-schema, such that auto-complete/populate only fills in additional values but does not delete any existing values?
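To make the "only fills in additional values" option concrete, here is a minimal Python sketch of a non-destructive merge. The function name `fill_missing` and the metadata keys are assumptions for illustration only; the auto-completed values are hard-coded here rather than fetched from any API.

```python
import copy


def fill_missing(user, auto):
    """Return a copy of `user` with values from `auto` added only where absent."""
    result = copy.deepcopy(user)
    for key, value in auto.items():
        if key not in result:
            result[key] = copy.deepcopy(value)
        elif isinstance(result[key], dict) and isinstance(value, dict):
            # Recurse so nested objects are completed field by field.
            result[key] = fill_missing(result[key], value)
        # Otherwise the user's value wins and is kept as-is.
    return result


# Hypothetical user input: the user chose their own short license name.
user = {"license": {"spdx": "CC-BY-4.0", "name": "CC BY 4.0"}}
# Hypothetical auto-completed values (hard-coded, not retrieved from SPDX).
auto = {
    "license": {
        "name": "Creative Commons Attribution 4.0 International",
        "url": "https://creativecommons.org/licenses/by/4.0/",
    }
}
completed = fill_missing(user, auto)
```

Under this rule every key the user supplied survives unchanged, so the output is always a superset of the input.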

@tarleb any general thoughts?

tarleb commented 4 years ago

> Do we need to make our schema aware of auto-completion / auto-population? Do we need multiple schemas, like a user-schema that describes what the user should input rather than the final output-schema? Should the output-schema be a superset of the input-schema, such that auto-complete/populate only fills in additional values but does not delete any existing values?

Two schemas seem like a good idea. I would prefer the output schema to be mostly independent of the input schema, which should give us more flexibility. Automatically populated fields could be marked as optional (or rather: not be marked as required), and I would like to see them included in the schema.
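A minimal Python sketch of what two mostly independent schemas could look like, assuming hypothetical field names and a toy checker in place of a real JSON Schema validator. Auto-populated fields such as `email` and `affiliations` appear in the output schema but are not marked as required.

```python
# Hypothetical input schema: what the user is asked to provide.
INPUT_AUTHOR_SCHEMA = {
    "properties": {"orcid": {}, "name": {}, "email": {}},
    "required": ["orcid"],
}

# Hypothetical output schema: what filters and templates receive.
# Auto-populated fields are listed but not required.
OUTPUT_AUTHOR_SCHEMA = {
    "properties": {"orcid": {}, "name": {}, "email": {}, "affiliations": {}},
    "required": ["name"],
}


def check(schema, data):
    """Return (missing required keys, keys unknown to the schema)."""
    missing = [key for key in schema["required"] if key not in data]
    unknown = [key for key in data if key not in schema["properties"]]
    return missing, unknown


user_author = {"orcid": "0000-0000-0000-0000"}  # placeholder ORCID
completed_author = {**user_author, "name": "A. Author"}  # after auto-completion

input_report = check(INPUT_AUTHOR_SCHEMA, user_author)
before_report = check(OUTPUT_AUTHOR_SCHEMA, user_author)
after_report = check(OUTPUT_AUTHOR_SCHEMA, completed_author)
```

Here the same author object passes the input schema as entered, fails the output schema until a name is filled in, and passes it afterwards.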

Would it make sense to develop the auto-population scripts here as well? Pandoc's Lua is currently lacking appropriate support to deal with web APIs (unless we get to do this GSoC project). Maybe Python?

jcolomb commented 4 years ago

I think it makes little sense, because unfortunately very few things can be completely automated. And when they can, users will want to proofread the results most of the time. As an example, getting the affiliation from ORCID is very difficult or impossible: you will get multiple affiliations per user, and the right one might be missing. The right one is also not necessarily the latest one, because authors should indicate the affiliation they had when they did the work (which can be years before the manuscript is written), and so on.

So I would just work on the output-schema one wants, and if some tools can autocomplete things, that needs to happen before pandoc takes action (it can be done via Python/R or other tools, but probably needs interaction with the user).

jcolomb commented 4 years ago

This might be done via a bot like whedon at JOSS. When asked, it would:

make a new branch
autocomplete the metadata from entries already given (get all information available)
commit the change on the new branch.

The user would then be asked to delete wrong/outdated information before merging.

dhimmel commented 4 years ago

> So I would just work on the output-schema one wants, and if some tools can autocomplete stuff, it needs to be done before pandoc

Okay, let's focus for now on the schema for metadata provided to pandoc and not pre-processors. And keep this topic in the back of our minds.

> Would it make sense to develop the auto-population scripts here as well?

I think this would expand the scope of this project too much at the moment. And the solutions won't be universal since different users will have different computational constraints. That being said, perhaps eventually we could create an official set of Python / Haskell / Lua auto-completion scripts.

With Manubot, we set considerable amounts of metadata automatically (example) in Python. I think there is a lot of opportunity to split out some of the more general-purpose auto-completion, but first we should create the schema.

Pandoc does some additional metadata tweaks during runtime, which complicates things a bit further, for example when the --bibliography option is supplied.

pandoc / scholarly-metadata

How to accommodate programmatic metadata alterations? #2