jgm / pandoc

Universal markup converter
https://pandoc.org

pandoc metadata as representation of JATS metadata #8359

Closed castedo closed 2 years ago

castedo commented 2 years ago

In using pandoc I've encountered issues that I'm not sure whether to consider inside or outside the scope of what pandoc should handle.

This issue/feature of pandoc metadata representing JATS metadata can probably be closed, but I wanted to share my usage scenario and double check what is outside of scope. To frame the scope, I suspect the following question is useful:

What is the pandoc metadata for JATS supposed to be? Is it:

  1. a highly interoperable common data schema to be shared by many different formats, or
  2. a YAML representation that is convenient for authors to set data inside JATS XML output, or
  3. a JSON-compatible passive data structure [1] representation of JATS XML article metadata?

Currently it seems the answer is primarily 1) and optionally 2), and not 3). I'd say pandoc currently does a poor job of 3), which I hope is because it is out of scope.

Here's a concrete use case that affects me and illustrates some of the issues. In my YAML header I have the following metadata for pandoc:

author:
- surname: Ellerman
  given-names: E. Castedo
  email: castedo@castedo.com
  orcid: 0000-0002-5014-4809
date:
  iso-8601: 2022-08-24
  type: eprint
  year: 2022
  month: 08
  day: 24

which outputs the following JATS XML:

  <contrib-group>
    <contrib contrib-type="author">
      <contrib-id contrib-id-type="orcid">0000-0002-5014-4809</contrib-id>
      <name>
        <surname>Ellerman</surname>
        <given-names>E. Castedo</given-names>
      </name>
      <email>castedo@castedo.com</email>
    </contrib>
  </contrib-group>
  <pub-date date-type="eprint" publication-format="electronic" iso-8601-date="2022-08-24">
    <day>24</day>
    <month>8</month>
    <year>2022</year>
  </pub-date>

That JATS XML, if converted back into YAML+markdown via pandoc, becomes:

author:
- E. Castedo Ellerman
date: 2022-08-24

If pandoc metadata is supposed to be primarily 1) and secondarily 2), then this seems fine and this issue can be closed. If not, I can file some more issues. I am currently starting to use separate Python libraries to extract metadata from JATS XML.
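For readers wondering what option 3) might look like outside pandoc, here is a minimal sketch using only the Python standard library. It parses the JATS fragment shown above into a JSON-compatible dictionary. The `<article-meta>` wrapper, the helper names `child_text` and `extract_metadata`, and the choice of output keys are assumptions made for illustration; this is not pandoc's behavior or any particular library's API.

```python
import json
import xml.etree.ElementTree as ET

# Fragment copied from the JATS output above. The <article-meta> wrapper is
# an assumption added here so the fragment parses as one XML document.
JATS = """<article-meta>
  <contrib-group>
    <contrib contrib-type="author">
      <contrib-id contrib-id-type="orcid">0000-0002-5014-4809</contrib-id>
      <name>
        <surname>Ellerman</surname>
        <given-names>E. Castedo</given-names>
      </name>
      <email>castedo@castedo.com</email>
    </contrib>
  </contrib-group>
  <pub-date date-type="eprint" publication-format="electronic" iso-8601-date="2022-08-24">
    <day>24</day>
    <month>8</month>
    <year>2022</year>
  </pub-date>
</article-meta>"""


def child_text(parent, path):
    """Return the stripped text of the first element matching path, or None."""
    node = parent.find(path)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def extract_metadata(xml_source):
    """Build a plain dict (hypothetical schema) from JATS article metadata."""
    root = ET.fromstring(xml_source)
    authors = []
    for contrib in root.iterfind("contrib-group/contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        # Pick the ORCID out of possibly several <contrib-id> elements.
        orcid = None
        for contrib_id in contrib.iterfind("contrib-id"):
            if contrib_id.get("contrib-id-type") == "orcid" and contrib_id.text:
                orcid = contrib_id.text.strip()
        authors.append({
            "surname": child_text(contrib, "name/surname"),
            "given-names": child_text(contrib, "name/given-names"),
            "email": child_text(contrib, "email"),
            "orcid": orcid,
        })
    date = None
    pub_date = root.find("pub-date")
    if pub_date is not None:
        date = {
            "iso-8601": pub_date.get("iso-8601-date"),
            "type": pub_date.get("date-type"),
            "year": child_text(pub_date, "year"),
            "month": child_text(pub_date, "month"),
            "day": child_text(pub_date, "day"),
        }
    return {"author": authors, "date": date}


metadata = extract_metadata(JATS)
print(json.dumps(metadata, indent=2))
```

Values are kept as strings exactly as they appear in the XML, so the month comes back as `"8"` rather than being re-padded or converted to an integer.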

Thank y'all for such a wonderful tool!

[1] https://en.wikipedia.org/wiki/Passive_data_structure

jgm commented 2 years ago

There isn't currently a standardized structured metadata format that will work optimally with all formats pandoc supports. The JATS writer supports JATS-specific structured metadata, as you've illustrated. But should the JATS reader produce this too? That would be very useful if you're going to re-render as JATS. (Then again, converting JATS to JATS is not so useful.) But if you're going to be rendering some other format, then you'd prefer to have something every pandoc format can handle, which is what the JATS reader currently gives you.
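To make the trade-off concrete, here is a small Python sketch of the kind of lossy flattening described above: structured, JATS-specific fields are collapsed into plain strings that any output format can use. The functions `flatten_author` and `flatten_metadata` are hypothetical names for illustration only. Pandoc itself is written in Haskell and this is not its implementation.

```python
def flatten_author(author):
    """Collapse a structured author mapping into one display string."""
    if isinstance(author, str):
        return author
    parts = [author.get("given-names"), author.get("surname")]
    return " ".join(part for part in parts if part)


def flatten_metadata(meta):
    """Reduce structured metadata to simple, format-neutral values."""
    flat = {"author": [flatten_author(a) for a in meta.get("author", [])]}
    date = meta.get("date")
    if isinstance(date, dict):
        # Keep only the ISO date; type, year, month and day are dropped.
        flat["date"] = date.get("iso-8601")
    elif date is not None:
        flat["date"] = str(date)
    return flat


# Structured metadata as in the YAML header earlier in this thread.
structured = {
    "author": [{
        "surname": "Ellerman",
        "given-names": "E. Castedo",
        "email": "castedo@castedo.com",
        "orcid": "0000-0002-5014-4809",
    }],
    "date": {"iso-8601": "2022-08-24", "type": "eprint"},
}

flat = flatten_metadata(structured)
print(flat)
```

The flattened result is usable by every format, but the email, ORCID and date type cannot be recovered from it, which is the round-trip loss reported in this issue.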

jgm commented 2 years ago

I think @tarleb has done some thinking about standardizing structured metadata, e.g. in his scholarly markdown project, so he may want to comment.

castedo commented 2 years ago

> (Then again, converting JATS to JATS is not so useful.) But if you're going to be rendering some other format, then you'd prefer to have something every pandoc format can handle

Great point that I very much agree with.

castedo commented 1 year ago

For reference, I will use this closed issue as a high-level nexus for other, more specific issues that relate to pandoc metadata representing JATS metadata.

"JATS" is ambiguous since there are so many dialects of JATS. I can suggest some names for dialects. I list them in rough order from least specific to most specific:

castedo commented 1 year ago

@kamoe, here's a summary of issues with pandoc attempting to represent JATS metadata.

There are issues where the pandoc reader incorrectly represents metadata in JATS:

#8865

#8866

This affects not just PMC JATS but also the JATS that pandoc generates, which is documented at https://pandoc.org/jats.html

Then there's PMC and pandoc JATS metadata that isn't read at all and is absent from the pandoc metadata produced by the reader:

#8867

Last but not least, in addition to the above, there are more JATS elements that are documented at https://pandoc.org/jats.html and show up in PMC XML but do not appear in the pandoc metadata from the JATS reader:

My solution to all these problems is to not use pandoc for metadata and instead use an XML parser. The fixes and enhancements that I would actually use are improvements to the processing of marked-up text rather than metadata (e.g. #8847).

kamoe commented 1 year ago

Thanks for this, @castedo. I note all your comments and concerns, and will take a good look at this. I'm very interested from the perspective of the implications for a future BITS reader, so this is all very relevant. The more JATS bugs get addressed, the fewer issues BITS inherits!