jgm / pandoc

Universal markup converter
https://pandoc.org

PMC JATS contributor metadata incorrectly represented by pandoc #8866

Open castedo opened 1 year ago

castedo commented 1 year ago

I am submitting this issue because @kamoe was interested in seeing cases like this. It is one case of the more general issue #8359 (closed as out of scope in late 2022). I recommend that this issue also be closed as out of scope, and that people use an XML parser rather than pandoc to extract PMC JATS metadata.

Here is a summary of jats.xml.txt:

<article ...>
  <front>
    ...
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">https://orcid.org/0000-0001-9106-7259</contrib-id>
          <name>
            <surname>Baumdicker</surname>
            <given-names>Franz</given-names>
          </name>
          <xref rid="iyab229-aff1" ref-type="aff">1</xref>
          <xref rid="iyab229-FM1" ref-type="author-notes"/>
        </contrib>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid" authenticated="false">https://orcid.org/0000-0001-8327-0142</contrib-id>
          <name>
            <surname>Bisschop</surname>
            <given-names>Gertjan</given-names>
          </name>
          <xref rid="iyab229-aff2" ref-type="aff">2</xref>
          <xref rid="iyab229-FM1" ref-type="author-notes"/>
        </contrib>
        ...
        <contrib contrib-type="author" corresp="yes">
          <contrib-id contrib-id-type="orcid" authenticated="false">https://orcid.org/0000-0002-7894-5253</contrib-id>
          <name>
            <surname>Kelleher</surname>
            <given-names>Jerome</given-names>
          </name>
          <xref rid="iyab229-cor1" ref-type="corresp"/>
          <xref rid="iyab229-aff7" ref-type="aff">7</xref>
          <!--jerome.kelleher@bdi.ox.ac.uk-->
        </contrib>
      </contrib-group>
      <aff id="iyab229-aff1"><label>1</label><institution>Cluster of Excellence...</institution>, ... <country country="DE">Germany</country></aff>
      ...
      <contrib-group>
        <contrib contrib-type="editor">
          <name>
            <surname>Browning</surname>
            <given-names>S</given-names>
          </name>
          <role>Editor</role>
        </contrib>
      </contrib-group>
      <author-notes>
        <fn id="iyab229-FM1">
          <label>&#x2020;</label>
          <p>Franz Baumdicker, Gertjan Bisschop, Daniel Goldstein, Graham Gower, Aaron P. Ragsdale, Georgia Tsambos and Sha Zhu contributed equally to this work, as joint first authors.</p>
        </fn>
        ...
        <corresp id="iyab229-cor1">Corresponding author: Email: <email>jerome.kelleher@bdi.ox.ac.uk</email></corresp>
      </author-notes>
      ...
    </article-meta>
  </front>
</article>

which is a simplification of the PMC JATS XML file of the article https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9176297/. This is not a contrived example: it is the very first JATS XML file I picked out of the millions of JATS XML files archived in the PMC Open Access Subset.

The AST produced by pandoc is out.json.txt.

There are many ways in which pandoc fails to represent PMC JATS contributors satisfactorily. Here are some highlights:

  1. the editor (S Browning) is represented as one of the authors
  2. the information about which name is the corresponding author has been lost
  3. all of the ORCIDs have been lost
  4. the identification of what parts of the names are surname vs given names has been lost
  5. the author notes, which apply to various parts of the author list, are included within the "MetaInlines" of the last author
  6. the information linking author notes to author names, such as which authors contributed equally as first authors, has been lost
  7. there is no way to link the institutional affiliations with the authors
  8. the institution names are prepended with an integer, as though an integer is part of the institution name
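
A minimal sketch of the XML-parser route recommended above, using only Python's standard library. It assumes the elements carry no XML namespace, as in the summary, and the `SAMPLE` string is a further trimmed copy of that summary rather than the real PMC file. The function name `extract_contributors` and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <article-meta> summary above (illustrative only).
SAMPLE = """<article><front><article-meta>
  <contrib-group>
    <contrib contrib-type="author">
      <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-9106-7259</contrib-id>
      <name><surname>Baumdicker</surname><given-names>Franz</given-names></name>
      <xref rid="iyab229-aff1" ref-type="aff">1</xref>
      <xref rid="iyab229-FM1" ref-type="author-notes"/>
    </contrib>
    <contrib contrib-type="author" corresp="yes">
      <contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7894-5253</contrib-id>
      <name><surname>Kelleher</surname><given-names>Jerome</given-names></name>
      <xref rid="iyab229-cor1" ref-type="corresp"/>
      <xref rid="iyab229-aff7" ref-type="aff">7</xref>
    </contrib>
  </contrib-group>
  <contrib-group>
    <contrib contrib-type="editor">
      <name><surname>Browning</surname><given-names>S</given-names></name>
      <role>Editor</role>
    </contrib>
  </contrib-group>
</article-meta></front></article>"""


def extract_contributors(jats_xml):
    """Return one dict per <contrib>, keeping the fields listed as lost above."""
    root = ET.fromstring(jats_xml)
    contributors = []
    for contrib in root.iter("contrib"):
        xrefs = contrib.findall("xref")
        # First <contrib-id> whose type is "orcid", or None if there is none.
        orcid = next(
            (cid.text for cid in contrib.findall("contrib-id")
             if cid.get("contrib-id-type") == "orcid"),
            None,
        )
        contributors.append({
            "type": contrib.get("contrib-type"),  # author vs editor
            "surname": contrib.findtext("name/surname"),
            "given_names": contrib.findtext("name/given-names"),
            "orcid": orcid,
            "corresponding": contrib.get("corresp") == "yes",
            # rid values that point at <aff> and <fn> elements elsewhere
            "aff_ids": [x.get("rid") for x in xrefs if x.get("ref-type") == "aff"],
            "note_ids": [x.get("rid") for x in xrefs
                         if x.get("ref-type") == "author-notes"],
        })
    return contributors
```

Calling `extract_contributors(SAMPLE)` yields three records in which the editor stays distinguishable from the authors and the `rid` values remain available for looking up affiliations and author notes.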

castedo commented 1 year ago

Also worth noting: some of these sub-issues also apply to the JATS that pandoc itself generates, which is documented as the JATS supported by pandoc at https://pandoc.org/jats.html. For instance, the ORCIDs and the separation of surnames from given names.

kamoe commented 1 year ago

So the current output for this example is:

Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "author"
            , MetaList
                [ MetaInlines [ Str "Franz" , Space , Str "Baumdicker" ]
                , MetaInlines [ Str "Gertjan" , Space , Str "Bisschop" ]
                , MetaInlines
                    [ Str "Jerome"
                    , Space
                    , Str "Kelleher"
                    , Note
                        [ Para
                            [ Str "Franz"
                            , Space
                            , Str "Baumdicker,"
                            , Space
                            , Str "Gertjan"
                            , Space
                            , Str "Bisschop,"
                            , Space
                            , Str "Daniel"
                            , Space
                            , Str "Goldstein,"
                            , Space
                            , Str "Graham"
                            , Space
                            , Str "Gower,"
                            , Space
                            , Str "Aaron"
                            , Space
                            , Str "P."
                            , Space
                            , Str "Ragsdale,"
                            , Space
                            , Str "Georgia"
                            , Space
                            , Str "Tsambos"
                            , Space
                            , Str "and"
                            , Space
                            , Str "Sha"
                            , Space
                            , Str "Zhu"
                            , Space
                            , Str "contributed"
                            , Space
                            , Str "equally"
                            , Space
                            , Str "to"
                            , Space
                            , Str "this"
                            , Space
                            , Str "work,"
                            , Space
                            , Str "as"
                            , Space
                            , Str "joint"
                            , Space
                            , Str "first"
                            , Space
                            , Str "authors."
                            ]
                        ]
                    , SoftBreak
                    , Str "Corresponding"
                    , Space
                    , Str "author:"
                    , Space
                    , Str "Email:"
                    , Space
                    , Link
                        ( "" , [] , [] )
                        [ Str "jerome.kelleher@bdi.ox.ac.uk" ]
                        ( "mailto:jerome.kelleher@bdi.ox.ac.uk" , "" )
                    ]
                ]
            )
          , ( "institute"
            , MetaList
                [ MetaInlines
                    [ Str "1Cluster"
                    , Space
                    , Str "of"
                    , Space
                    , Str "Excellence...,"
                    , Space
                    , Str "..."
                    , Space
                    , Str "Germany"
                    ]
                ]
            )
          ]
    }
  []

I played around with what we did for https://github.com/jgm/pandoc/issues/8867 and got the following alternative:

Pandoc
  Meta
    { unMeta =
        fromList
          [ ( "author"
            , MetaList
                [ MetaMap
                    (fromList
                       [ ( "contrtib-id"
                         , MetaString
                             "https://orcid.org/0000-0001-9106-7259"
                         )
                       , ( "given-names" , MetaString "Franz" )
                       , ( "surname" , MetaString "Baumdicker" )
                       ])
                , MetaMap
                    (fromList
                       [ ( "contrtib-id"
                         , MetaString
                             "https://orcid.org/0000-0001-8327-0142"
                         )
                       , ( "given-names" , MetaString "Gertjan" )
                       , ( "surname" , MetaString "Bisschop" )
                       ])
                , MetaMap
                    (fromList
                       [ ( "contrtib-id"
                         , MetaString
                             "https://orcid.org/0000-0002-7894-5253"
                         )
                       , ( "given-names" , MetaString "Jerome" )
                       , ( "surname" , MetaString "Kelleher" )
                       ])
                ]
            )
          , ( "author-notes"
            , MetaList
                [ MetaInlines
                    [ Note
                        [ Para
                            [ Str "Franz"
                            , Space
                            , Str "Baumdicker,"
                            , Space
                            , Str "Gertjan"
                            , Space
                            , Str "Bisschop,"
                            , Space
                            , Str "Daniel"
                            , Space
                            , Str "Goldstein,"
                            , Space
                            , Str "Graham"
                            , Space
                            , Str "Gower,"
                            , Space
                            , Str "Aaron"
                            , Space
                            , Str "P."
                            , Space
                            , Str "Ragsdale,"
                            , Space
                            , Str "Georgia"
                            , Space
                            , Str "Tsambos"
                            , Space
                            , Str "and"
                            , Space
                            , Str "Sha"
                            , Space
                            , Str "Zhu"
                            , Space
                            , Str "contributed"
                            , Space
                            , Str "equally"
                            , Space
                            , Str "to"
                            , Space
                            , Str "this"
                            , Space
                            , Str "work,"
                            , Space
                            , Str "as"
                            , Space
                            , Str "joint"
                            , Space
                            , Str "first"
                            , Space
                            , Str "authors."
                            ]
                        ]
                    , SoftBreak
                    , Str "Corresponding"
                    , Space
                    , Str "author:"
                    , Space
                    , Str "Email:"
                    , Space
                    , Link
                        ( "" , [] , [] )
                        [ Str "jerome.kelleher@bdi.ox.ac.uk" ]
                        ( "mailto:jerome.kelleher@bdi.ox.ac.uk" , "" )
                    ]
                ]
            )
          , ( "institute"
            , MetaList
                [ MetaInlines
                    [ Str "1Cluster"
                    , Space
                    , Str "of"
                    , Space
                    , Str "Excellence...,"
                    , Space
                    , Str "..."
                    , Space
                    , Str "Germany"
                    ]
                ]
            )
          ]
    }
  []

I think the above addresses points 1-5. In a nutshell, this splits the authors' names, keeps their IDs, and moves the author notes out of the last author, placing them at the same level as author. The drawback is that, when converting back to JATS, the corresponding author information is lost, compared to the result obtained with the current output. But I think that would be a Writer issue, since the native format did not lose that information.

We could keep developing this, but I am conscious that a well-presented solution that round-trips all the desirable information won't stop at just the JATS reader. As I said, some of this responsibility lies in updating the writers (which can quickly spiral the scope...)

@jgm What do you think?

castedo commented 1 year ago

My quick 2 cents on the above is that splitting authors names and keeping the orcid are great enhancements. I suggest ditching the author-notes for now. That's a big can of worms and it's not clear pandoc users would use this in any consistent way any time soon. I suspect existing JATS is a mess with 'author-notes'. The real JATS XML example way above that I gave has 'author-notes' that is a metadata mess.

I'm not sure what the state of email under author is, but it seems worth making sure that it round-trips/circles through the reader and writer. In summary, I think the basic key author structure to round-trip/circle through the reader is just a list of authors where each author has four submembers: surname, given-names, email address(es) and orcid.
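
As a rough sketch of that four-member author structure and what "round-tripping" it means, the Python below writes an author to a `<contrib>` element and reads it back. The `Author` class and the `to_contrib`/`from_contrib` helpers are hypothetical names, and the element layout (ORCID in `<contrib-id>`, emails as `<email>` children of `<contrib>`) is an assumption for illustration, not a statement of what the pandoc JATS Writer emits.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class Author:
    """The four submembers suggested above (hypothetical class)."""
    surname: str
    given_names: str
    emails: list = field(default_factory=list)
    orcid: str = ""


def to_contrib(author):
    """Serialize an Author as a JATS-like <contrib> element (assumed layout)."""
    contrib = ET.Element("contrib", {"contrib-type": "author"})
    if author.orcid:
        cid = ET.SubElement(contrib, "contrib-id", {"contrib-id-type": "orcid"})
        cid.text = author.orcid
    name = ET.SubElement(contrib, "name")
    ET.SubElement(name, "surname").text = author.surname
    ET.SubElement(name, "given-names").text = author.given_names
    for address in author.emails:
        ET.SubElement(contrib, "email").text = address
    return contrib


def from_contrib(contrib):
    """Parse the same layout back; takes the first <contrib-id> as the ORCID."""
    return Author(
        surname=contrib.findtext("name/surname", default=""),
        given_names=contrib.findtext("name/given-names", default=""),
        emails=[e.text or "" for e in contrib.findall("email")],
        orcid=contrib.findtext("contrib-id", default=""),
    )
```

The structure round-trips if `from_contrib(to_contrib(a)) == a` holds for every author `a`. The same equality check, applied to the reader and writer together, is what the paragraph above asks of pandoc.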

Now that summer is over I can spend some time helping define different scopes of JATS data. In particular, I'm trying to nail down a tiny subset of JATS4R that the pandoc wrappers basecast and epijats should support. Partly for lack of a better name, I'll call this tiny subset "Baseprint JATS" (https://github.com/singlesourcepub/community/discussions/53). With something like this enhancement done, I think a really good initial version of Baseprint JATS is JATS XML that circles back to itself via the pandoc JATS Reader and the pandoc JATS Writer.

I can write more reasoning and docs on this topic this week.

jgm commented 1 year ago

It's a hard call. The current approach is compatible with what writers & templates expect, and I think it would be quite a job updating all the writers & templates to handle BOTH that and this particular structured approach. On the other hand, in most formats template changes should suffice to handle the structured authors, and maybe that could be left up to users. I guess it depends on what the main uses of the JATS reader are going to be.

castedo commented 1 year ago

Is it easy for the JATS Reader to spit out a highly structured, format-specific object model (e.g. separate given names and surname) and then have pandoc by default use a built-in filter that collapses that object model to something simpler (e.g. just a single string with the full name) that is more compatible with the writers and templates of many formats? Then, for advanced JATS, a user could replace/disable the default filter so that the full JATS object model is exposed?

Is this an approach that solves the trade-offs between supporting simple object models for many formats vs a complicated object model of one specific format?
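
To make the collapsing idea concrete, here is a minimal sketch of such a step written against pandoc's JSON AST, in plain Python with no filter library. It is a hypothetical user-side filter, not an existing pandoc feature. The key names `given-names` and `surname` are taken from the prototype output earlier in this thread, and the function names are invented.

```python
def collapse_author(meta_value):
    """Turn a structured author MetaMap into plain 'Given Surname' MetaInlines."""
    if meta_value.get("t") != "MetaMap":
        return meta_value  # already a simple value; leave it untouched
    fields = meta_value["c"]
    words = []
    for key in ("given-names", "surname"):
        part = fields.get(key)
        if part and part["t"] == "MetaString":
            words.extend(part["c"].split())
    inlines = []
    for i, word in enumerate(words):
        if i:
            inlines.append({"t": "Space"})  # separator between words
        inlines.append({"t": "Str", "c": word})
    return {"t": "MetaInlines", "c": inlines}


def collapse_authors(doc):
    """Apply collapse_author to each entry of the 'author' metadata list."""
    authors = doc.get("meta", {}).get("author")
    if authors and authors["t"] == "MetaList":
        authors["c"] = [collapse_author(a) for a in authors["c"]]
    return doc
```

Wrapped in a script that reads JSON on stdin and writes it to stdout, this could sit between `pandoc -f jats -t json` and `pandoc -f json -t html`, so that writers and templates keep seeing the simple author form. Note that the collapse discards the ORCID and any other structured fields, which is exactly the trade-off being asked about.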

castedo commented 1 year ago

I can't speak for other users of pandoc, but I keep coming back to the rough conclusion of #8359. One can categorize four types of data coming out of JATS:

  A) rich text (body and abstract and maybe title)
  B) simple metadata that is found in lots of other formats
  C) bibliography data
  D) highly specialized and variable research publication metadata found in PMC JATS

Pandoc seems great to use for A, B, C but not D. Direct access via an XML parser and/or library specifically for JATS (like elifetools) seems like a better choice for D.

In epijats I currently use elifetools to parse contributor information. It also does a lot of extra processing and transformation of contributor information into Python objects. To be frank, I'm not sure I would even use this enhancement if it showed up in pandoc.

Interestingly, I will most likely use the recent enhancement #8867, because elifetools does not handle that JATS4R metadata. I suspect that is because of the XML namespace.

kamoe commented 1 year ago

Thanks for the input, I think I see the challenge here. I played around with converting both the current and the alternative native versions above to markdown, html, and docbook, and sadly I can see that with the new alternative both html and docbook lose the author content completely. The output for markdown was improved and structured, but alas, I don't think it makes sense to go ahead with this idea if it is to have such a big impact on conversion into other formats.

I think improvements like these have a place and can be done (the fact that we can build this prototype alternative proves pandoc can support this kind of data structure, and can round-trip it), but as @jgm said, it will be quite a job to sync all readers and writers so that they can understand and process it. Given that we don't seem to have a strong use case for this, I'm happy to park it for now and focus on the other open issues we have. I would love to come back to this at a later time though, once I understand better how other components of pandoc work and how feasible a more coordinated approach can be (I'm just experimenting with the JATS reader, for now).

castedo commented 1 year ago

Sounds like a good plan.

On a related note, I have not started on a Baseprint JATS XML validator yet. But I suspect the parsing of the rich text inside JATS might be a good application of the pandoc JATS Reader (by checking that the rich text in JATS XML circles/round-trips through the pandoc AST). I imagine the XML schema for rich text inside BITS is the same as in JATS (but I'm not sure).