Add support for CRediT roles in JATS export

cthoyt commented 1 month ago

Here's some JATS XML that has author roles in it. It would be cool if pandoc supported this!

<article-meta>
<article-id pub-id-type="doi">10.1002/leap.1210</article-id>
…
<contrib-group>
<contrib contrib-type="author">
    <contrib-id content-type="orcid">0000-0002-9298-3168</contrib-id>
    <name>
        <surname>Allen</surname>
        <given-names>Liz</given-names>
    </name>
    <role vocab="credit" vocab-identifier="https://credit.niso.org/"
        vocab-term-identifier="https://credit.niso.org/contributor-roles/conceptualization/"
        vocab-term="Conceptualization">Conceptualization</role>
    <role vocab="credit" vocab-identifier="https://credit.niso.org/"
        vocab-term-identifier="https://credit.niso.org/contributor-roles/writing-original-draft/"
        vocab-term="Writing &#x2013; original draft">Writing &#x2013; original
        draft</role>
</contrib>

This is the template where this would need to get handled:

https://github.com/jgm/pandoc/blob/180d2b5a025362f7fdf20c68a8bc595e0938cb86/data/templates/article.jats_publishing#L92-L129

This is where this needs to get documented:

https://github.com/jgm/pandoc/blob/180d2b5a025362f7fdf20c68a8bc595e0938cb86/doc/jats.md?plain=1#L15

I am happy to try getting this working myself!

jgm commented 1 month ago

What sort of markdown metadata structure would you be envisioning?

cthoyt commented 1 month ago

@jgm thanks for the quick reply. I think inside the author block for document metadata, we need a roles list. Each role in the list is a dictionary with two keys:

- type, which holds the actual CRediT identifier
- degree, which says Lead, Equal, or Supporting (as prescribed by JATS)

The JATS metadata is a bit strange, since it requires both the identifier from CRediT and the written-out string, but I was hoping to use some kind of preprocessing script to inject the names so that you only have to write the identifiers.

E.g.:

authors:
  - name: John Doe
    affiliation: [ 1 ]
    roles:
      - type: 'formal-analysis'
        degree: 'lead'
  - name: John Boss
    affiliation: [ 1 ]
    roles:
      - type: 'funding-acquisition'
        degree: 'lead'
      - type: 'supervision'
        degree: 'lead'

estedeahora commented 1 month ago

I think incorporating CRediT roles in the JATS template is an interesting idea. In the ~!guri_ project we have proposed a similar modification of the template. Our version looks like this:

$if(author.credit)$
$for(author.credit)$
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="$it.elem$" vocab-term-identifier="$it.uri$">$it.cont$</role>
$endfor$
$endif$

Broadly speaking, the basic approach is similar. Here is an example of the Markdown metadata that is generated for this template:

author:
- affiliation:
  - aff1
  credit:
  - cont: Conceptualización
    elem: Conceptualization
    uri: "https://credit.niso.org/contributor-roles/conceptualization/"
  - cont: Curación de datos
    elem: Data curation
    uri: "https://credit.niso.org/contributor-roles/data-curation/"

There are, however, some differences worth highlighting that could perhaps be evaluated before this is incorporated:

- As @cthoyt points out, the <role> tag in JATS XML requires textual content. This content may vary depending on the language of the journal. For that reason, I think it is better to keep this field available in the Markdown inside each role (in our example it is the cont field).
- There is a problem with the vocab-term attribute (for which the commit uses type and we used elem) and the vocab-term-identifier attribute (which the commit fills in by appending type to the base URI). While the two attributes are related, the relationship is not as straightforward as presented. For example, for Investigation the valid URI is https://credit.niso.org/contributor-roles/investigation/ (i.e., the term is just lowercased), but for Writing -- review & editing the valid URI is https://credit.niso.org/contributor-roles/writing-review-editing/. That is, in some cases the term only needs to be lowercased, but in others it also involves replacing spaces with hyphens, removing the “&”, and handling the --.
- We have not implemented the degree-contribution attribute, which I think is an interesting addition.
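
The term-to-URI relationship described in the second point can be sketched as a small normalization function. This is only an illustration under the assumption that the rule is exactly the one stated above (lowercase, drop the dash and the “&”, join words with hyphens); `credit_slug` and `credit_uri` are hypothetical names, and a fixed lookup table of the 14 terms would be a safer choice than deriving the URI.

```python
import re


def credit_slug(term):
    """Derive the URI slug from an English CRediT term (assumed rule)."""
    slug = term.lower()
    # Treat an en dash, a literal "--", and "&" as separators, like whitespace.
    slug = re.sub(r"[\u2013&]|--", " ", slug)
    # Collapse each run of whitespace into a single hyphen.
    return "-".join(slug.split())


def credit_uri(term):
    """Build the full contributor-role URI from an English CRediT term."""
    return f"https://credit.niso.org/contributor-roles/{credit_slug(term)}/"


uri_simple = credit_uri("Investigation")
uri_complex = credit_uri("Writing \u2013 review & editing")
```

Both the en dash form and the "--" form of the writing roles normalize to the same slug here, which is why simple lowercasing alone is not enough.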

What do you think of these observations and considerations?

cthoyt commented 1 month ago

@estedeahora that's great! I agree this is effectively the same solution. The keys don't make a difference; however, I do prefer fully readable keys over abbreviations.

I think it's a mistake to curate duplicate information inside the metadata, such as the labels. The only important information in linked open data is local unique identifiers from well-defined semantic spaces. I would much prefer to have the backend look up the labels (especially since there are just 14) than to make users write the same thing in 3 different ways. Note that the URI can also be created using the elem field from your example.

In JOSS, I recently learned how to use some Lua filters to do post-processing of data. If someone can point me to how I can post-process data in Pandoc, I can accomplish the same here.

estedeahora commented 1 month ago

I completely agree with what you say about duplication of data in different Markdown fields. Ideally there should be only a type and a degree field in the Markdown (and maybe an optional content field, see below). However, in order to make this work with Pandoc, the linking between URI and term would have to be done in Haskell. I don't know how to program in Haskell (I tried and haven't succeeded yet). Until I can, I'm not sure whether those who know how to do it would be interested. I want to add that if this were implemented in Haskell, both fields (uri and elem) should be generated taking into account the JATS4R recommendation on this topic. In this sense, it would be important to respect the dictionary of terms provided there.

Even if this is done in Haskell, then the template should still have both fields. What Haskell would do "under the hood" is generate a new field from an original one. In your example you do this when you call this table credit_lookup[it.type] from type. In this sense, regardless of how this is solved, the final template will have two different fields (one to fill the vocab-term-identifier attribute and another to fill the vocab-term attribute).

Obviously all this can be solved with a filter, but filters are Pandoc extensions and therefore cannot be 'assumed to be present' in the Pandoc template (unless the Pandoc code is modified in Haskell). If this change were not made from Haskell, we could not assume that "something" (a filter) will make the new field from the original. In the meantime the duplicate field should be allowed to ensure that it works.

For illustration purposes, the workaround used in our work was a custom template that works together with a Lua filter. We actually do something a bit more complex in this filter, as we read the roles from a CSV file (originally an Excel file sent by the authors) and then assign the uri, cont and elem fields. In this sense, our proposal does not require writing the duplicate fields by hand, but the filter still has to provide these fields separately so that the template can use them.

Another aspect about which I do have some reservations is the content of the <role> tag (in my example it is inside cont). If its generation were completely automated, the possibility of using it in different languages would be lost. I think the issue of different languages is very important and speaks to the universality of Pandoc. I think the ideal solution, if this were implemented in Haskell, would be to make the label optional and have Haskell generate it only if it is not present (in which case it could use the English names).
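
The optional-label behaviour suggested above could look roughly like this. It is a sketch in Python rather than Haskell, purely to illustrate the fallback logic; `render_role` and its parameters are hypothetical and not part of Pandoc.

```python
from xml.sax.saxutils import escape, quoteattr


def render_role(uri, term, label=None, degree=None):
    """Render one JATS <role> element for a CRediT role.

    `label` is the optional, possibly localized text content; when it is
    missing, the English term is used instead. `degree` is optional too.
    """
    attrs = [
        ("vocab", "credit"),
        ("vocab-identifier", "https://credit.niso.org/"),
        ("vocab-term", term),
        ("vocab-term-identifier", uri),
    ]
    if degree is not None:
        attrs.append(("degree-contribution", degree))
    # quoteattr adds the surrounding quotes and escapes &, < and >.
    attr_text = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)
    content = label if label is not None else term
    return f"<role {attr_text}>{escape(content)}</role>"


localized = render_role(
    "https://credit.niso.org/contributor-roles/conceptualization/",
    "Conceptualization",
    label="Conceptualización",
)
fallback = render_role(
    "https://credit.niso.org/contributor-roles/supervision/",
    "Supervision",
    degree="lead",
)
```

The first call keeps the Spanish label supplied in the metadata, while the second has no label and therefore falls back to the English term as the element content.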

As for the names of the fields, I have no preference; I'm fine with the names you proposed. I only kept ours in the previous comment so that the example would be understandable.

jgm / pandoc

Add support for CRediT roles in JATS export #10152