Open cthoyt opened 2 months ago
What sort of markdown metadata structure would you be envisioning?
@jgm thanks for the quick reply. I think inside the `author` block for document metadata, we need a `roles` list. Each role in the list is a dictionary with two keys:
- `type`, which has the actual CRediT identifier
- `degree`, which says `Lead`, `Equal`, or `Supporting` (as prescribed by JATS)

The JATS metadata is a bit strange since it requires both the identifier from CRediT and also the written-out string, but I was hoping to use some kind of preprocessing script to inject the names so you only have to write in the identifiers.
E.g.:
authors:
  - name: John Doe
    affiliation: [ 1 ]
    roles:
      - type: 'formal-analysis'
        degree: 'lead'
  - name: John Boss
    affiliation: [ 1 ]
    roles:
      - type: 'funding-acquisition'
        degree: 'lead'
      - type: 'supervision'
        degree: 'lead'
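The preprocessing step mentioned above could look roughly like the following sketch. It is not part of pandoc; `CREDIT_LABELS`, `CREDIT_BASE_URI`, and `expand_roles` are hypothetical names, the output keys `label` and `uri` are assumptions, and the labels are the 14 English CRediT role names (written with an en dash, `\u2013`, in the two "Writing" roles). The input mirrors the YAML example once it has been parsed into Python dictionaries.

```python
# Assumed lookup table: CRediT identifier -> English label (14 roles).
CREDIT_LABELS = {
    "conceptualization": "Conceptualization",
    "data-curation": "Data curation",
    "formal-analysis": "Formal analysis",
    "funding-acquisition": "Funding acquisition",
    "investigation": "Investigation",
    "methodology": "Methodology",
    "project-administration": "Project administration",
    "resources": "Resources",
    "software": "Software",
    "supervision": "Supervision",
    "validation": "Validation",
    "visualization": "Visualization",
    "writing-original-draft": "Writing \u2013 original draft",
    "writing-review-editing": "Writing \u2013 review & editing",
}
CREDIT_BASE_URI = "https://credit.niso.org/contributor-roles/"


def expand_roles(authors):
    """Return a copy of `authors` where every role also carries `label` and `uri`."""
    expanded = []
    for author in authors:
        roles = []
        for role in author.get("roles", []):
            role_id = role["type"]
            if role_id not in CREDIT_LABELS:
                raise ValueError(f"unknown CRediT role: {role_id!r}")
            # Keep the user-supplied keys (type, degree) and add the derived ones.
            roles.append({
                **role,
                "label": CREDIT_LABELS[role_id],
                "uri": f"{CREDIT_BASE_URI}{role_id}/",
            })
        expanded.append({**author, "roles": roles})
    return expanded


# The metadata from the example above, as it would look after YAML parsing.
authors = [
    {"name": "John Doe", "affiliation": [1],
     "roles": [{"type": "formal-analysis", "degree": "lead"}]},
    {"name": "John Boss", "affiliation": [1],
     "roles": [{"type": "funding-acquisition", "degree": "lead"},
               {"type": "supervision", "degree": "lead"}]},
]
expanded = expand_roles(authors)
```

With this approach the author only writes the identifier and degree; the label and URI are derived in one place, so they cannot drift apart.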
I think incorporating CRediT roles in the JATS template could be an interesting idea. In the project ~!guri_ we have proposed a similar modification of the template. Our template version looks like this:
$if(author.credit)$
$for(author.credit)$
<role vocab="credit" vocab-identifier="https://credit.niso.org/" vocab-term="$it.elem$" vocab-term-identifier="$it.uri$">$it.cont$</role>
$endfor$
$endif$
Broadly speaking, the basic approach to implementing it is similar. Here is an example of how the Markdown is generated for this template:
author:
  - affiliation:
      - aff1
    credit:
      - cont: Conceptualización
        elem: Conceptualization
        uri: "https://credit.niso.org/contributor-roles/conceptualization/"
      - cont: Curación de datos
        elem: Data curation
        uri: "https://credit.niso.org/contributor-roles/data-curation/"
There are, however, some differences that I find interesting to highlight and that could perhaps be evaluated before incorporating:
- The `<role>` tag of JATS XML requires textual content. This content may vary depending on the language of the journal. In this sense, I think it is better to keep this field available in the Markdown inside `role` (in our example it is the `cont` field).
- The `vocab-term` attribute (for which the commit uses `type`; we used `elem`) and the `vocab-term-identifier` attribute (which in the commit is completed by adding `type` to the base URI): while both attributes are related, this relationship is not as straightforward as presented. For example, for `Investigation` the valid URI is https://credit.niso.org/contributor-roles/investigation/ (i.e., it is just changed to lowercase), but for `Writing -- review & editing` the valid URI is https://credit.niso.org/contributor-roles/writing-review-editing/. That is, in some cases it is just a matter of lowercasing everything, but in others it involves replacing spaces with hyphens, removing "&", and handling the `--`.
- The `degree-contribution` attribute, which I think is an interesting addition.

What do you think of these observations and considerations?
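The normalization described in the second point (lowercasing, turning spaces into hyphens, dropping "&", and collapsing the `--`) can be sketched as below. `credit_slug` and `credit_uri` are hypothetical helper names, and the rule has only been checked against the examples quoted in this thread; a fixed lookup table of the 14 roles may well be the safer choice.

```python
import re

CREDIT_BASE_URI = "https://credit.niso.org/contributor-roles/"


def credit_slug(term):
    """Turn a CRediT term such as 'Data curation' into its URI slug."""
    # Every run of characters other than a-z or 0-9 (spaces, '&', '--',
    # en dashes) collapses into one hyphen; stray edge hyphens are trimmed.
    return re.sub(r"[^a-z0-9]+", "-", term.lower()).strip("-")


def credit_uri(term):
    """Build the full role URI from a CRediT term (assumed URI pattern)."""
    return f"{CREDIT_BASE_URI}{credit_slug(term)}/"
```

For example, `credit_uri("Writing -- review & editing")` yields `https://credit.niso.org/contributor-roles/writing-review-editing/`, matching the URI quoted above.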
@estedeahora that's great! I agree this is effectively the same solution. The keys don't make a difference; however, I do prefer to have fully readable keys rather than abbreviations.
I think it's a mistake to curate duplicate information inside the metadata, such as the labels. The only important information in linked open data is local unique identifiers from well-defined semantic spaces. I would much prefer to have the backend look up the labels (especially since there are just 14) than to make users write each role in three different ways. Note that the URI can also be created using the `elem` field from your example.
In JOSS, I recently learned how to use some Lua filters to do post-processing of data. If someone can point me to how I can post-process data in Pandoc, I can accomplish the same here.
I completely agree with what you say about duplication of data in different Markdown fields. Ideally there should be only a `type` and a `degree` field in the Markdown (and maybe an optional `content` field, see below). However, in order to make this work with Pandoc, the linking between URI and term should be done in Haskell. I don't know how to program in Haskell (I tried and haven't succeeded yet), and as long as I can't do it myself, I'm not sure whether those who know how are interested. I want to add that if this were implemented in Haskell, the generation of both fields (`uri` and `elem`) should take into account the JATS4R recommendation on this topic. In this sense, it would be important to respect the dictionary of terms provided there.
Even if this is done in Haskell, the template should still have both fields. What Haskell would do "under the hood" is generate a new field from an original one. In your example you do this when you look up the table entry `credit_lookup[it.type]` from `type`. In this sense, regardless of how this is solved, the final template will have two different fields (one to fill the `vocab-term-identifier` attribute and another to fill the `vocab-term` attribute).
Obviously all this can be solved with a filter, but filters are Pandoc extensions and therefore cannot be 'assumed to be present' in the Pandoc template (unless the Pandoc code is modified in Haskell). If this change were not made from Haskell, we could not assume that "something" (a filter) will make the new field from the original. In the meantime the duplicate field should be allowed to ensure that it works.
For illustration purposes, the workaround used in our work was to have a custom template that works with a Lua filter. We actually do something a bit more complex in this filter, as we read the roles from a CSV (which is originally an Excel file sent by the authors) and then assign the `uri`, `cont` and `elem` fields. In this sense, our proposal does not manually incorporate the duplicate fields, but as a result the filter has to provide these fields separately so that the template can use them.
Another aspect about which I do have some reservations is the content of the `<role>` tag (in my example it is inside `cont`). If its generation were completely automated, the possibility of using it in different languages would be lost. I think the issue of different languages is very important and speaks to the universality of Pandoc. I think the ideal solution, if this were implemented in Haskell, would be to make the label optional and have Haskell generate it only if it is not present (in that case it could use the English names).
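The optional-label idea can be illustrated with a short sketch, written in Python here rather than Haskell purely for illustration. `render_role` and `ENGLISH_LABELS` are hypothetical names, the `content` key is the optional field suggested earlier in this thread, and only three of the 14 labels are listed to keep the example short. The attribute names follow the template shown above, plus `degree-contribution`.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed subset of the English CRediT labels; a real table would list all 14.
ENGLISH_LABELS = {
    "conceptualization": "Conceptualization",
    "data-curation": "Data curation",
    "formal-analysis": "Formal analysis",
}
CREDIT_BASE_URI = "https://credit.niso.org/contributor-roles/"


def render_role(role):
    """Render one JATS <role> element from a role dictionary."""
    term = ENGLISH_LABELS[role["type"]]
    # A user-supplied (possibly translated) label wins; otherwise fall back
    # to the English term.
    text = role.get("content") or term
    attrs = {
        "vocab": "credit",
        "vocab-identifier": "https://credit.niso.org/",
        "vocab-term": term,
        "vocab-term-identifier": f"{CREDIT_BASE_URI}{role['type']}/",
    }
    if "degree" in role:
        attrs["degree-contribution"] = role["degree"]
    attr_text = " ".join(f"{key}={quoteattr(value)}" for key, value in attrs.items())
    return f"<role {attr_text}>{escape(text)}</role>"
```

Calling `render_role({"type": "conceptualization", "content": "Conceptualización"})` keeps the Spanish text inside the element, while omitting `content` produces the English label, so `vocab-term` stays in English either way.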
As for the names of the fields, they make no difference to me; I'm fine with the names you proposed. I only kept mine in the previous comment so that the example would be understandable.
Here's some JATS XML that has author roles in it. It would be cool if pandoc supported this!
This is the template where this would need to get handled:
https://github.com/jgm/pandoc/blob/180d2b5a025362f7fdf20c68a8bc595e0938cb86/data/templates/article.jats_publishing#L92-L129
This is where this needs to get documented:
https://github.com/jgm/pandoc/blob/180d2b5a025362f7fdf20c68a8bc595e0938cb86/doc/jats.md?plain=1#L15
I am happy to try getting this working myself!