Closed giorgiosironi closed 5 years ago
Subject areas are defined in JATS inside the article-categories element, using subj-group and subject elements.
IJM uses the same model as eLife here, for example:
<article-categories>
<subj-group subj-group-type="display-channel">
<subject>Research article</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Methodology</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Dynamic microsimulation</subject>
</subj-group>
</article-categories>
Hindawi content (going from the samples that we have) only defines the article type (rather than any MSA/research subject):
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
</article-categories>
bioRxiv example:
<article-categories>
<subj-group subj-group-type="author-type">
<subject>Regular Article</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>New Results</subject>
</subj-group>
<subj-group subj-group-type="hwp-journal-coll">
<subject>Bioinformatics</subject>
</subj-group>
</article-categories>
@giorgiosironi
Difficulty might be that the value of the subj-group-type attribute may change across journals (there is no default).
Going to look into how else this is used in content from the PubMed archive...
To clarify what I mean by "Difficulty might be that the value of the subj-group-type attribute may change across journals (there is no default)":
There are essentially two categories usually defined in JATS in the article-categories element: the article type and a subject area (if there is one).
If we take IJM and Hindawi as examples, subj-group[@subj-group-type="heading"] has been used to define each: in IJM it states a subject area, and in Hindawi it defines a type of article.
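To make the ambiguity concrete, here is a minimal sketch (not anything Libero or Continuum currently does) that groups subject values by subj-group-type for the IJM and Hindawi samples above. The function name subjects_by_group_type is made up for illustration, and the sketch only looks at top-level subj-group elements.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Samples copied from the IJM and Hindawi snippets earlier in this thread.
IJM_SAMPLE = """<article-categories>
  <subj-group subj-group-type="display-channel">
    <subject>Research article</subject>
  </subj-group>
  <subj-group subj-group-type="heading">
    <subject>Methodology</subject>
  </subj-group>
  <subj-group subj-group-type="heading">
    <subject>Dynamic microsimulation</subject>
  </subj-group>
</article-categories>"""

HINDAWI_SAMPLE = """<article-categories>
  <subj-group subj-group-type="heading">
    <subject>Research Article</subject>
  </subj-group>
</article-categories>"""


def subjects_by_group_type(xml_text):
    """Map each top-level subj-group-type value to its direct subject texts.

    Hypothetical helper: nested subj-group elements are ignored here.
    """
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for group in root.findall("subj-group"):
        group_type = group.get("subj-group-type")  # None if the attribute is absent
        for subject in group.findall("subject"):
            grouped[group_type].append((subject.text or "").strip())
    return dict(grouped)


ijm = subjects_by_group_type(IJM_SAMPLE)
hindawi = subjects_by_group_type(HINDAWI_SAMPLE)
print(ijm["heading"])      # subject areas in IJM
print(hindawi["heading"])  # an article type in Hindawi
```

The same "heading" key yields subject areas for one publisher and an article type for the other, so any data model would presumably need a per-publisher setting saying which subj-group-type means what.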
- Research Article type is also in the scope of this, whereas Neurosciences type is
- one of the purposes of journal-cms /subjects is to assign non-XML content like blog articles to those as well
- subj-group-type="heading" is not consistent across publishers
- big question: can we assume there is a mapping from XML values (Cancer Biology) to MSA ids (cancer-biology)?
Question: is the XML always changed to do a subject change? Assumption: if we can change the XML of articles (including the typesetter side) that undergo a subject change in a reasonably fast way, we could use it as the source of truth.
We may model this in Producer then, as the XML that would come out of it would have the new subject and would be ingested in a silent correction or new version. We confirm this happens in all versions (v1, v2).
Are there any additional categories added to an article that are not in an XML? In the eLife case no.
- big question: can we assume there is a mapping from XML values (Cancer Biology) to MSA ids (cancer-biology)?
Worth mentioning that JATS 1.2 added attributes for such IDs.
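As a rough illustration of what such a mapping could look like, here is a sketch that slugifies the subject label and falls back to an explicit override table. Everything here is an assumption: the function name subject_to_id, the OVERRIDES table, and the label "Computational and Systems Biology" (only the id computational-systems-biology appears in this thread, via the eLife subject URL). If that label is right, a naive slug would not reproduce the id, which suggests a pure string transformation is not enough on its own.

```python
import re

# Hypothetical override table for labels whose id is not a plain slug.
# The label is an assumption; the id comes from the eLife subject URL in this thread.
OVERRIDES = {
    "Computational and Systems Biology": "computational-systems-biology",
}


def subject_to_id(label, overrides=OVERRIDES):
    """Turn a subject label from the XML into an id-like slug (illustrative only)."""
    normalised = " ".join(label.split())  # collapse whitespace and line breaks
    if normalised in overrides:
        return overrides[normalised]
    # Lower-case, then replace every run of non-alphanumerics with a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", normalised.lower())
    return slug.strip("-")


print(subject_to_id("Cancer Biology"))
print(subject_to_id("Computational and Systems Biology"))
```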
- one of the purposes of journal-cms /subjects is to assign non-XML content like blog articles to those as well
Not really, it's about subjects being more than just a title.
Some Continuum examples for reference: https://elifesciences.org/subjects/computational-systems-biology https://github.com/elifesciences/elife-tools/blob/5369493925bbcc01d85a32111d428e52708c1f3c/elifetools/rawJATS.py#L153-L163 https://github.com/elifesciences/bot-lax-adaptor/blob/develop/src/main.py#L192-L199
Not really, it's about subjects being more than just a title.
Was a quick note, but "one of", the "normal form" of subjects as independent entities from the articles that use them is clear by now.
Here's an example from PLOS bio with numerous child subjects:
Note that this isn't very representative of how many publishers use these elements. From a set of ~1,000,000 articles taken from PMC, 14% (138,000) contain an article-categories element with nested subj-groups.
<article-categories>
<subj-group subj-group-type="heading">
<subject>Research Article</subject>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Animal studies</subject>
<subj-group>
<subject>Experimental organism systems</subject>
<subj-group>
<subject>Model organisms</subject>
<subj-group>
<subject>Drosophila melanogaster</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Model organisms</subject>
<subj-group>
<subject>Drosophila melanogaster</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Animal studies</subject>
<subj-group>
<subject>Experimental organism systems</subject>
<subj-group>
<subject>Animal models</subject>
<subj-group>
<subject>Drosophila melanogaster</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Organisms</subject>
<subj-group>
<subject>Eukaryota</subject>
<subj-group>
<subject>Animals</subject>
<subj-group>
<subject>Invertebrates</subject>
<subj-group>
<subject>Arthropoda</subject>
<subj-group>
<subject>Insects</subject>
<subj-group>
<subject>Drosophila</subject>
<subj-group>
<subject>Drosophila melanogaster</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Cell biology</subject>
<subj-group>
<subject>Cell processes</subject>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Epigenetics</subject>
<subj-group>
<subject>RNA interference</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Gene expression</subject>
<subj-group>
<subject>RNA interference</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Genetics</subject>
<subj-group>
<subject>Genetic interference</subject>
<subj-group>
<subject>RNA interference</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Nucleic acids</subject>
<subj-group>
<subject>RNA</subject>
<subj-group>
<subject>RNA interference</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Molecular biology</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Molecular biology assays and analysis techniques</subject>
<subj-group>
<subject>Gene expression and vector techniques</subject>
<subj-group>
<subject>Hyperexpression techniques</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Molecular biology techniques</subject>
<subj-group>
<subject>Molecular biology assays and analysis techniques</subject>
<subj-group>
<subject>Gene expression and vector techniques</subject>
<subj-group>
<subject>Hyperexpression techniques</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Proteins</subject>
<subj-group>
<subject>DNA-binding proteins</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Enzymology</subject>
<subj-group>
<subject>Enzymes</subject>
<subj-group>
<subject>Ligases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Proteins</subject>
<subj-group>
<subject>Enzymes</subject>
<subj-group>
<subject>Ligases</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Biology and life sciences</subject>
<subj-group>
<subject>Biochemistry</subject>
<subj-group>
<subject>Proteins</subject>
<subj-group>
<subject>Post-translational modification</subject>
<subj-group>
<subject>Ubiquitination</subject>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
</subj-group>
<subj-group subj-group-type="Discipline-v3">
<subject>Research and analysis methods</subject>
<subj-group>
<subject>Precipitation techniques</subject>
<subj-group>
<subject>Immunoprecipitation</subject>
</subj-group>
</subj-group>
</subj-group>
</article-categories>
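If we ever need to read this nested form, one option is to flatten each top-level subj-group into a path of subjects from root to leaf. The sketch below does that for a trimmed copy of the PLOS sample; subject_paths is a made-up name, and it assumes one subject per subj-group (if there are several, only the first is used).

```python
import xml.etree.ElementTree as ET

# Two of the Discipline-v3 groups from the PLOS sample above.
SAMPLE = """<article-categories>
  <subj-group subj-group-type="Discipline-v3">
    <subject>Biology and life sciences</subject>
    <subj-group>
      <subject>Cell biology</subject>
      <subj-group>
        <subject>Cell processes</subject>
      </subj-group>
    </subj-group>
  </subj-group>
  <subj-group subj-group-type="Discipline-v3">
    <subject>Research and analysis methods</subject>
    <subj-group>
      <subject>Precipitation techniques</subject>
      <subj-group>
        <subject>Immunoprecipitation</subject>
      </subj-group>
    </subj-group>
  </subj-group>
</article-categories>"""


def subject_paths(group, prefix=()):
    """Yield one tuple per leaf subj-group, listing subjects from root to leaf."""
    own = [(s.text or "").strip() for s in group.findall("subject")]
    # Assumption: one subject per subj-group; extra subjects are ignored.
    path = prefix + tuple(own[:1])
    children = group.findall("subj-group")
    if not children:
        yield path
    for child in children:
        yield from subject_paths(child, path)


root = ET.fromstring(SAMPLE)
paths = [p for g in root.findall("subj-group") for p in subject_paths(g)]
for p in paths:
    print(" > ".join(p))
```

In the PLOS sample the same leaf (e.g. Drosophila melanogaster) is reachable through several paths, so a consumer would have to decide whether to keep the paths or just the distinct leaves.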
Of the same 1,000,000 PMC articles, only 1% (11,044) have a subj-group containing more than one subject element. Example (in the subj-group[@subj-group-type="hwp-journal-coll"]):
<article-categories>
<subj-group subj-group-type="hwp-journal-coll">
<subject>1014</subject>
<subject>1009</subject>
<subject>1011</subject>
<subject>1053</subject>
</subj-group>
<subj-group subj-group-type="heading">
<subject>Research Articles</subject>
</subj-group>
<subj-group subj-group-type="overline">
<subject>SPECIAL ISSUE: Scaling Effects Regulating Plant Response to Global
Change</subject>
</subj-group>
</article-categories>
Of this 1%, only 4,424 contain subj-group[not(@subj-group-type="hwp-journal-coll")] elements with more than one subject, which indicates that the normal capture is for a subj-group element to have only one subject (and possibly a child subj-group, although in the majority of cases this isn't present).
Possibly noteworthy that subj-group[@subj-group-type="hwp-journal-coll"] is presumably a HighWire Press requirement.
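For anyone wanting to repeat this kind of check on their own content, here is a small sketch that flags subj-group elements holding more than one direct subject, optionally skipping hwp-journal-coll. It is not the query used for the PMC numbers above; multi_subject_groups and its exclude_types parameter are invented for illustration.

```python
import xml.etree.ElementTree as ET

# The multi-subject sample from above.
SAMPLE = """<article-categories>
  <subj-group subj-group-type="hwp-journal-coll">
    <subject>1014</subject>
    <subject>1009</subject>
    <subject>1011</subject>
    <subject>1053</subject>
  </subj-group>
  <subj-group subj-group-type="heading">
    <subject>Research Articles</subject>
  </subj-group>
  <subj-group subj-group-type="overline">
    <subject>SPECIAL ISSUE: Scaling Effects Regulating Plant Response to Global
    Change</subject>
  </subj-group>
</article-categories>"""


def multi_subject_groups(xml_text, exclude_types=("hwp-journal-coll",)):
    """Return (subj-group-type, subject count) for groups with more than one subject."""
    root = ET.fromstring(xml_text)
    flagged = []
    # iter() also visits nested subj-group elements, not just top-level ones.
    for group in root.iter("subj-group"):
        group_type = group.get("subj-group-type")
        if group_type in exclude_types:
            continue
        count = len(group.findall("subject"))  # direct children only
        if count > 1:
            flagged.append((group_type, count))
    return flagged


print(multi_subject_groups(SAMPLE))                    # hwp-journal-coll skipped
print(multi_subject_groups(SAMPLE, exclude_types=()))  # nothing skipped
```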
Example of the same subject in separate languages (provided by Érudit):
<article-categories>
<subj-group subj-group-type="heading" xml:lang="en">
<subject>Front Matter</subject>
</subj-group>
<subj-group subj-group-type="heading" xml:lang="fr">
<subject>Liminaire</subject>
</subj-group>
</article-categories>
Note @xml:lang is included on the subj-group rather than the subject.
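A quick sketch of how the language could be picked up when parsing, assuming Python's ElementTree: the xml:lang attribute is exposed under the reserved XML namespace, so it has to be read with the fully qualified name. The helper subjects_by_language is hypothetical, and it also checks the subject itself in case another publisher puts the attribute there.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The Érudit sample from above.
SAMPLE = """<article-categories>
  <subj-group subj-group-type="heading" xml:lang="en">
    <subject>Front Matter</subject>
  </subj-group>
  <subj-group subj-group-type="heading" xml:lang="fr">
    <subject>Liminaire</subject>
  </subj-group>
</article-categories>"""


def subjects_by_language(xml_text, default_lang=None):
    """Group top-level subject texts by language (illustrative helper)."""
    root = ET.fromstring(xml_text)
    result = {}
    for group in root.findall("subj-group"):
        group_lang = group.get(XML_LANG, default_lang)
        for subject in group.findall("subject"):
            # Prefer a language set on the subject, else inherit from the group.
            lang = subject.get(XML_LANG, group_lang)
            result.setdefault(lang, []).append((subject.text or "").strip())
    return result


by_lang = subjects_by_language(SAMPLE)
print(by_lang)
```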
Noting also that there's a JATS element, compound-subject, which can be used in place of subject inside subj-group.
This is not commonly used (627 of the 1,000,000 PMC articles contain it), so unsure whether we want to support it - certainly not needed for MVP (not in IJM, Hindawi or bioRxiv content that we've seen).
Here's an example of its use:
<article-categories>
<subj-group subj-group-type="heading">
<subject>Case Report</subject>
</subj-group>
<subj-group subj-group-type="topic">
<compound-subject>
<compound-subject-part content-type="code">bjrcr</compound-subject-part>
<compound-subject-part content-type="label">BJRCR</compound-subject-part>
</compound-subject>
<compound-subject>
<compound-subject-part content-type="code">gen-trct</compound-subject-part>
<compound-subject-part content-type="label">Genitourinary
tract</compound-subject-part>
</compound-subject>
<compound-subject>
<compound-subject-part content-type="code">ct</compound-subject-part>
<compound-subject-part content-type="label">CT</compound-subject-part>
</compound-subject>
</subj-group>
</article-categories>
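Should we decide to support it later, reading it is not much harder than subject. The sketch below turns each compound-subject into a dict keyed by the content-type of its parts; compound_subjects is a made-up name, and keying on content-type is an assumption based on this one sample (other publishers may use different content-type values).

```python
import xml.etree.ElementTree as ET

# The compound-subject sample from above.
SAMPLE = """<article-categories>
  <subj-group subj-group-type="heading">
    <subject>Case Report</subject>
  </subj-group>
  <subj-group subj-group-type="topic">
    <compound-subject>
      <compound-subject-part content-type="code">bjrcr</compound-subject-part>
      <compound-subject-part content-type="label">BJRCR</compound-subject-part>
    </compound-subject>
    <compound-subject>
      <compound-subject-part content-type="code">gen-trct</compound-subject-part>
      <compound-subject-part content-type="label">Genitourinary
      tract</compound-subject-part>
    </compound-subject>
    <compound-subject>
      <compound-subject-part content-type="code">ct</compound-subject-part>
      <compound-subject-part content-type="label">CT</compound-subject-part>
    </compound-subject>
  </subj-group>
</article-categories>"""


def compound_subjects(xml_text):
    """Return one dict per compound-subject, keyed by each part's content-type."""
    root = ET.fromstring(xml_text)
    found = []
    for compound in root.iter("compound-subject"):
        parts = {}
        for part in compound.findall("compound-subject-part"):
            # Collapse line breaks and indentation inside the part text.
            text = " ".join("".join(part.itertext()).split())
            parts[part.get("content-type")] = text
        found.append(parts)
    return found


topics = compound_subjects(SAMPLE)
for topic in topics:
    print(topic)
```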
The root article element may have an attribute @article-type. I include this information for completeness; I don't think we need to take it into consideration in this context, but I suppose we could (I don't know how it is currently used for Continuum, for example). This attribute has a set of suggested values (provided by the JATS committee), which we should work from if it is included here.
Example:
<article article-type="research-article">
...
</article>
I don't know how it is currently used for Continuum for example
Don't believe it's used anywhere. (Edit: don't know about the Bot.)
Thank you for all the examples, I am distilling a checklist in the XML section of https://docs.google.com/document/d/1PmLD6NEiqpjKTnjW-n1Hho_G-znp5lW3ba8PGmOHnQE/edit?ts=5d662f5c#
F1000/Wellcome/Gates Open Research use a nested subj-group element (only one level deep).
Example:
<subj-group>
<subject>Articles</subject>
<subj-group>
<subject>Bioinformatics</subject>
</subj-group>
</subj-group>
It's a bit strange: all of the parent subj-groups have a <subject>Articles</subject>, so the nesting itself doesn't seem to carry any meaning (as far as I can tell).
Should this be closed as a duplicate of https://github.com/libero/publisher/issues/256?
@giorgiosironi, yes, the only reason I haven't is because I am unable to (lack the necessary permissions).
Added you to https://github.com/orgs/libero/teams/elife-developers/members so you should have the same permissions as anyone else now.
Ace, thanks!
Problem / Motivation
Collect sample JATS XML describing possible categories of a journal, in order to inform a data model for them that can work across multiple organizations/publishers/journals.
Tasks
Clarification needed and assumptions