libero / publisher

The starting point for raising issues for Libero Publisher
MIT License

Collect XML samples for Research Categories #234

Closed giorgiosironi closed 5 years ago

giorgiosironi commented 5 years ago

Problem / Motivation

Collect sample JATS XML describing possible categories of a journal, in order to inform a data model for them that can work across multiple organizations/publishers/journals.

Tasks

Clarification needed and assumptions

fred-atherden commented 5 years ago

Subject areas are defined in JATS inside the element article-categories, using subj-group and subject elements.

@giorgiosironi

Difficulty might be that the value of the subj-group-type attribute may change across journals (there is no default).

Going to look into how else this is used in content from the PubMed archive...

fred-atherden commented 5 years ago

To clarify what I mean by

Difficulty might be that the value of the subj-group-type attribute may change across journals (there is no default).

There are essentially two categories usually defined in JATS in the article-categories element: the article type and a subject area (if there is one).

If we take IJM and Hindawi as examples, subj-group[@subj-group-type="heading"] has been used to define each: in IJM it states a subject area, and in Hindawi it defines a type of article.
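A minimal sketch of why the attribute alone cannot drive the data model, using Python's standard xml.etree.ElementTree. The two fragments and their subject values are hypothetical stand-ins, not copied from IJM or Hindawi content; the function name subjects_by_type is also an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the subject values are invented for illustration.
JOURNAL_A = """
<article-categories>
    <subj-group subj-group-type="heading">
        <subject>Dynamic microsimulation</subject>
    </subj-group>
</article-categories>
"""
JOURNAL_B = """
<article-categories>
    <subj-group subj-group-type="heading">
        <subject>Research Article</subject>
    </subj-group>
</article-categories>
"""


def subjects_by_type(xml_text):
    """Map each top-level subj-group-type to the subjects directly inside it."""
    root = ET.fromstring(xml_text)
    result = {}
    for group in root.findall("subj-group"):
        group_type = group.get("subj-group-type")  # None when the attribute is absent
        for subject in group.findall("subject"):
            # itertext() keeps text inside inline markup; split/join trims whitespace
            title = " ".join("".join(subject.itertext()).split())
            result.setdefault(group_type, []).append(title)
    return result


print(subjects_by_type(JOURNAL_A))
print(subjects_by_type(JOURNAL_B))
```

Both calls return a dict keyed by "heading", so any interpretation of that key (subject area versus article type) would have to come from per-journal configuration rather than from the XML itself.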

giorgiosironi commented 5 years ago

Question: is the XML always changed to do a subject change? Assumption: if we can change the XML of articles that undergo a subject change (including on the typesetter side) reasonably quickly, we could use it as the source of truth.

We may model this in Producer then, as the XML that would come out of it would have the new subject and would be ingested in a silent correction or new version. We confirm this happens in all versions (v1, v2).

giorgiosironi commented 5 years ago

Are there any additional categories added to an article that are not in the XML? In the eLife case, no.

thewilkybarkid commented 5 years ago
  • big question: can we assume there is a mapping from XML values (Cancer Biology) to MSA ids (cancer-biology)?

Worth mentioning that JATS 1.2 added attributes for such IDs.
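A rough sketch of how such a mapping could work, assuming the IDs meant here are the JATS 1.2 vocabulary attributes (for example vocab-term-identifier on subject). The slug fallback and the function names are hypothetical; nothing in this thread confirms that existing MSA ids are always derivable from titles this way.

```python
import re
import xml.etree.ElementTree as ET


def slugify(title):
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def subject_id(subject):
    """Prefer an explicit JATS 1.2 identifier, else derive one from the title.

    Assumption: the fallback slug matches the id used downstream.
    """
    explicit = subject.get("vocab-term-identifier")
    if explicit:
        return explicit
    return slugify("".join(subject.itertext()))


plain = ET.fromstring("<subject>Cancer Biology</subject>")
tagged = ET.fromstring(
    '<subject vocab-term-identifier="https://example.org/subjects/cancer-biology">'
    "Cancer Biology</subject>"
)
print(subject_id(plain))
print(subject_id(tagged))
```

The example.org URL is a placeholder. If the slug fallback turns out to be unreliable, an explicit lookup table per journal would be the safer choice.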

thewilkybarkid commented 5 years ago
  • one of the purposes of journal-cms /subjects is to assign non-XML content like blog articles to those as well

Not really, it's about subjects being more than just a title.

giorgiosironi commented 5 years ago

Some Continuum examples for reference:
https://elifesciences.org/subjects/computational-systems-biology
https://github.com/elifesciences/elife-tools/blob/5369493925bbcc01d85a32111d428e52708c1f3c/elifetools/rawJATS.py#L153-L163
https://github.com/elifesciences/bot-lax-adaptor/blob/develop/src/main.py#L192-L199

giorgiosironi commented 5 years ago

Not really, it's about subjects being more than just a title.

Was a quick note, but "one of", the "normal form" of subjects as independent entities from the articles that use them is clear by now.

fred-atherden commented 5 years ago

Here's an example from PLOS bio with numerous child subjects:

Note that this isn't very representative of how many publishers use these elements. From a set of ~1,000,000 articles taken from PMC, 14% (138,000) contain an article-categories element with nested subj-groups.

<article-categories>
    <subj-group subj-group-type="heading">
        <subject>Research Article</subject>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Animal studies</subject>
            <subj-group>
                <subject>Experimental organism systems</subject>
                <subj-group>
                    <subject>Model organisms</subject>
                    <subj-group>
                        <subject>Drosophila melanogaster</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Model organisms</subject>
            <subj-group>
                <subject>Drosophila melanogaster</subject>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Animal studies</subject>
            <subj-group>
                <subject>Experimental organism systems</subject>
                <subj-group>
                    <subject>Animal models</subject>
                    <subj-group>
                        <subject>Drosophila melanogaster</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Organisms</subject>
            <subj-group>
                <subject>Eukaryota</subject>
                <subj-group>
                    <subject>Animals</subject>
                    <subj-group>
                        <subject>Invertebrates</subject>
                        <subj-group>
                            <subject>Arthropoda</subject>
                            <subj-group>
                                <subject>Insects</subject>
                                <subj-group>
                                    <subject>Drosophila</subject>
                                    <subj-group>
                                        <subject>Drosophila melanogaster</subject>
                                    </subj-group>
                                </subj-group>
                            </subj-group>
                        </subj-group>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Cell biology</subject>
            <subj-group>
                <subject>Cell processes</subject>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Genetics</subject>
            <subj-group>
                <subject>Epigenetics</subject>
                <subj-group>
                    <subject>RNA interference</subject>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Genetics</subject>
            <subj-group>
                <subject>Gene expression</subject>
                <subj-group>
                    <subject>RNA interference</subject>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Genetics</subject>
            <subj-group>
                <subject>Genetic interference</subject>
                <subj-group>
                    <subject>RNA interference</subject>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Biochemistry</subject>
            <subj-group>
                <subject>Nucleic acids</subject>
                <subj-group>
                    <subject>RNA</subject>
                    <subj-group>
                        <subject>RNA interference</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Molecular biology</subject>
            <subj-group>
                <subject>Molecular biology techniques</subject>
                <subj-group>
                    <subject>Molecular biology assays and analysis techniques</subject>
                    <subj-group>
                        <subject>Gene expression and vector techniques</subject>
                        <subj-group>
                            <subject>Hyperexpression techniques</subject>
                        </subj-group>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Molecular biology techniques</subject>
            <subj-group>
                <subject>Molecular biology assays and analysis techniques</subject>
                <subj-group>
                    <subject>Gene expression and vector techniques</subject>
                    <subj-group>
                        <subject>Hyperexpression techniques</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Biochemistry</subject>
            <subj-group>
                <subject>Proteins</subject>
                <subj-group>
                    <subject>DNA-binding proteins</subject>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Biochemistry</subject>
            <subj-group>
                <subject>Enzymology</subject>
                <subj-group>
                    <subject>Enzymes</subject>
                    <subj-group>
                        <subject>Ligases</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Biochemistry</subject>
            <subj-group>
                <subject>Proteins</subject>
                <subj-group>
                    <subject>Enzymes</subject>
                    <subj-group>
                        <subject>Ligases</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Biochemistry</subject>
            <subj-group>
                <subject>Proteins</subject>
                <subj-group>
                    <subject>Post-translational modification</subject>
                    <subj-group>
                        <subject>Ubiquitination</subject>
                    </subj-group>
                </subj-group>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Precipitation techniques</subject>
            <subj-group>
                <subject>Immunoprecipitation</subject>
            </subj-group>
        </subj-group>
    </subj-group>
</article-categories>
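One way to make a hierarchy like this manageable is to flatten each nested subj-group into a path of subject titles. The sketch below uses Python's standard xml.etree.ElementTree on a two-group excerpt of the sample above; subject_paths is a hypothetical helper, not something from Continuum or Libero.

```python
import xml.etree.ElementTree as ET

# Excerpt of the PLOS sample above, shortened to two groups.
SAMPLE = """
<article-categories>
    <subj-group subj-group-type="heading">
        <subject>Research Article</subject>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Biology and life sciences</subject>
        <subj-group>
            <subject>Cell biology</subject>
            <subj-group>
                <subject>Cell processes</subject>
            </subj-group>
        </subj-group>
    </subj-group>
    <subj-group subj-group-type="Discipline-v3">
        <subject>Research and analysis methods</subject>
        <subj-group>
            <subject>Precipitation techniques</subject>
            <subj-group>
                <subject>Immunoprecipitation</subject>
            </subj-group>
        </subj-group>
    </subj-group>
</article-categories>
"""


def subject_paths(group, prefix=()):
    """Yield one tuple of subject titles for every leaf of a nested subj-group."""
    titles = tuple(
        " ".join("".join(s.itertext()).split()) for s in group.findall("subject")
    )
    path = prefix + titles
    children = group.findall("subj-group")
    if not children:
        yield path
    for child in children:
        yield from subject_paths(child, path)


root = ET.fromstring(SAMPLE)
paths = [
    path
    for group in root.findall("subj-group[@subj-group-type='Discipline-v3']")
    for path in subject_paths(group)
]
for path in paths:
    print(" > ".join(path))
```

In the full sample the same leaf (for example Drosophila melanogaster) is reachable through several paths, so a data model would need to decide whether the identity of a subject is its leaf title or its whole path.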
fred-atherden commented 5 years ago

Of the same 1,000,000 PMC articles, only 1% (11,044) have a subj-group containing more than one subject element. Example (in the subj-group[@subj-group-type="hwp-journal-coll"]):

<article-categories>
    <subj-group subj-group-type="hwp-journal-coll">
        <subject>1014</subject>
        <subject>1009</subject>
        <subject>1011</subject>
        <subject>1053</subject>
    </subj-group>
    <subj-group subj-group-type="heading">
        <subject>Research Articles</subject>
    </subj-group>
    <subj-group subj-group-type="overline">
        <subject>SPECIAL ISSUE: Scaling Effects Regulating Plant Response to Global
            Change</subject>
    </subj-group>
</article-categories>

Of this 1%, only 4,424 contain subj-group[not(@subj-group-type="hwp-journal-coll")] elements with more than one subject, which indicates that the normal capture is for a subj-group element to have only 1 subject (and a possible child subj-group - although in the majority of cases this isn't present).

Possibly noteworthy that subj-group[@subj-group-type="hwp-journal-coll"] is presumably a HighWire Press requirement.
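For anyone wanting to repeat this kind of check on their own content, here is a minimal per-article sketch with Python's standard xml.etree.ElementTree. It is not the query used to produce the PMC figures above; the function name and the ignore_types parameter are assumptions.

```python
import xml.etree.ElementTree as ET

# The HighWire-style sample from this comment, with whitespace tidied.
SAMPLE = """
<article-categories>
    <subj-group subj-group-type="hwp-journal-coll">
        <subject>1014</subject>
        <subject>1009</subject>
        <subject>1011</subject>
        <subject>1053</subject>
    </subj-group>
    <subj-group subj-group-type="heading">
        <subject>Research Articles</subject>
    </subj-group>
</article-categories>
"""


def groups_with_multiple_subjects(root, ignore_types=()):
    """Return every subj-group (at any depth) with more than one direct subject."""
    return [
        group
        for group in root.iter("subj-group")
        if group.get("subj-group-type") not in ignore_types
        and len(group.findall("subject")) > 1
    ]


root = ET.fromstring(SAMPLE)
print(len(groups_with_multiple_subjects(root)))
print(len(groups_with_multiple_subjects(root, ignore_types=("hwp-journal-coll",))))
```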

fred-atherden commented 5 years ago

Example of the same subject in separate languages (provided by Érudit):

<article-categories>
    <subj-group subj-group-type="heading" xml:lang="en">
        <subject>Front Matter</subject>
    </subj-group>
    <subj-group subj-group-type="heading" xml:lang="fr">
        <subject>Liminaire</subject>
    </subj-group>
</article-categories>

Note @xml:lang is included on the subj-group rather than the subject.
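A small sketch of reading these with Python's standard xml.etree.ElementTree, which exposes xml:lang under the fixed XML namespace. The fallback from subject to the enclosing subj-group is an assumption about how a consumer might handle both placements; headings_by_language is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """
<article-categories>
    <subj-group subj-group-type="heading" xml:lang="en">
        <subject>Front Matter</subject>
    </subj-group>
    <subj-group subj-group-type="heading" xml:lang="fr">
        <subject>Liminaire</subject>
    </subj-group>
</article-categories>
"""


def headings_by_language(root):
    """Group heading subjects by language, checking subject first, then subj-group."""
    result = {}
    for group in root.findall("subj-group[@subj-group-type='heading']"):
        for subject in group.findall("subject"):
            lang = subject.get(XML_LANG, group.get(XML_LANG))
            title = " ".join("".join(subject.itertext()).split())
            result.setdefault(lang, []).append(title)
    return result


root = ET.fromstring(SAMPLE)
print(headings_by_language(root))
```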

fred-atherden commented 5 years ago

Noting also that there's a JATS element, compound-subject which can be used in place of subject, inside subj-group.

This is not commonly used (627 of the 1,000,000 PMC articles contain it), so unsure whether we want to support it - certainly not needed for MVP (not in IJM, Hindawi or bioRxiv content that we've seen).

Here's an example of its use:

<article-categories>
    <subj-group subj-group-type="heading">
        <subject>Case Report</subject>
    </subj-group>
    <subj-group subj-group-type="topic">
        <compound-subject>
            <compound-subject-part content-type="code">bjrcr</compound-subject-part>
            <compound-subject-part content-type="label">BJRCR</compound-subject-part>
        </compound-subject>
        <compound-subject>
            <compound-subject-part content-type="code">gen-trct</compound-subject-part>
            <compound-subject-part content-type="label">Genitourinary
                tract</compound-subject-part>
        </compound-subject>
        <compound-subject>
            <compound-subject-part content-type="code">ct</compound-subject-part>
            <compound-subject-part content-type="label">CT</compound-subject-part>
        </compound-subject>
    </subj-group>
</article-categories>
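If we did decide to support it, each compound-subject could be read as a small record keyed by the content-type of its parts. A minimal sketch with Python's standard xml.etree.ElementTree follows; compound_subjects is a hypothetical helper, and keying on content-type assumes publishers use consistent values such as "code" and "label".

```python
import xml.etree.ElementTree as ET

# Two of the compound subjects from the sample above.
SAMPLE = """
<subj-group subj-group-type="topic">
    <compound-subject>
        <compound-subject-part content-type="code">bjrcr</compound-subject-part>
        <compound-subject-part content-type="label">BJRCR</compound-subject-part>
    </compound-subject>
    <compound-subject>
        <compound-subject-part content-type="code">gen-trct</compound-subject-part>
        <compound-subject-part content-type="label">Genitourinary
            tract</compound-subject-part>
    </compound-subject>
</subj-group>
"""


def compound_subjects(group):
    """Return one dict per compound-subject, keyed by each part's content-type."""
    result = []
    for compound in group.findall("compound-subject"):
        parts = {}
        for part in compound.findall("compound-subject-part"):
            # Collapse the line breaks that typesetting leaves inside labels.
            parts[part.get("content-type")] = " ".join("".join(part.itertext()).split())
        result.append(parts)
    return result


print(compound_subjects(ET.fromstring(SAMPLE)))
```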
fred-atherden commented 5 years ago

The root article element may have an @article-type attribute. I include this for completeness. I don't think we need to take it into consideration in this context, but I suppose we could (I don't know how it is currently used in Continuum, for example). The attribute has a set of suggested values (provided by the JATS committee), which we should work from if it is included here.

Example:

<article article-type="research-article">
...
</article>
thewilkybarkid commented 5 years ago

I don't know how it is currently used for Continuum for example

Don't believe it's used anywhere. (Edit: don't know about the Bot.)

giorgiosironi commented 5 years ago

Thank you for all the examples, I am distilling a checklist in the XML section of https://docs.google.com/document/d/1PmLD6NEiqpjKTnjW-n1Hho_G-znp5lW3ba8PGmOHnQE/edit?ts=5d662f5c#

fred-atherden commented 5 years ago

F1000, Wellcome Open Research and Gates Open Research use nested subj-group elements (only one level deep).

Example:

<subj-group>
    <subject>Articles</subject>
    <subj-group>
        <subject>Bioinformatics</subject>
    </subj-group>
</subj-group>

It's a bit strange - all of the parent subj-groups have a <subject>Articles</subject> - the nesting itself doesn't seem to carry any meaning (as far as I can tell).

giorgiosironi commented 5 years ago

Should this be closed as a duplicate of https://github.com/libero/publisher/issues/256?

fred-atherden commented 5 years ago

@giorgiosironi, yes, the only reason I haven't is because I am unable to (lack the necessary permissions).

giorgiosironi commented 5 years ago

Added you to https://github.com/orgs/libero/teams/elife-developers/members so you should have the same permissions as anyone else now.

fred-atherden commented 5 years ago

Ace, thanks!