kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Feature Request: group authors #650

Open de-code opened 4 years ago

de-code commented 4 years ago

There currently doesn't seem to be support for group authors.

Example: 048991v1 (10.1101/048991), which lists several author groups.


GROBID 0.6.1 extracted them as affiliations:

<author>
    <affiliation key="aff0">
        <note type="raw_affiliation">IGAP consortium, IHGC consortium, ILAE Consortium on Complex Epilepsies, IMSGC consortium, IPDGC consortium, METASTROKE and Intracerebral Hemorrhage Studies of the International Stroke Genetics Consortium, Attention-Deficit Hyperactivity Disorder Working Group of the Psychiatric Genomics Consortium, Anorexia Nervosa Working Group of the Psychiatric Genomics Consortium, Autism Spectrum Disorders Working Group of The Psychiatric Genomics Consortium, Bipolar Disorders Working Group of the Psychiatric Genomics Consortium, Major Depressive Disorder Working Group of the Psychiatric Genomics Consortium, Obsessive Compulsive Disorder and Tourette Syndrome Working Group of the Psychiatric Genomics Consortium, Schizophrenia Working Group of the</note>
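Until the model handles this, a client could flag such affiliation strings after extraction. The following is only a rough post-processing sketch and not GROBID functionality: the keyword list, the threshold, and the function name `looks_like_group_authors` are assumptions made for illustration.

```python
import re

# Hypothetical keyword list (an assumption); tune it for your own corpus.
GROUP_KEYWORDS = re.compile(
    r"\b(consortium|working group|initiative|collaboration)\b",
    re.IGNORECASE,
)


def looks_like_group_authors(raw_affiliation: str, min_ratio: float = 0.5) -> bool:
    """Return True if at least min_ratio of the comma-separated parts
    mention one of the group keywords."""
    parts = [p.strip() for p in raw_affiliation.split(",") if p.strip()]
    if not parts:
        return False
    hits = sum(1 for p in parts if GROUP_KEYWORDS.search(p))
    return hits / len(parts) >= min_ratio


# Shortened version of the raw_affiliation text shown above.
group_like = "IGAP consortium, IHGC consortium, ILAE Consortium on Complex Epilepsies"
# Made-up ordinary affiliation, for contrast.
ordinary = "Department of Biology, Example University"

print(looks_like_group_authors(group_like))
print(looks_like_group_authors(ordinary))
```

A keyword heuristic like this will miss group names without such words and may misfire on real affiliations, so it is at best a way to surface candidates for manual review.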

Another example with group authors, where GROBID seems to have failed to extract the authors:

269571v1 (10.1101/269571)


bioRxiv XML:

<contrib contrib-type="author">
    <collab>Dyslexia Data Consortium</collab>
</contrib>

I couldn't find support for it in the code.
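For comparison, the group authors in the bioRxiv XML above can be listed with a few lines of Python. This is a minimal sketch using only the standard library. It assumes the `<contrib>` elements carry no XML namespace, and the wrapping `<contrib-group>` element and the function name `group_authors` are added here for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# The <contrib> element is from the example above; the <contrib-group>
# wrapper is an assumption so that the fragment has a single root.
JATS_SNIPPET = """
<contrib-group>
  <contrib contrib-type="author">
    <collab>Dyslexia Data Consortium</collab>
  </contrib>
</contrib-group>
"""


def group_authors(jats_xml: str) -> List[str]:
    """Collect the text of every <collab> inside an author <contrib>."""
    root = ET.fromstring(jats_xml)
    names = []
    for contrib in root.iter("contrib"):
        if contrib.get("contrib-type") != "author":
            continue
        for collab in contrib.findall("collab"):
            text = "".join(collab.itertext()).strip()
            if text:
                names.append(text)
    return names


print(group_authors(JATS_SNIPPET))
```

Such a list could serve as a reference when checking whether GROBID output contains the same group authors.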

kermitt2 commented 4 years ago

"Collaborations" have been supported in the bibliographical references for a few years (the effort was driven by HEP!). It works well, if I remember correctly, but beyond HEP collaborations there are currently almost no training examples with "consortium" to extend the coverage.

For the header, they are annotated in the new header training data as <note type="group">:

<byline>
    <docAuthor>Zuo-Teng Wang 1 , Shi-Dong Chen 2 , Wei Xu 1 , Ke-Liang Chen 2 , Hui-Fu Wang 3 , Chen-Chen Tan 3 , Mei<lb/> Cui 2 , Qiang Dong 2 , Lan Tan 1,3 , Jin-Tai Yu 2 , </docAuthor>
</byline>

<note type="group">Alzheimer&apos;s Disease Neuroimaging Initiative *</note>

and labelled as group by the sequence labelling... but they are not present in the output because there is not enough training data yet.
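To see how many group annotations the header training data already contains, the `<note type="group">` elements can be pulled out with a short script. This is a sketch under assumptions: the `<front>` wrapper, the shortened `<docAuthor>` text, and the function name `group_notes` are illustrative, and the elements are assumed to carry no XML namespace.

```python
import xml.etree.ElementTree as ET
from typing import List

# Fragment modelled on the annotation above; the <front> wrapper is an
# assumption so that the fragment has a single root element.
TRAINING_FRAGMENT = """
<front>
  <byline>
    <docAuthor>Lan Tan 1,3 , Jin-Tai Yu 2 , </docAuthor>
  </byline>
  <note type="group">Alzheimer&apos;s Disease Neuroimaging Initiative *</note>
</front>
"""


def group_notes(tei_fragment: str) -> List[str]:
    """Return the whitespace-normalised text of each <note type="group">."""
    root = ET.fromstring(tei_fragment)
    notes = []
    for note in root.iter("note"):
        if note.get("type") == "group":
            # itertext() also picks up text around inline elements such as <lb/>.
            notes.append(" ".join("".join(note.itertext()).split()))
    return notes


print(group_notes(TRAINING_FRAGMENT))
```

Running this over each training file and counting the results would give a rough idea of how much annotated group data exists so far.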