kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

move TEI idno identifiers under <analytics> #1193

Open lfoppiano opened 3 weeks ago

lfoppiano commented 3 weeks ago

See #1192. The same treatment is applied to any identifier: PMCID, PMID, halID, etc.

See example:

                                </address>
                            </affiliation>
                        </author>
                        <title level="a" type="main">Transgressive phenotypes from outbreeding between the Trichoderma reesei hyper producer RutC30 and a natural isolate</title>
                        <idno type="DOI">10.1128/spectrum.00441-24</idno>
                    </analytic>
                    <monogr>
                        <imprint>
                            <date type="published" when="2024-08-20">20 August 2024</date>
                        </imprint>
                    </monogr>
                    <idno type="MD5">9E9A05DAEBD10C49EB098AF73FA55CD1</idno>
                    <idno type="DOI" status="deprecatedLocation">10.1128/spectrum.00441-24</idno>
                    <note type="submission">Received 22 February 2024 Accepted 3 July 2024</note>
                </biblStruct>

coveralls commented 3 weeks ago

coverage: 40.766% (+0.01%) from 40.755% when pulling 60d7a19fa1f47eeb8f1026ebe7e0b1f3090ba4ef on bugfix/move-idno-under-analytics into be44579606f3953473119edf5e34701aad9f1a55 on master.

kermitt2 commented 3 weeks ago

Hi Luca!

Thinking about it a second time, it might be more complicated than that, and I think I understand the motivation for leaving the identifiers under <biblStruct> in the case of Grobid.

If we extract a raw DOI from a PDF, its position in normal TEI depends on the type of document we process: either under <analytic> if it is part of a monograph or journal, or under <monogr> if we have a report, standalone article, etc. As we can't know the document type for sure, it makes sense to put it under <biblStruct> in Grobid, as a relaxed rule that avoids errors.

Second, we have the HAL ID, which corresponds to a standalone document; the same goes for the arXiv ID. We cannot put them under <analytic>, but it does not make sense to put them under <monogr> either. <biblStruct> is a valid default choice that explicitly captures the fact that we can't know the level of the document.

The idea behind this is: when we don't know the level associated with an extracted ID, we explicitly put it under <biblStruct>.
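
To make the rule concrete, here is a minimal Python sketch, not Grobid code. It assumes the three candidate parents are `<analytic>`, `<monogr>` and `<biblStruct>`, and that the level is expressed with the TEI `level` values `"a"` (part of a larger work) and `"m"` (standalone), or `None` when unknown. The function name `attach_idno` and the un-namespaced elements are hypothetical simplifications.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def attach_idno(bibl: ET.Element, id_type: str, value: str,
                level: Optional[str] = None) -> ET.Element:
    """Attach an <idno> to a <biblStruct>, choosing the parent from the level.

    level "a" -> <analytic>, level "m" -> <monogr>,
    anything else (unknown level) -> <biblStruct> itself.
    """
    parent = None
    if level == "a":
        parent = bibl.find("analytic")
    elif level == "m":
        parent = bibl.find("monogr")
    # Unknown level, or the expected child is missing: fall back to <biblStruct>.
    if parent is None:
        parent = bibl
    idno = ET.SubElement(parent, "idno", {"type": id_type})
    idno.text = value
    return idno


bibl = ET.fromstring("<biblStruct><analytic/><monogr/></biblStruct>")
# Level unknown: the identifier stays directly under <biblStruct>.
attach_idno(bibl, "DOI", "10.1128/spectrum.00441-24")
print(ET.tostring(bibl, encoding="unicode"))
```

The fallback branch is the point of the sketch: a wrong guess about the level is avoided by never guessing.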

@laurentromary for comments :)

kermitt2 commented 3 weeks ago

Note to myself: check how the DOIs/IDs are positioned in the case of a consolidated header. We can have two different DOIs, one for the part and one for the "hosting" document.
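
One way to run such a check is to list every `<idno>` together with its parent element. The Python sketch below is a hypothetical helper, not part of Grobid. It uses a trimmed, un-namespaced copy of the `<biblStruct>` example from the first comment. Real Grobid output is namespaced, which is why the helper strips any `{namespace}` prefix from tag names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example from this thread (namespace omitted for brevity).
SAMPLE = """<biblStruct>
  <analytic>
    <idno type="DOI">10.1128/spectrum.00441-24</idno>
  </analytic>
  <monogr><imprint/></monogr>
  <idno type="MD5">9E9A05DAEBD10C49EB098AF73FA55CD1</idno>
  <idno type="DOI" status="deprecatedLocation">10.1128/spectrum.00441-24</idno>
</biblStruct>"""


def local_name(tag: str) -> str:
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def idno_positions(root: ET.Element):
    """Return (parent tag, idno type, idno value) tuples, grouped by parent."""
    positions = []
    for parent in root.iter():
        for child in parent:
            if local_name(child.tag) == "idno":
                positions.append((local_name(parent.tag),
                                  child.get("type"),
                                  (child.text or "").strip()))
    return positions


for parent, id_type, value in idno_positions(ET.fromstring(SAMPLE)):
    print(f"{id_type} {value} is under <{parent}>")
```

Grouping the tuples by `type` afterwards would show whether two distinct DOI values appear at different levels.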