kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Identification of affiliations located at the end of a document and explicit identification of reviewers #117

Open AlainMonteil opened 7 years ago

AlainMonteil commented 7 years ago

Some documents list the authors' affiliations at the end; this is notably the case for BMC articles. Here is an example: http://biologydirect.biomedcentral.com/track/pdf/10.1186/s13062-016-0143-4?site=biologydirect.biomedcentral.com At the end of the PDF there is an "Author details" paragraph. Handling this would help with deposits in HAL.

kermitt2 commented 7 years ago

Absolutely... I think it is mainly a matter of improving the training data for the segmentation model (and possibly too the header model), so that the affiliation block at the end of the document is correctly attached to the header segments.

kermitt2 commented 4 years ago

After 4 years of hard work (10 hours per day) just on this issue, it is now working, Alain! ;)

Grobid is so enthusiastic that it even finds and adds the reviewers. The next step is to add a @role="reviewer"; otherwise they just look like normal authors.

                    <analytic>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Tommaso</forename>
                                <surname>Lorenzi</surname>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="department">School of Mathematics and Statistics</orgName>
                                <orgName type="institution" key="instit1">University of St Andrews</orgName>
                                <orgName type="institution" key="instit2">North Haugh</orgName>
                                <address>
                                    <postCode>KY16 9SS</postCode>
                                    <settlement>St Andrews</settlement>
                                    <country key="GB">UK</country>
                                </address>
                            </affiliation>
                            <affiliation key="aff1">
                                <orgName type="department">School of Mathematics and Statistics</orgName>
                                <orgName type="institution">University of St Andrews</orgName>
                                <address>
                                    <addrLine>North Haugh</addrLine>
                                    <postCode>KY16 9SS</postCode>
                                    <settlement>St Andrews</settlement>
                                    <country key="GB">UK</country>
                                </address>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Rebecca</forename>
                                <forename type="middle">H</forename>
                                <surname>Chisholm</surname>
                            </persName>
                            <affiliation key="aff2">
                                <orgName type="department">School of Biotechnology and Biomolecular Sciences</orgName>
                                <orgName type="institution">University of New</orgName>
                                <address>
                                    <addrLine>South Wales, NSW</addrLine>
                                    <postCode>2052</postCode>
                                    <settlement>Sydney</settlement>
                                    <country key="AU">Australia</country>
                                </address>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Jean</forename>
                                <surname>Clairambault</surname>
                            </persName>
                            <affiliation key="aff3">
                                <orgName type="institution">INRIA</orgName>
                                <address>
                                    <addrLine>Paris Research Centre, MAMBA team, 2, rue Simone Iff, Paris Cedex 12</addrLine>
                                    <postCode>42112, 75589</postCode>
                                    <region>CS</region>
                                    <country key="FR">France</country>
                                </address>
                            </affiliation>
                            <affiliation key="aff4">
                                <orgName type="laboratory">UMR 7598</orgName>
                                <orgName type="institution" key="instit1">Sorbonne Universités</orgName>
                                <orgName type="institution" key="instit2">UPMC Univ</orgName>
                                <address>
                                    <addrLine>Paris 6</addrLine>
                                </address>
                            </affiliation>
                            <affiliation key="aff5">
                                <orgName type="laboratory">Laboratoire Jacques-Louis Lions</orgName>
                                <address>
                                    <addrLine>Boîte courrier 187, 4 Place Jussieu, Paris Cedex 05</addrLine>
                                    <postCode>75252</postCode>
                                    <country key="FR">France</country>
                                </address>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Angela</forename>
                                <surname>Pisco</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Sébastien</forename>
                                <surname>Benzekry</surname>
                            </persName>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Heiko</forename>
                                <surname>Enderling</surname>
                            </persName>
                        </author>
                        <title level="a" type="main">Tracking the evolution of cancer cell populations through the mathematical lens of phenotype-structured equations</title>
                    </analytic>
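
To illustrate the "next step" mentioned above, here is a minimal sketch of a post-processing pass over output shaped like this. It rests on an assumption that is not Grobid's method: in the output above the reviewers happen to be the authors without any `<affiliation>`, so the hypothetical helper `mark_candidate_reviewers` flags exactly those authors with `role="reviewer"`. The sample is an abbreviated copy of two authors from the output, and the heuristic would mislabel genuine authors whose affiliation was simply missed.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample shaped like the output above (two of the six authors).
SAMPLE = """
<analytic>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">Tommaso</forename>
      <surname>Lorenzi</surname>
    </persName>
    <affiliation key="aff0">
      <orgName type="institution">University of St Andrews</orgName>
    </affiliation>
  </author>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">Angela</forename>
      <surname>Pisco</surname>
    </persName>
  </author>
</analytic>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix so namespaced and bare tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def mark_candidate_reviewers(analytic):
    """Hypothetical heuristic (an assumption, not Grobid's logic):
    set role="reviewer" on every <author> that has no <affiliation> child
    and return the surnames of the authors that were flagged."""
    flagged = []
    for author in analytic.iter():
        if local_name(author.tag) != "author":
            continue
        if any(local_name(child.tag) == "affiliation" for child in author):
            continue
        author.set("role", "reviewer")
        surname = next(
            (el.text for el in author.iter() if local_name(el.tag) == "surname"),
            None,
        )
        flagged.append(surname)
    return flagged


root = ET.fromstring(SAMPLE.strip())
flagged = mark_candidate_reviewers(root)
print(flagged)  # prints ['Pisco']
```

Matching on local tag names sidesteps the fact that, in the output above, the TEI namespace is declared on `<persName>` rather than on `<author>`.
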
AlainMonteil commented 4 years ago

That's great! :-) And yes, the next step is to recognize the reviewer role and, of course, the revision text in the full-text TEI!