kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Author roles recognized as authors #290

Open borkdude opened 6 years ago

borkdude commented 6 years ago

I'm not sure if I should post this as an issue.

I posted this document to api/processFulltextDocument and noticed that some of the authors returned were actually roles (titles) of other authors.

For example, Alona Muzikansky, M.A. comes back as two authors:

                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">Alona</forename>
                                <surname>Muzikansky</surname>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="department">Department of Anes-thesiology</orgName>
                                <orgName type="laboratory">the State Uni-versity of New York, Buffalo (S.A.); Adult Palliative Medicine</orgName>
                                <orgName type="institution" key="instit1">From Massachusetts General Hospital</orgName>
                                <orgName type="institution" key="instit2">Columbia University Medical Center</orgName>
                                <orgName type="institution" key="instit3">Yawkey 7B</orgName>
                                <address>
                                    <postCode>02114</postCode>
                                    <settlement>Boston</settlement>
                                    <region>New York (C.D.B.);, MA</region>
                                </address>
                            </affiliation>
                        </author>
                        <author>
                            <persName
                                xmlns="http://www.tei-c.org/ns/1.0">
                                <forename type="first">M</forename>
                                <forename type="middle">A</forename>
                            </persName>
                            <affiliation key="aff0">
                                <orgName type="department">Department of Anes-thesiology</orgName>
                                <orgName type="laboratory">the State Uni-versity of New York, Buffalo (S.A.); Adult Palliative Medicine</orgName>
                                <orgName type="institution" key="instit1">From Massachusetts General Hospital</orgName>
                                <orgName type="institution" key="instit2">Columbia University Medical Center</orgName>
                                <orgName type="institution" key="instit3">Yawkey 7B</orgName>
                                <address>
                                    <postCode>02114</postCode>
                                    <settlement>Boston</settlement>
                                    <region>New York (C.D.B.);, MA</region>
                                </address>
                            </affiliation>
                        </author>

kermitt2 commented 6 years ago

Hello! Thank you for reporting this, we can consider it an issue :) I think this is due to a lack of training data for this kind of personal title in the author-header model (in particular M.A. and M.P.H.; there might not even be a single example). I'll try to add more examples and update the model.
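
Until the model is retrained, a client-side filter over the returned TEI might be enough to hide these entries. The following is only a rough sketch, not part of GROBID: the `DEGREE_TITLES` list and the helper names `local`, `looks_like_title`, and `drop_title_authors` are hypothetical, and the `<analytic>` wrapper in the sample is there only to make the fragment parseable. It assumes a spurious "title author" has no `<surname>` and that its forenames, joined together, spell a known degree abbreviation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, non-exhaustive list of degree abbreviations
# (dots and spaces removed, upper-cased). Extend as needed.
DEGREE_TITLES = {"MA", "MPH", "MD", "PHD", "MSC", "BSC", "RN"}


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def looks_like_title(author):
    """True if <author> has no surname and its forenames spell a degree."""
    pers = next((c for c in author if local(c.tag) == "persName"), None)
    if pers is None:
        return False
    parts = list(pers)
    if any(local(p.tag) == "surname" for p in parts):
        return False
    letters = "".join((p.text or "") for p in parts if local(p.tag) == "forename")
    key = letters.replace(".", "").replace(" ", "").upper()
    return key in DEGREE_TITLES


def drop_title_authors(root):
    """Remove <author> elements that look like titles; return how many were removed."""
    removed = 0
    # Materialize the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if local(child.tag) == "author" and looks_like_title(child):
                parent.remove(child)
                removed += 1
    return removed


# Trimmed-down version of the two <author> entries from the report above.
sample = """<analytic>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">Alona</forename>
      <surname>Muzikansky</surname>
    </persName>
  </author>
  <author>
    <persName xmlns="http://www.tei-c.org/ns/1.0">
      <forename type="first">M</forename>
      <forename type="middle">A</forename>
    </persName>
  </author>
</analytic>"""

root = ET.fromstring(sample)
print(drop_title_authors(root), "author element(s) removed")
```

Note that this heuristic would also drop a genuine author whose name was extracted only as initials that happen to match a degree, so it is a stopgap rather than a substitute for better training data.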