kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

GROBID detects funding information, but funder is left out in tei xml output #1144

Closed mariadelmarq closed 2 weeks ago

mariadelmarq commented 2 months ago

Hi,

Me again, sorry! I have another potential error case where the funder is correctly identified, but the name of the funder is left out of the TEI/XML output in a Cambridge University Press article. See screenshots below:

From the PDF: [image]

Resulting TEI XML: [image]

Because we're using NLP techniques to find the funding statements in the text of the article (I acknowledge we are potentially doubling up with what GROBID is attempting to do), this makes it really hard to identify the name of the funder, and the fact that there is a funding statement. Grateful for any ideas/advice!

lfoppiano commented 2 months ago

@mariadelmarq thanks again for reporting this issue, feel free to send me the source via email.

lfoppiano commented 2 months ago

@mariadelmarq which grobid version/environment/OS are you using?

mariadelmarq commented 2 months ago

Linux OS (Gnome Classic Desktop), running GROBID via Docker with: docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0.

I installed the python client and in my script have:

from grobid_client.grobid_client import GrobidClient

# fulltext_dir (input PDFs) and grobid_path (TEI output) are defined elsewhere in the script
client = GrobidClient(config_path="./config.json")
client.process("processFulltextDocument", fulltext_dir, output=grobid_path)

lfoppiano commented 2 months ago

Thanks. I've checked, and this bug is going to be fixed in the upcoming version 0.8.1. We can leave the issue open, and I will double-check after the release.

mariadelmarq commented 2 months ago

Brilliant, thanks heaps!!!

lfoppiano commented 2 weeks ago

@mariadelmarq I wanted to double-check this. With version 0.8.1 the result seems correct:

<div type="funding">
    <div>
        <p>Funding was provided by the
            <rs type="funder">Children's Trust, Massachusetts</rs>, Grant
            <rs type="grantNumber">5014</rs>. We are grateful for the support of colleagues at the
            <rs type="affiliation">Tufts Interdisciplinary Evaluation Research Group</rs> and for the participation of the research participants.
        </p>
    </div>
</div>
<listOrg type="funding">
    <org type="funding" xml:id="_sNGGdEJ">
        <idno type="grant-number">5014</idno>
    </org>
</listOrg>

Under titleStmt we also have the funder's name:

<funder ref="#_sNGGdEJ">
    <orgName type="full">Children's Trust, Massachusetts</orgName>
</funder>

Can we consider this the correct output?

mariadelmarq commented 2 weeks ago

@lfoppiano looks perfect, thanks so much!
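
For anyone reading the 0.8.1 output downstream, the funder name and the grant number sit in two different places and are linked by the `ref` / `xml:id` pair shown above. The following is a minimal sketch of joining them with Python's standard library. It is not part of GROBID or the Python client: `extract_funding`, `SAMPLE_TEI`, and the wrapper elements around the fragments (`TEI`, `teiHeader`, `fileDesc`, `text`, `back`) are assumptions made for illustration, as is the TEI namespace on the root element.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical stand-in document: the funder, div and listOrg fragments come
# from this thread (funding paragraph shortened); the wrappers are assumed.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <funder ref="#_sNGGdEJ">
          <orgName type="full">Children's Trust, Massachusetts</orgName>
        </funder>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="funding">
        <div>
          <p>Funding was provided by the <rs type="funder">Children's Trust, Massachusetts</rs>, Grant <rs type="grantNumber">5014</rs>.</p>
        </div>
      </div>
      <listOrg type="funding">
        <org type="funding" xml:id="_sNGGdEJ">
          <idno type="grant-number">5014</idno>
        </org>
      </listOrg>
    </back>
  </text>
</TEI>"""


def extract_funding(tei_xml):
    """Return one record per <funder>, joined to its grant numbers."""
    root = ET.fromstring(tei_xml)

    # Index the funding organisations by xml:id so funders can be joined to them.
    grants_by_id = {}
    for org in root.iterfind(".//tei:listOrg[@type='funding']/tei:org", TEI_NS):
        grants_by_id[org.get(XML_ID)] = [
            idno.text
            for idno in org.iterfind("tei:idno[@type='grant-number']", TEI_NS)
        ]

    records = []
    for funder in root.iterfind(".//tei:titleStmt/tei:funder", TEI_NS):
        name = funder.findtext("tei:orgName", default="", namespaces=TEI_NS)
        # The ref attribute is "#<xml:id>"; drop the leading "#" before the lookup.
        ref = (funder.get("ref") or "").lstrip("#")
        records.append({"funder": name.strip(), "grants": grants_by_id.get(ref, [])})
    return records


if __name__ == "__main__":
    print(extract_funding(SAMPLE_TEI))
```

On the sample above this yields a single record pairing "Children's Trust, Massachusetts" with grant 5014. A funder whose `ref` matches no `org` is returned with an empty grant list rather than dropped, so a missing link stays visible to the caller.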