kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Add option to generate identifier to each paragraph and sentence in the output TEI #835

Closed xegulon closed 2 years ago

xegulon commented 2 years ago

Feature request

Hi,

I have a use case where I need Grobid to process thousands of PDFs (with the sentence segmentation option activated), extract the sentences from the output, and put all the sentences in a CSV that I would load as a Spark DataFrame. I would then process the DataFrame (perform several sentence classification tasks) and, once each sentence is processed, link the results back to the processed PDFs.

So the feature I would like to have is an option -addIDs that would add IDs to paragraphs and sentences in the output TEI files. IDs would be unique across all the documents in the processed batch.

Example of what I want

Current Grobid TEI:

<p>
<s>Hi, I'm sentence no. 1</s> 
<s>And me is no. 2</s>
</p>

After -addIDs option activated:

<p id="v7CWseUQjFJuK4GS">
<s id="jcFauTEN3K9NWVtvQu2u">Hi, I'm sentence no. 1</s> 
<s id="K3zmpKFw2z6UYFE7GGtB">And me is no. 2</s>
</p>

One should be able to set the length of the IDs, choose the ID generator function, etc.
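To make the request concrete, here is a minimal sketch of the behaviour described above, written as a post-processing step in Python rather than as a Grobid feature. The names `add_ids`, `random_id`, `make_id` and `seen` are hypothetical, and the snippet works on the simplified, namespace-free fragment from the example, not on a full TEI file.

```python
import secrets
import string
import xml.etree.ElementTree as ET

ALPHABET = string.ascii_letters + string.digits


def random_id(length=16):
    """Return a random alphanumeric ID of the requested length."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def add_ids(xml_text, tags=("p", "s"), make_id=random_id, seen=None):
    """Add an `id` attribute to every element whose tag is in `tags`.

    `make_id` is the pluggable ID generator. `seen` is an optional set
    shared between calls so that IDs stay unique across a whole batch;
    an ID that is already in `seen` is drawn again.
    """
    seen = set() if seen is None else seen
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in tags:
            new_id = make_id()
            while new_id in seen:
                new_id = make_id()
            seen.add(new_id)
            elem.set("id", new_id)
    return ET.tostring(root, encoding="unicode")


sample = "<p><s>Hi, I'm sentence no. 1</s> <s>And me is no. 2</s></p>"
batch_ids = set()  # shared across all documents of one batch
print(add_ids(sample, make_id=lambda: random_id(20), seen=batch_ids))
```

Passing the same `seen` set for every document of a batch is one simple way to get the batch-wide uniqueness asked for here, at the cost of keeping all IDs in memory.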

kermitt2 commented 2 years ago

Hi @xegulon

Thank you for using Grobid and suggesting new features!

It appears that there is already an undocumented option (generateIDs) in the service to generate IDs.

curl --form input=@/home/lopez/test/in/Briere_Plant_Cell_Physiol_2003.pdf --form generateIDs=1 --form segmentSentences=1 localhost:8070/api/processFulltextDocument

<p xml:id="_xD8Yz2m"><s xml:id="_h5pmvVG">Seeds of sunflower (Helianthus annuus L., genotype EMIL Pioneer France Maïs) were sterilized in 5% calcium hypochloride for 20 min, then rinsed in sterile water and cultivated on MS medium <ref type="bibr" target="#b25">(Murashige and Skoog, 1962)</ref> at 25°C under a 16-hour photoperiod (light intensity 25µE.m -2 .s -</s><s xml:id="_ejqg8p9"> ).</s><s xml:id="_Vg8D6w2">Protoplasts were isolated from 8-day old plantlet hypocotyls according to the protocol of <ref type="bibr" target="#b6">Chanabé et al. (1989)</ref>.</s><s xml:id="_9wQwaAE">After purification, protoplasts were embedded at a final density of 1.5 10 5 .mL -</s><s xml:id="_yJzNH92"> in TLD medium <ref type="bibr" target="#b7">(Chanabé et al., 1991)</ref> containing 0.5% Sea Plaque agarose (FMC Bioproduct, Rockland, USA), and before solidification small drops (40 µl) of this mix were spread on poly-L-lysine (Sigma)-coated coverglasses.</s><s xml:id="_nDTugmx">After solidification of the drops, embedded protoplasts were submerged in 1mL of TLD medium and cultured in the dark at 25°C.</s></p>

The _ prefix ensures that the xml:id values are compliant: an xml:id must be a valid XML name, which cannot start with a digit.
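For the CSV step of the use case, output like the above could be flattened with a few lines of Python. This is only a sketch under stated assumptions: the elements are assumed to sit in the TEI namespace `http://www.tei-c.org/ns/1.0`, the function names and the `doc_id` column are hypothetical, and the sample document is a shortened stand-in for the real output.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # assumed namespace of the TEI output
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id


def sentence_rows(tei_text, doc_id):
    """Yield (doc_id, paragraph_id, sentence_id, text) for each <s> inside a <p>."""
    root = ET.fromstring(tei_text)
    for p in root.iter(f"{{{TEI_NS}}}p"):
        for s in p.iter(f"{{{TEI_NS}}}s"):
            # itertext() also collects text inside nested <ref> elements;
            # split/join normalises runs of whitespace.
            text = " ".join("".join(s.itertext()).split())
            yield doc_id, p.get(XML_ID), s.get(XML_ID), text


def write_csv(rows, out):
    """Write the rows, preceded by a header line, to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["doc_id", "paragraph_id", "sentence_id", "text"])
    writer.writerows(rows)


# Shortened stand-in for a real TEI document (not actual Grobid output).
sample = (
    f'<TEI xmlns="{TEI_NS}"><text><body>'
    '<p xml:id="_xD8Yz2m">'
    '<s xml:id="_Vg8D6w2">Protoplasts were isolated according to '
    '<ref type="bibr" target="#b6">Chanabé et al. (1989)</ref>.</s>'
    '<s xml:id="_nDTugmx">Embedded protoplasts were cultured in the dark at 25°C.</s>'
    '</p></body></text></TEI>'
)
buffer = io.StringIO()
write_csv(sentence_rows(sample, "Briere_Plant_Cell_Physiol_2003"), buffer)
print(buffer.getvalue())
```

Keeping a `doc_id` column next to the sentence ID means results can be joined back to a PDF on the pair (doc_id, sentence_id), whatever uniqueness guarantees the IDs themselves have.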

Why is it undocumented? I don't remember.

xegulon commented 2 years ago

Great! Are these IDs generated in a way to be unique across all documents?

kermitt2 commented 2 years ago

> Great! Are these IDs generated in a way to be unique across all documents?

Yes, these are random keys that are decent enough to be safe within the document:

https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/utilities/KeyGen.java
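How safe "decent enough" is depends on the key space and on how many keys are drawn. The sketch below applies the standard birthday-bound approximation. It assumes 7-character keys over the 62 alphanumeric characters, drawn uniformly and independently; both figures are inferred from the sample output in this thread, not checked against KeyGen.java.

```python
import math

ALPHABET_SIZE = 62  # assumption: keys use [a-zA-Z0-9]
KEY_LENGTH = 7      # assumption: 7 characters after the "_" prefix


def collision_probability(n_keys, key_space=ALPHABET_SIZE ** KEY_LENGTH):
    """Birthday-bound approximation: P(collision) ~ 1 - exp(-n(n-1) / (2N))."""
    return -math.expm1(-n_keys * (n_keys - 1) / (2 * key_space))


# One document, a large document set, and a corpus-scale number of keys.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} keys -> P(collision) ~ {collision_probability(n):.2e}")
```

Under these assumptions a collision inside a single document is very unlikely, while for millions of sentences across a corpus it is no longer negligible, so keying on the pair (document, xml:id) is the more cautious choice for cross-document joins.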

xegulon commented 2 years ago

Great, then I should close this now. Thanks!