Closed xegulon closed 2 years ago
Hi @xegulon
Thank you for using Grobid and suggesting new features!
It appears that there is already an undocumented option (generateIDs) in the service to generate IDs.
curl --form input=@/home/lopez/test/in/Briere_Plant_Cell_Physiol_2003.pdf --form generateIDs=1 --form segmentSentences=1 localhost:8070/api/processFulltextDocument
<p xml:id="_xD8Yz2m"><s xml:id="_h5pmvVG">Seeds of sunflower (Helianthus annuus L., genotype EMIL Pioneer France Maïs) were sterilized in 5% calcium hypochloride for 20 min, then rinsed in sterile water and cultivated on MS medium <ref type="bibr" target="#b25">(Murashige and Skoog, 1962)</ref> at 25°C under a 16-hour photoperiod (light intensity 25µE.m -2 .s -</s><s xml:id="_ejqg8p9"> ).</s><s xml:id="_Vg8D6w2">Protoplasts were isolated from 8-day old plantlet hypocotyls according to the protocol of <ref type="bibr" target="#b6">Chanabé et al. (1989)</ref>.</s><s xml:id="_9wQwaAE">After purification, protoplasts were embedded at a final density of 1.5 10 5 .mL -</s><s xml:id="_yJzNH92"> in TLD medium <ref type="bibr" target="#b7">(Chanabé et al., 1991)</ref> containing 0.5% Sea Plaque agarose (FMC Bioproduct, Rockland, USA), and before solidification small drops (40 µl) of this mix were spread on poly-L-lysine (Sigma)-coated coverglasses.</s><s xml:id="_nDTugmx">After solidification of the drops, embedded protoplasts were submerged in 1mL of TLD medium and cultured in the dark at 25°C.</s></p>
The _ prefix ensures that the xml:id values are compliant (an xml:id must be a valid XML name, which cannot start with a digit).
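For readers who want to consume this output, here is a minimal Python sketch of how the sentence IDs and text could be pulled out of a TEI result. It is an illustration, not part of Grobid: the function name `extract_sentences` is hypothetical, and it matches `<s>` elements by local name so that it works whether or not the TEI namespace is declared.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared xml: prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def extract_sentences(tei_xml: str):
    """Return (xml:id, text) pairs for every <s> element (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    rows = []
    for element in root.iter():
        if local_name(element.tag) == "s":
            # itertext() also collects text nested in children such as <ref>.
            text = "".join(element.itertext()).strip()
            rows.append((element.get(XML_ID), text))
    return rows
```

Sentences without an xml:id (for example, if generateIDs was not set) come back with `None` as their ID.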
Why is it undocumented? I don't remember.
Great! Are these IDs generated in a way to be unique across all documents?
Yes, these are random keys, decent enough to be safe within the document:
Great, then I should close this now. Thanks!
Feature request
Hi,
I have a use case where I need Grobid to process thousands of PDFs (with the sentence segmentation option activated), extract the sentences from each PDF, and put all the sentences in a CSV that I would load as a Spark DataFrame. I would then process the DataFrame (perform several sentence classification tasks) and, once each sentence is processed, link the results back to the processed PDFs.
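To make the linking step concrete, here is a rough Python sketch using only the standard library. It assumes that each row is keyed by a document name plus a sentence ID, so that the pair stays unique across the batch even if IDs are only unique within one document. The function names, column names, and labels are hypothetical, and loading into Spark is left out.

```python
import csv
import io


def write_sentence_csv(docs, out):
    """Write one CSV row per sentence.

    `docs` maps a document name to a list of (sentence_id, text) pairs.
    """
    writer = csv.writer(out)
    writer.writerow(["doc", "sentence_id", "text"])
    for doc, sentences in docs.items():
        for sentence_id, text in sentences:
            writer.writerow([doc, sentence_id, text])


def link_results(results):
    """Group (doc, sentence_id, label) result rows back by document."""
    linked = {}
    for doc, sentence_id, label in results:
        linked.setdefault(doc, {})[sentence_id] = label
    return linked


if __name__ == "__main__":
    buffer = io.StringIO()
    write_sentence_csv({"a.pdf": [("_h5pmvVG", "Seeds were sterilized.")]}, buffer)
    print(buffer.getvalue())
    print(link_results([("a.pdf", "_h5pmvVG", "METHOD")]))
```

Keeping the document name next to the sentence ID means the classification results can be joined back to the right TEI file without requiring globally unique IDs.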
So the feature I would like to have is an option, -addIDs, that would add IDs to paragraphs and sentences in the output TEI files. IDs would be unique across all the documents in the processed batch.
Example of what I want
Current Grobid TEI:
After the -addIDs option is activated:
One should be able to set the length of the IDs, choose the ID generator function, etc.