kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Support nested document sections #377

Open kermitt2 opened 5 years ago

kermitt2 commented 5 years ago

Sections are currently "flat": they are not further structured or nested in the resulting XML. The numbering scheme used in the section headers is not analysed to identify hierarchical relations between the document sections.

First attempts were not considered reliable enough and were producing ill-formed TEI, so this was put on hold...

This was discussed in #366 and in at least one other issue, but I can't find it :D
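
To make the idea of analysing the numbering scheme concrete, here is a minimal sketch that nests a flat list of section headers by their dotted numbers. It is not GROBID's implementation: the `Section` class, the `nest_sections` function and the number pattern are hypothetical names chosen for illustration, and treating unnumbered headers as top level is an assumption.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern: a leading dotted number such as "2", "2.1" or "2.1.3."
NUMBER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(.*)$")


@dataclass
class Section:
    title: str
    depth: int
    children: list = field(default_factory=list)


def nest_sections(headers):
    """Build a tree from flat header strings using their dotted numbering."""
    root = Section(title="", depth=0)
    stack = [root]
    for header in headers:
        match = NUMBER_RE.match(header)
        # Assumption: a header without a leading number is treated as top level.
        depth = match.group(1).count(".") + 1 if match else 1
        node = Section(title=header.strip(), depth=depth)
        # Pop until the top of the stack is shallower than the new node.
        while stack[-1].depth >= depth:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root.children


headers = [
    "1 Introduction",
    "2 Methods",
    "2.1 Data",
    "2.1.1 Sources",
    "2.2 Models",
    "3 Results",
]
tree = nest_sections(headers)
print([section.title for section in tree])
```

Unnumbered headers carry no depth signal, so a heuristic like this would probably need to be combined with layout cues (font size, style) to be reliable.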

karatekaneen commented 4 years ago

Just a quick question for clarification: is the same true for the segmentation model? I.e., would XML like the following be considered invalid:

some text here
<listBibl>
    reference 1
    reference 2
    <note place="footnote">
        Some footnote
    </note>
    reference 3
</listBibl>

lfoppiano commented 2 months ago

Here is another example of a document where the sub-headers are flattened out. In brief:

image

This gets converted to:

<div xmlns="http://www.tei-c.org/ns/1.0"><head xml:id="_NjPUbq2">MATERIALS AND METHODS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head xml:id="_6NQnMjR">Pseudogenes and parental genes Pseudogene and parent gene annotations</head><p xml:id="_DQbKzQV"><s xml:id="_vQHXNPn" coords="10,303.00,434.72,255.00,9.60;10,303.00,445.72,255.06,9.60;10,303.00,456.73,17.80,9.60">Pseudogene annotations were obtained from GENCODE v38 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/) <ref type="bibr" coords="10,303.00,456.73,14.24,9.60" target="#b24">(25)</ref>.</s><s xml:id="_zVw5QBu" coords="10,323.86,456.73,234.14,9.60;10,303.00,467.73,101.64,9.60">We included all HAVANA annotated pseudogenes excluding polymorphic pseudogenes.</s><s xml:id="_wjpkyue" coords="10,406.43,467.73,151.57,9.60;10,303.00,478.73,254.99,9.60;10,303.00,489.73,255.00,9.60;10,303.00,500.73,255.00,9.60;10,303.00,511.63,254.99,10.91;10,303.00,522.73,255.01,9.60;10,303.00,533.63,255.00,10.91;10,303.00,544.73,255.00,9.60;10,303.00,555.63,222.38,10.91">Biotypes were clustered using the "gene_ type" column so that "IG_V_pseudogene, " "IG_C_pseudogene, " "IG_J_pseudogene," "IG_pseudogene," TR," "TR_J_pseudogene," "TR_V_pseudogene, " "transcribed_unitary_pseudogene, " "unitary_ pseudogene" = "Unitary"; "rRNA_pseudogene, " "pseudogene" = "Other"; "transcribed_unprocessed_pseudogene," "unprocessed_ pseudogene," "translated_unprocessed_pseudogene" = "Unprocessed"; "processed_pseudogene, " "transcribed_processed_pseudogene," "translated_processed_pseudogene" = "Processed.</s><s xml:id="_3AbdRmM" coords="10,525.38,555.73,32.62,9.60;10,303.00,566.74,255.03,9.60;10,303.00,577.74,202.47,9.60">" Parent genes have previously been inferred <ref type="bibr" coords="10,445.36,566.74,15.64,9.60" target="#b25">(26)</ref> and were obtained from psiCube (http://pseudogene.org/psicube/index.html).</s></p></div>

The first div has the header MATERIALS AND METHODS and an empty body, and the sub-header is shifted into the following div.
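
As a rough way to spot this pattern in the output, the sketch below lists every `<div>` whose only child is a `<head>`. It uses Python's standard `xml.etree.ElementTree`; the function name `find_header_only_divs` and the shortened sample fragment are made up for illustration and are not part of GROBID.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def find_header_only_divs(tei_fragment):
    """Return the head text of each <div> that contains only a <head>."""
    # Wrap in a dummy root so several sibling <div>s parse as one document.
    root = ET.fromstring(f"<root>{tei_fragment}</root>")
    heads = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        children = list(div)
        if len(children) == 1 and children[0].tag == f"{{{TEI_NS}}}head":
            heads.append("".join(children[0].itertext()).strip())
    return heads


# Shortened, hypothetical version of the fragment above.
sample = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    "<head>MATERIALS AND METHODS</head></div>"
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    "<head>Pseudogenes and parental genes</head>"
    "<p>Pseudogene annotations were obtained from GENCODE v38.</p></div>"
)
print(find_header_only_divs(sample))  # ['MATERIALS AND METHODS']
```

A header-only div directly followed by another div is a plausible candidate for a parent section whose sub-section was flattened, though this check alone cannot confirm the hierarchy.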

The CC document is here