kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Abstract lost in a HAL document #965

Open lfoppiano opened 1 year ago

lfoppiano commented 1 year ago

For this document, https://hal-univ-bourgogne.archives-ouvertes.fr/hal-00702344, the extracted abstract seems to be replaced by the description text from the HAL cover page:

<abstract>
    <div xmlns="http://www.tei-c.org/ns/1.0">
        <p>HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.</p>
        <p>L'archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d'enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.</p>
    </div>
</abstract>

kermitt2 commented 1 year ago

Thanks Luca!

There are a few examples of HAL cover pages in the segmentation model training data (4 or 5). That's not a lot, but I would expect it to be enough to work in general... I think at some point we should either revisit the segmentation model to capture more features (it's very basic right now, to be honest), or run a large-scale test on all HAL documents that have a cover page, to detect interesting error cases related to cover pages.

In general, removing the automatically added HAL cover page from HAL articles leads to good results. In this case, for example, the abstract is correct without the cover page.
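
As a rough illustration of the large-scale check suggested above, here is a minimal Python sketch that flags TEI results whose abstract contains the HAL cover page text. It is not part of GROBID: the function names `abstract_text` and `looks_like_hal_cover` are hypothetical, and the marker phrases are simply copied from the abstract quoted in this issue, so other cover page variants would not be caught.

```python
import xml.etree.ElementTree as ET

# Marker phrases copied from the cover page text quoted in this issue.
# Assumption: a hit on either phrase means the cover page leaked into the abstract.
HAL_COVER_MARKERS = (
    "HAL is a multi-disciplinary open access archive",
    "L'archive ouverte pluridisciplinaire HAL",
)


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def abstract_text(tei_xml: str) -> str:
    """Return the whitespace-normalized text of the first <abstract>, or ''."""
    root = ET.fromstring(tei_xml)
    for element in root.iter():
        # Match on the local name so both namespaced and bare tags are found.
        if local_name(element.tag) == "abstract":
            return " ".join("".join(element.itertext()).split())
    return ""


def looks_like_hal_cover(tei_xml: str) -> bool:
    """True if the extracted abstract contains a known HAL cover page phrase."""
    text = abstract_text(tei_xml)
    return any(marker in text for marker in HAL_COVER_MARKERS)


if __name__ == "__main__":
    sample = (
        "<abstract><div xmlns='http://www.tei-c.org/ns/1.0'>"
        "<p>HAL is a multi-disciplinary open access archive for the deposit "
        "and dissemination of scientific research documents.</p>"
        "</div></abstract>"
    )
    print(looks_like_hal_cover(sample))
```

Run over a batch of TEI outputs, a check like this would only list candidate documents for manual review. It says nothing about why the segmentation model kept the cover page.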