Grobid segmentation performs relatively poorly on Nature articles.

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

https://grobid.readthedocs.io

Apache License 2.0


Grobid segmentation performs relatively poorly on Nature articles. #1061

Open bowenng opened 1 year ago

bowenng commented 1 year ago

Good morning. :)

Grobid has been generally precise for PDF segmentation on research articles in formats similar to those found on arXiv.

However, segmentation consistently misses sections (i.e. head tags) and produces duplicated sections when run on Nature publications (e.g. https://www.nature.com/articles/s41597-022-01908-z).

I wonder if this is because Grobid models are not trained on the format of Nature publications, which appear to contain more visual information. Because Nature is a very well-known source of scientific literature, I wonder if there is any interest in improving parsing accuracy for Nature articles.

I am using the latest 0.7.3 release with docker.
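
To make the symptom easier to reproduce, here is a minimal sketch of how the section heads in a Grobid TEI result could be listed and checked for duplicates. It assumes the usual TEI namespace (`http://www.tei-c.org/ns/1.0`) and that section titles appear as `<head>` children of `<div>` elements under `<body>`; the function names and the sample document are hypothetical and only illustrate the check, they are not part of Grobid.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: Grobid's full-text output uses the standard TEI namespace.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def section_heads(tei_xml: str) -> list[str]:
    """Return the text of each <div>'s direct <head> child under <body>, in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    heads = []
    for div in body.iter(f"{{{TEI_URI}}}div"):
        head = div.find("tei:head", TEI_NS)  # direct child only, so figure heads are skipped
        if head is not None:
            text = " ".join("".join(head.itertext()).split())  # normalise whitespace
            if text:
                heads.append(text)
    return heads


def duplicated_heads(heads: list[str]) -> list[str]:
    """Return the head texts that occur more than once, in first-seen order."""
    counts = Counter(heads)
    return [h for h in counts if counts[h] > 1]


# Hypothetical, hand-written sample standing in for a real Grobid result.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head>Background</head><p>...</p></div>
    <div><head>Methods</head><p>...</p>
      <figure><head>Fig. 1</head></figure>
    </div>
    <div><head>Methods</head><p>...</p></div>
    <div><p>A paragraph whose heading was missed.</p></div>
  </body></text>
</TEI>"""

heads = section_heads(SAMPLE_TEI)
print(heads)
print(duplicated_heads(heads))
```

Comparing the printed list with the headings visible in the PDF shows which sections were missed, and the second list shows which ones were emitted more than once.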

kermitt2 commented 1 year ago

Hi @bowenng

Thanks for raising this issue!

Nature articles used to be under copyright without a sharing license, so it was not possible to add them to the training data. But for a few years now there have been quite a lot of CC-BY articles, so indeed we can and need to add some annotated examples to improve the results.

kermitt2 commented 1 year ago

Just as an additional comment, I tested the example PDF with the current version 0.8.0 pre-release (in demo at https://kermitt2-grobid.hf.space/) and did not see any duplicated sections. Some section headers are still overlooked (not the main ones in blue, but the ones in bold used as paragraph headings). A few annotated examples in the full text model should do the job (there is very little training data for this model, only ~40 articles, which is bad, but on the other hand adding only one or two examples already has a positive impact on the model!).

bowenng commented 12 months ago

Hi @kermitt2, thank you for your prompt response! It's good to know that 0.8.0 has improvements. I'll try to annotate some Nature articles and see if there are any improvements. I'm happy to share the training data afterwards if that is helpful to you. Thank you for maintaining this awesome project.