kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Fix heading annotation in fulltext evaluation and add header levels #1105

Open Schroedi opened 7 months ago

Schroedi commented 7 months ago

<head>Methods<lb/> Preparation<lb/></head> are actually two headings in the original paper.
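
For illustration only, here is a rough sketch of how similar cases could be flagged for manual review in other training files. It is not part of this PR or of GROBID; the function name and the heuristic (a `<head>` with more than one `<lb/>` child) are assumptions, and it also assumes the file is well-formed XML with no namespace on `<head>`. A heading that simply wraps over two lines would be flagged too, so each hit still needs a check against the PDF.

```python
import xml.etree.ElementTree as ET


def heads_with_multiple_line_breaks(tei_xml: str) -> list[str]:
    """Return the text of each <head> that has more than one <lb/> child.

    Hypothetical helper: such heads are only candidates for merged
    headings, not confirmed errors.
    """
    root = ET.fromstring(tei_xml)
    flagged = []
    for head in root.iter("head"):
        if len(head.findall("lb")) > 1:
            # itertext() yields the text before and after each <lb/>
            parts = [t.strip() for t in head.itertext() if t.strip()]
            flagged.append(" | ".join(parts))
    return flagged


# Made-up sample built around the heading quoted above
sample = (
    "<text>"
    "<head>Methods<lb/> Preparation<lb/></head>"
    "<head>Results<lb/></head>"
    "</text>"
)
print(heads_with_multiple_line_breaks(sample))
```

On the sample, only the first `<head>` is reported, since the second contains a single `<lb/>`.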

lfoppiano commented 7 months ago

@Schroedi Great.

Did you also check the segmentation model training output files? If they were fine as they are, we should not include them, but if they had to be corrected, we should add them as well.

See comment here: https://github.com/kermitt2/grobid/issues/1067#issuecomment-1888503015

Schroedi commented 6 months ago

The segmentation looked fine to me.

Could you share a folder with the PDFs the (fulltext) training data are from? I am currently fetching them one by one.

kermitt2 commented 6 months ago

Hi @Schroedi, please send me an email (https://grobid.readthedocs.io/en/latest/Introduction/#credits) and I will send you the info for accessing the PDF repository used for the training data.

lfoppiano commented 2 days ago

@Schroedi I'm going through the PRs and was wondering whether this PR is complete. I see only one file changed in this PR; however, there are other files to which the level attribute could be added.

I also wanted to thank you for your contributions, and sorry for being slow in integrating them into Grobid.