kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.62k stars, 461 forks

Inconsistent file naming of header corpus #467

Open de-code opened 5 years ago

de-code commented 5 years ago

Hi @kermitt2

I found the naming of the header evaluation corpus confusing. Is there any reason the TEI and feature files do not share a common name?

For example, 322._499236.header matches up with header317.tei.

The training corpus seems to be better, although there are some files that do not intuitively match up. For example, 609._a_ray_tracing_method_for_illumination_ca_298160.header seems to match grobid-trainer/resources/dataset/header/corpus/tei/header607.tei.

The TEI XML contains a fileDesc id pointing to the beginning of the feature file, but it is a bit difficult to follow.
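To make that lookup less tedious, here is a minimal Python sketch of how the id could be read from a TEI file and matched against feature file names. It rests on assumptions not confirmed in this thread: that the id is stored as an `xml:id` (or plain `id`) attribute on the `fileDesc` element, and that the matching feature file name starts with that id followed by a dot. The function names `read_file_desc_id` and `find_feature_files` are hypothetical helpers, not part of GROBID.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predeclared "xml:" prefix under this namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_file_desc_id(tei_text):
    """Return the id attribute of the first fileDesc element, or None.

    Assumption: the id lives in xml:id (or a plain id attribute).
    """
    root = ET.fromstring(tei_text)
    for element in root.iter():
        # Compare local names so this works with or without a TEI namespace.
        if element.tag.rsplit("}", 1)[-1] == "fileDesc":
            return element.get(XML_ID) or element.get("id")
    return None


def find_feature_files(file_desc_id, feature_names):
    """Return the feature file names that start with '<id>.'.

    Assumption: the id is the leading part of the feature file name.
    Requiring the trailing dot keeps id "322" from matching "3220._x.header".
    """
    wanted = file_desc_id + "."
    return sorted(name for name in feature_names if name.startswith(wanted))
```

In practice the TEI text would come from reading a file under the corpus `tei` directory, and the candidate names from listing the directory that holds the `.header` feature files.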

kermitt2 commented 5 years ago

Hi @de-code !

This difference exists for historical reasons, and very old ones at that! I started with the CORA training data for header recognition, which consisted of 1000 annotated headers (just text extracted with some pdf2ascii tool).

Then I introduced PDF, layout features, and so on to avoid working only with crappy extracted text. To still exploit the CORA training data, I tried to retrieve all the Open Access PDFs corresponding to these 1000 annotated headers (those 1000 headers were supposed to come from OA PDFs). I did that in 2010 :)

It turned out that some PDFs were no longer available in OA, or not in the same version, so I removed those headers. I also stopped at around 900 headers at some point and never finished finding the PDFs for the last 100, if I remember correctly (this was more of a hobby at the time... and I probably found it more interesting to play with the dogs than to finish this repetitive task).

Anyway, header317.tei is typically header317 of the CORA corpus converted into TEI (automatically), which is matched with downloaded PDF 322, probably because 4 PDFs before this number were thrown out for some reason.

However, for the additional header files beyond this initial set of 1000, the names should normally match, if I remember correctly.

So this is why the names do not match, which was your initial question. I would be happy to have something better organized and to rename everything uniformly at some point, using random identifiers or DOIs.

de-code commented 5 years ago

Hi @kermitt2

Thank you for explaining that.

I don't blame you for preferring to procrastinate.

I think if the files had a particular name in the source corpus, it probably makes sense to retain it. Maybe add a prefix, e.g. cora- (or maybe even use sub-directories). Other than that I don't have much of a preference regarding the naming. My only suggestion would be to keep the filenames in sync to avoid confusion, e.g. cora-322._499236.header and cora-322._499236.tei.xml. I guess that could be automated to avoid severe boredom. I also just noticed that those files do not have the .xml extension while others do.
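As a rough illustration of how that automation might look, here is a small Python sketch that only computes a rename plan and touches no files. It assumes the (feature file, TEI file) pairs are already known, for instance from the fileDesc id, and that feature files end in `.header`. The function name `plan_renames` and the default `cora-` prefix are illustrative choices, not an existing GROBID tool.

```python
def plan_renames(pairs, prefix="cora-"):
    """Build (old_name, new_name) tuples that give each pair a shared stem.

    pairs: iterable of (feature_file_name, tei_file_name) tuples.
    The stem is taken from the feature file, since it carries the
    name used in the source corpus.
    """
    suffix = ".header"
    plan = []
    for feature_name, tei_name in pairs:
        if not feature_name.endswith(suffix):
            raise ValueError("unexpected feature file name: " + feature_name)
        # Strip only the final ".header", keeping dots inside the stem.
        stem = feature_name[: -len(suffix)]
        plan.append((feature_name, prefix + stem + suffix))
        plan.append((tei_name, prefix + stem + ".tei.xml"))
    return plan


if __name__ == "__main__":
    # Dry run: print the planned renames for review before applying them.
    for old, new in plan_renames([("322._499236.header", "header317.tei")]):
        print(old, "->", new)
```

Keeping the plan separate from the actual renaming means the mapping can be reviewed first and then applied in whatever way preserves history best, for example with `git mv`.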