kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Train GROBID with mixed material (OCR, JATS) #767

Open cboulanger opened 3 years ago

cboulanger commented 3 years ago

Hi, I am trying to train GROBID to deal with German-language sociology of law scholarship. I have a collection of PDFs from four decades of journal issues. The older ones, which make up the majority, exist only as scanned images; the newer ones are native PDFs. I have JATS data (from the publisher de Gruyter) for all of them, but only a small percentage contains actual article and citation data; most contain only the article metadata. I've run most of the image-based PDFs through the Abbyy OCR service and therefore have high-quality OCR'ed PDF/A files.

Out of the box, GROBID does not produce anything useful, which is probably not surprising, as the structure and content of my articles are very different from the English-language natural science articles it was trained on.

With the mixed bag of stuff that I have at my disposal, how do I best approach the training?

Thank you for any pointers.

cboulanger commented 3 years ago

Maybe this would be a question for a user forum. I saw something about a Mattermost channel; does it still exist?

kermitt2 commented 3 years ago

Hello @cboulanger !

Yes, indeed, that would be easier, also for sharing some examples. Please send me an email and I'll invite you to our Mattermost den.

cboulanger commented 3 years ago

Done. I'll keep the issue open so I can post a summary of a solution if one can be found, OK?