Open cboulanger opened 3 years ago
Maybe this would be a question for a user forum - I saw something about a Mattermost Channel - does it still exist?
Hello @cboulanger !
Yes indeed, that would be easier, also for sharing some examples. Please send me an email and I'll invite you to our Mattermost den.
Done. I'll keep the issue open to post a summary of a solution if one can be found, ok?
Hi, I am trying to train GROBID to handle German-language scholarship in the sociology of law. I have a collection of PDFs covering four decades of journal issues. The older ones, which make up the majority, exist only as scanned images; the newer ones are native PDFs. I have JATS data (from the publisher De Gruyter) for all of them, but only a small percentage of the files contain actual article and citation data; most contain only the article metadata. I've run most of the image-only PDFs through the ABBYY OCR service and therefore have high-quality OCR'ed PDF/A files.
Out of the box, GROBID does not produce anything useful, which is probably not surprising, as the structure and content of my articles are very different from the English-language natural science articles it has been trained on.
With the mixed bag of stuff that I have at my disposal, how do I best approach the training?
Thank you for any pointers.
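To make the "mixed bag" concrete, here is a minimal sketch of how the JATS files could be sorted into those with article text and citations and those with metadata only. It does not touch GROBID itself. The function name `classify_jats` and the three labels are my own assumptions for illustration; the only things taken from JATS are the standard `body` and `ref` element names. It also assumes the files parse with Python's built-in XML parser, which may fail on records that rely on entities defined in an external DTD.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Reduce a possibly namespaced tag such as '{uri}ref' to 'ref'."""
    return tag.rsplit("}", 1)[-1]


def classify_jats(xml_text):
    """Label a JATS record by how much usable content it carries.

    Hypothetical labels:
      'full'          - non-empty <body> and at least one <ref>
      'partial'       - only one of the two is present
      'metadata-only' - neither is present
    """
    root = ET.fromstring(xml_text)
    has_body = False
    ref_count = 0
    for element in root.iter():
        name = local(element.tag)
        if name == "body":
            # Count the body only if it has child elements or real text.
            if len(element) > 0 or (element.text or "").strip():
                has_body = True
        elif name == "ref":
            ref_count += 1
    if has_body and ref_count > 0:
        return "full"
    if has_body or ref_count > 0:
        return "partial"
    return "metadata-only"


if __name__ == "__main__":
    sample = (
        "<article><front><article-meta/></front>"
        "<body><p>Text</p></body>"
        "<back><ref-list><ref id='r1'/></ref-list></back></article>"
    )
    print(classify_jats(sample))
```

Records labelled `full` would be the obvious candidates to pair with their PDFs as a source of reference data; the rest would only help with header metadata.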