kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Help with PDF generation from Word #839

Open pakojil opened 2 years ago

pakojil commented 2 years ago

Hi Patrice, first of all, I would like to thank you and the rest of the project collaborators for the great effort you put in. I am thoroughly evaluating Grobid for producing JATS versions of articles.

Regarding the conversion of historical PDFs, it is clear to me that, despite the training (which I honestly don't know how to use properly), everything can be solved via an XSLT transformation from TEI to JATS.

However, I am trying to create a Word template for future articles. This is because I have been unable to find any tool that works properly for direct conversion from docx to JATS, and I have tried practically all of them.

My question is whether there is any existing template and, beyond that, whether there is some way to generate the PDF so that Grobid better identifies the fields when generating the TEI version. I mean, is any version of Word better than another? What is the most suitable PDF producer for this?

Thank you very much, and apologies for any inconvenience.

kermitt2 commented 2 years ago

Hi @pakojil

Thank you very much for the feedback on Grobid!

Having an XSLT to transform Grobid's TEI into JATS would be really nice. It was discussed at some point (see issue #98), but it does not seem to be progressing (conversely, a JATS -> TEI transformation is available ;). On my side, I left this work to others so I could concentrate on more core issues in Grobid (I don't have a lot of time for Grobid, unfortunately).
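To give an idea of what such a mapping involves, here is a minimal sketch written in Python with the standard library rather than XSLT. It only carries the title and abstract over to a bare JATS skeleton. The function name `tei_to_minimal_jats` and the sample document are hypothetical, and the TEI paths are assumptions about what Grobid's header output usually looks like; a real stylesheet would need to cover authors, affiliations, body and references as well.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}


def tei_to_minimal_jats(tei_xml: str) -> str:
    """Map the title and abstract of a TEI document to a bare JATS skeleton.

    Hypothetical helper for illustration only; not part of Grobid.
    """
    tei = ET.fromstring(tei_xml)

    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")

    # TEI titleStmt/title -> JATS title-group/article-title
    title = tei.find(".//tei:titleStmt/tei:title", NS)
    if title is not None:
        group = ET.SubElement(meta, "title-group")
        jats_title = ET.SubElement(group, "article-title")
        jats_title.text = "".join(title.itertext()).strip()

    # TEI profileDesc/abstract paragraphs -> JATS abstract/p
    # (assumed location; paragraphs may be nested inside a <div>)
    abstract = tei.find(".//tei:profileDesc/tei:abstract", NS)
    if abstract is not None:
        jats_abstract = ET.SubElement(meta, "abstract")
        for p in abstract.iter(f"{{{TEI_NS}}}p"):
            ET.SubElement(jats_abstract, "p").text = "".join(p.itertext()).strip()

    return ET.tostring(article, encoding="unicode")


# Made-up sample input, shaped like a small TEI header.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title level="a" type="main">A Sample Title</title></titleStmt></fileDesc>
    <profileDesc><abstract><div><p>First paragraph.</p></div></abstract></profileDesc>
  </teiHeader>
</TEI>"""

print(tei_to_minimal_jats(SAMPLE_TEI))
```

The same element-to-element correspondences would translate directly into XSLT templates; Python is used here only to keep the sketch short and easy to run.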

"if there is some way to generate the PDF so that Grobid better identifies the fields for the generation of the TEI version. I mean, is there any version of Word better than another? What is the most suitable PDF producer for this?"

About this second question, I think I am also not going to be very helpful...

First, it's better to export the PDF from Word itself. There is a working branch (#515) that supports docx input by converting the document to PDF with Apache POI and then processing the PDF with Grobid. However, the performance was not satisfactory: I saw a failure rate of around 5% and a very slow conversion process. I was planning to test docx4j. There is also no open source solution for .doc for the moment. So I would say the best solution is to use the proprietary Word PDF export.

Then, I don't think there are many options for the Word "save as PDF". The PDF quality setting has no impact on Grobid (it only affects the quality of embedded images).

It would indeed be interesting to identify Word templates that work better with Grobid. In general, using different font sizes for the title and section headers, and using large paragraph separation and indents, always helps Grobid :) Due to the lack of training data for the social sciences and the humanities, references given as footnotes are not well supported for the moment. Finally, using common fonts (avoiding proprietary fonts that can't be easily embedded) and avoiding special characters where possible (they might not be resolved properly via Unicode mapping) always helps.
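One way to compare candidate templates is to export each one to PDF, run it through Grobid, and check which fields show up in the resulting TEI. The following is a rough sketch under assumptions: `summarize_tei` is a hypothetical helper (not part of Grobid), the sample document is made up, and the element paths reflect the usual layout of Grobid's full-text TEI output.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def summarize_tei(tei_xml: str) -> dict:
    """Report which structural fields were recognized in a TEI document.

    Hypothetical helper for illustration; paths are assumptions about
    the shape of Grobid's full-text output.
    """
    root = ET.fromstring(tei_xml)

    # Main title from the header, if one was extracted.
    title = root.find(".//tei:titleStmt/tei:title", NS)
    title_text = "".join(title.itertext()).strip() if title is not None else None

    # Top-level section headers found in the body.
    heads = [
        "".join(head.itertext()).strip()
        for head in root.findall(".//tei:text/tei:body/tei:div/tei:head", NS)
    ]

    # Number of parsed bibliographic references.
    references = root.findall(".//tei:listBibl/tei:biblStruct", NS)

    return {
        "title": title_text,
        "section_heads": heads,
        "reference_count": len(references),
    }


# Made-up sample, shaped like a small full-text TEI result.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Template Test</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <div><head>Introduction</head><p>Text.</p></div>
      <div><head>Methods</head><p>Text.</p></div>
    </body>
    <back><div type="references"><listBibl><biblStruct/></listBibl></div></back>
  </text>
</TEI>"""

print(summarize_tei(SAMPLE_TEI))
```

Running this over the output for the same article laid out in two different templates gives a quick signal: a missing title, fewer section headers than expected, or a reference count of zero suggests the layout is harder for Grobid to segment.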

pakojil commented 2 years ago

Hi @kermitt2

Thank you very much for your kind answer.

I am working on two fronts. On the one hand, we have a portal with more than 40 scientific journals, the vast majority of which are in the humanities and social sciences. One part of the challenge is trying to get valid JATS from historically uploaded PDFs (some of them are scanned, and the others were generated directly from various sources). I am still working on that and thinking it through.

On the other hand, and this is what interests me the most right now, we are trying to provide authors with a Word template so that correct tagging of the document is automated as far as possible, leaving only minor adjustments for the journal editors or the technical staff involved.

With your information, I am better oriented, and I will continue along these lines. I will report on my progress, if there is any.

Thank you very much again, and best regards.