kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Word document instead of PDF #313

Open sarankup opened 6 years ago

sarankup commented 6 years ago

Hi,

At present, I have all documents as DOCX (Microsoft Word files) which I convert to PDF in order to run the GROBID XML conversion. Is there any possibility of using DOCX as input?

If PDF is the only input option, can we rely on the full-text extraction to preserve 100% of the content, even when the XML markup is incorrect? We can live with incorrect XML markup, but not with any content loss.

lfoppiano commented 6 years ago

@sarankup Currently PDF is the only input format for documents in Grobid. Supporting several formats is quite demanding to implement and, moreover, to maintain, so Grobid supports the most widely used format for articles and monographs.

kermitt2 commented 6 years ago

One way to support docx files would be to create an XML parser similar to the current parser for pdf2xml (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDF2XMLSaxHandler.java) or the future one for ALTO (https://github.com/kermitt2/grobid/blob/pdfalto_integration/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java).

An alternative is to create a stylesheet for transforming docx into an ALTO file.

But I have no clue if any of these options are doable :)
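
To give a rough idea of what the SAX route involves, here is a minimal sketch in Python rather than Grobid's Java. It is not Grobid code: `DocxTextHandler` and `docx_paragraphs` are hypothetical names. It only relies on the fact that a docx file is a zip archive whose main body lives in `word/document.xml`, with paragraphs in `w:p` elements and text runs in `w:t` elements. It recovers plain paragraph text and nothing else, so the layout information (coordinates, fonts) that the pdf2xml and ALTO handlers consume is not produced here.

```python
import io
import zipfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class DocxTextHandler(ContentHandler):
    """Hypothetical handler: collects the text of each w:p paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._runs = []
        self._in_text = False

    def startElementNS(self, name, qname, attrs):
        if name == (W_NS, "p"):
            self._runs = []          # a new paragraph starts
        elif name == (W_NS, "t"):
            self._in_text = True     # only w:t carries visible text

    def endElementNS(self, name, qname):
        if name == (W_NS, "t"):
            self._in_text = False
        elif name == (W_NS, "p"):
            self.paragraphs.append("".join(self._runs))

    def characters(self, content):
        if self._in_text:
            self._runs.append(content)


def docx_paragraphs(docx_file):
    """Return the paragraph texts of a docx given a path or file-like object."""
    with zipfile.ZipFile(docx_file) as archive:
        xml_bytes = archive.read("word/document.xml")
    handler = DocxTextHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # needed for the *NS callbacks
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.paragraphs
```

Calling `docx_paragraphs("paper.docx")` (a hypothetical file name) returns a list of strings, one per paragraph. Mapping those paragraphs onto the token and block structures Grobid expects would be the hard part, and is not attempted here.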

axfelix commented 6 years ago

It would probably be more sustainable long term to just add an unoconv service to Grobid for converting Word documents to PDF so they can go through the existing Grobid toolchain. I'm a longtime contributor to https://github.com/MartinPaulEve/meTypeset, which parses Word documents to JATS XML through a series of complicated Python and XSLT rules. It's included in https://github.com/pkp/ots along with Grobid, and different parsers are used depending on input, but it's harder to support in the long term because Microsoft's XML is very idiosyncratic and changes often.
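
As a rough sketch of that convert-then-process idea, done outside Grobid rather than as a built-in service: the script below shells out to unoconv and posts the resulting PDF to Grobid's `/api/processFulltextDocument` endpoint as the multipart field `input`. The server address (`localhost:8070`), the helper names, and the assumption that `unoconv` is installed and on the PATH are all mine, not part of Grobid.

```python
import subprocess
import uuid
from pathlib import Path
from urllib import request

# Assumption: a Grobid server running locally on its default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def conversion_command(docx_path, pdf_path):
    """Build the unoconv call that writes pdf_path from docx_path."""
    return ["unoconv", "-f", "pdf", "-o", str(pdf_path), str(docx_path)]


def multipart_body(field, filename, payload, boundary):
    """Encode a single file as a multipart/form-data request body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail


def docx_to_tei(docx_path, grobid_url=GROBID_URL):
    """Convert a docx to PDF, send it to Grobid, and return the TEI XML."""
    docx_path = Path(docx_path)
    pdf_path = docx_path.with_suffix(".pdf")
    # check=True raises if the conversion fails, so no stale PDF is sent
    subprocess.run(conversion_command(docx_path, pdf_path), check=True)

    boundary = uuid.uuid4().hex
    body = multipart_body("input", pdf_path.name, pdf_path.read_bytes(), boundary)
    req = request.Request(
        grobid_url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with request.urlopen(req) as response:
        return response.read().decode("utf-8")
```

With a server running, `docx_to_tei("paper.docx")` (a hypothetical file name) would return the TEI as a string. Whether any content is lost still depends on the Word-to-PDF step and on Grobid's PDF extraction, so this does not by itself answer the content-integrity question raised above.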

kermitt2 commented 5 years ago

My idea now would be to convert docx into ALTO. In principle any fixed-layout document could be transformed into ALTO, which is now the GROBID standard input.

kermitt2 commented 4 years ago

see PR #515