Open sarankup opened 6 years ago
@sarankup Currently, PDF is the only input format for documents in Grobid. Supporting several formats is quite demanding to implement and, moreover, to maintain, so Grobid supports only the most widely used format for articles and monographs.
One way to support docx files would be to create an XML parser similar to the current parser for pdf2xml (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDF2XMLSaxHandler.java) or the future one for ALTO (https://github.com/kermitt2/grobid/blob/pdfalto_integration/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java).
An alternative is to create a stylesheet for transforming docx files into ALTO files.
But I have no clue if any of these options are doable :)
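To give a rough idea of what the SAX route involves, here is a minimal Python sketch (not Grobid code, and not in Java like the handlers linked above). It assumes the usual docx layout, a zip archive with the body in `word/document.xml`, where `w:p` elements are paragraphs and `w:t` elements hold text runs. The names `DocxParagraphHandler` and `docx_paragraphs` are made up for illustration, and the sketch ignores tables, footnotes, styles and layout, which a real handler would need.

```python
import io
import zipfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# WordprocessingML namespace used by w:p (paragraph) and w:t (text run).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class DocxParagraphHandler(ContentHandler):
    """Collects the plain text of each w:p element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._runs = []
        self._in_text = False

    def startElementNS(self, name, qname, attrs):
        if name == (W_NS, "p"):
            self._runs = []          # a new paragraph starts
        elif name == (W_NS, "t"):
            self._in_text = True     # only text inside w:t is document content

    def characters(self, content):
        if self._in_text:
            self._runs.append(content)

    def endElementNS(self, name, qname):
        if name == (W_NS, "t"):
            self._in_text = False
        elif name == (W_NS, "p"):
            self.paragraphs.append("".join(self._runs))


def docx_paragraphs(docx_bytes):
    """Return the paragraph texts found in word/document.xml of a docx file."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        document_xml = archive.read("word/document.xml")
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # deliver (namespace, local name) pairs
    handler = DocxParagraphHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(document_xml))
    return handler.paragraphs
```

Note that this only recovers the text flow. Unlike pdf2xml or ALTO output, a docx file carries no page coordinates, so the layout information the existing handlers rely on would have to come from somewhere else.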
It would probably be more sustainable long term to just add an unoconv service to Grobid for converting Word documents to PDF so they can go through the existing Grobid toolchain. I'm a longtime contributor to https://github.com/MartinPaulEve/meTypeset which parses Word documents to JATS XML through a series of complicated Python and XSLT rules -- it's included in https://github.com/pkp/ots along with Grobid, and different parsers are used depending on input -- but it's harder to support in the long term because Microsoft's XML is very idiosyncratic and changes often.
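As a sketch of what such a conversion step could look like outside Grobid, the Python snippet below shells out to the `unoconv` command line tool with its `-f` (output format) and `-o` (output file) options. It assumes `unoconv` and LibreOffice are installed and on the PATH; the function names and the timeout value are illustrative choices, not part of any existing service.

```python
import subprocess
from pathlib import Path


def build_unoconv_command(docx_path, pdf_path):
    """Return the unoconv argument list that converts one file to PDF."""
    return ["unoconv", "-f", "pdf", "-o", str(pdf_path), str(docx_path)]


def convert_to_pdf(docx_path, timeout_seconds=120):
    """Convert a Word document to a PDF next to it and return the PDF path.

    Raises subprocess.CalledProcessError if unoconv exits with an error and
    subprocess.TimeoutExpired if the conversion hangs.
    """
    docx_path = Path(docx_path)
    pdf_path = docx_path.with_suffix(".pdf")
    subprocess.run(
        build_unoconv_command(docx_path, pdf_path),
        check=True,
        timeout=timeout_seconds,
    )
    return pdf_path
```

The resulting PDF can then be sent through Grobid like any other PDF. The timeout is there because headless office conversions occasionally hang on malformed input.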
My idea now would be to convert docx into ALTO. In principle any fixed-layout document could be transformed into ALTO, which is now the GROBID standard input.
see PR #515
Hi,
At present, I have all documents as DOCX (Microsoft Word files) which I convert to PDF in order to run the GROBID XML conversion. Is there any possibility of using DOCX as input?
If PDF is the only input option, can we rely on the full-text extraction always preserving 100% of the content, even when the XML markup is incorrect? We are fine with incorrect XML markup, but not with any content loss.
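One way to check this kind of content integrity on our side would be to compare word counts between the original text and the text in the generated XML. The Python sketch below is only a rough heuristic under stated assumptions: `source_text` is the plain text of the original document obtained by some other means, `tei_xml` is the XML returned by GROBID, and the function names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tokens(text):
    """Lowercased word tokens, ignoring punctuation and whitespace."""
    return re.findall(r"\w+", text.lower())


def missing_tokens(source_text, tei_xml):
    """Return tokens that occur more often in the source than in the XML output."""
    # Join all text nodes with spaces so words in adjacent elements stay separate.
    tei_text = " ".join(ET.fromstring(tei_xml).itertext())
    # Counter subtraction keeps only tokens with a positive difference.
    return Counter(tokens(source_text)) - Counter(tokens(tei_text))
```

An empty result means every word of the source appears at least as often in the output. Words hyphenated across lines, or split across inline elements in the XML, can show up as false positives, so a non-empty result is a signal to inspect the document rather than proof of loss.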