kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Running the java example on Windows #633

Open navraj28 opened 4 years ago

navraj28 commented 4 years ago

Hello! I am trying to run the Java example, as per. I built Grobid against the master branch. I see that the signature of Engine.processHeader has changed: the second argument is now an int instead of a boolean. I tried both 0 and 1, for example: engine.processHeader(pdfPath, 0, resHeader);

I am getting the stack trace below:
INFO: Loading model: D:\grobid\grobid-home\models\header\model.wapiti (size: 15734670)
org.grobid.core.exceptions.GrobidException: [PDFTOXML_CONVERSION_FAILURE] PDF to XML conversion failed on pdf file .\src\test\resources\Wang_paperAVE2008.pdf 
null
    at org.grobid.core.document.DocumentSource.processPdfToXmlThreadMode(DocumentSource.java:208)
    at org.grobid.core.document.DocumentSource.pdf2xml(DocumentSource.java:154)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:63)
    at org.grobid.core.document.DocumentSource.fromPdf(DocumentSource.java:49)
    at org.grobid.core.engines.HeaderParser.processing(HeaderParser.java:101)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:385)
    at org.grobid.core.engines.Engine.processHeader(Engine.java:374)
    at org.grobidExample.MyGrobid.runGrobid(MyGrobid.java:38)
    at org.grobidExample.App.main(App.java:11)
[Wapiti] Loading model: "D:\grobid\grobid-home\models\header\model.wapiti"
Model path: D:\grobid\grobid-home\models\header\model.wapiti

kermitt2 commented 4 years ago

Hello @navraj28. As indicated in the documentation, the latest release (0.6.1) and the current master do not support the Windows platform, because pdfalto has not been recompiled for it. You will need to use the Docker image to run it on Windows.

lfoppiano commented 4 years ago

@navraj28 https://grobid.readthedocs.io/en/latest/Troubleshooting/#windows-related-issues

You can use the Docker image together with the Java or Python client.
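
For anyone who wants to see roughly what a client does against the Docker-hosted service, here is a minimal Java 11+ sketch that only builds the HTTP request for header extraction and does not send it. It assumes the service listens on http://localhost:8070, that the header endpoint is /api/processHeaderDocument, and that the PDF goes in a multipart form field named "input". The class and method names are hypothetical, and this is not the official Java client.

```java
import java.io.ByteArrayOutputStream;
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class GrobidHeaderRequest {

    // Assumptions, not taken from this thread: default host/port,
    // endpoint path, and multipart field name of the GROBID service.
    static final String DEFAULT_BASE_URL = "http://localhost:8070";
    static final String HEADER_ENDPOINT = "/api/processHeaderDocument";
    static final String FILE_FIELD = "input";

    /** Builds a single-part multipart/form-data body carrying the PDF bytes. */
    static byte[] multipartBody(String boundary, String fileName, byte[] pdfBytes) {
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"" + FILE_FIELD
                + "\"; filename=\"" + fileName + "\"\r\n"
                + "Content-Type: application/pdf\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n";
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(head.getBytes(StandardCharsets.UTF_8));
        out.writeBytes(pdfBytes);
        out.writeBytes(tail.getBytes(StandardCharsets.UTF_8));
        return out.toByteArray();
    }

    /** Builds (but does not send) the POST request for header extraction. */
    static HttpRequest headerRequest(String baseUrl, String boundary,
                                     String fileName, byte[] pdfBytes) {
        byte[] body = multipartBody(boundary, fileName, pdfBytes);
        return HttpRequest.newBuilder(URI.create(baseUrl + HEADER_ENDPOINT))
                .header("Content-Type", "multipart/form-data; boundary=" + boundary)
                .POST(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }
}
```

To actually call the service, the request could be passed to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` while the container is running. Since no pdfalto binary is needed on the Windows host in that setup, the conversion failure above should not apply.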

navraj28 commented 4 years ago

@lfoppiano Strangely, I am able to run almost exactly the same code in the unit tests (TestHeaderParser.testHeaderHeader), but not the example Java API code.

kermitt2 commented 4 years ago

@navraj28 It's not strange: the test isolates the process from pdfalto and from everything outside the HeaderParser class (the unit under test).

navraj28 commented 4 years ago

Thank you @kermitt2

ttsigg commented 4 years ago

Are there plans to have it working again on Windows? I'm unable to update past 0.6.0 at present.

lfoppiano commented 4 years ago

@siggins We can't ensure compatibility with Windows, as maintaining all three operating systems is a lot of work. However, we would be happy to get help from someone more familiar with it.

ttsigg commented 4 years ago

What's needed to get support working again on Windows? I may be able to help.

kermitt2 commented 4 years ago

Hi @siggins! Currently the only missing piece for Windows is to recompile the pdfalto binary (using Cygwin) and move it under grobid-home/pdf2xml/win-64/pdfalto/.
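
As a quick sanity check after copying a rebuilt binary, something like the following sketch could confirm that the directory mentioned above exists and is not empty. The class and method names are hypothetical, and because the expected file name of the binary is not stated here, the check only looks for any regular file in that directory. It does not verify that the binary actually runs.

```java
import java.io.File;
import java.nio.file.Path;

class PdfaltoCheck {

    /** Resolves pdf2xml/win-64/pdfalto relative to the given grobid-home. */
    static Path windowsPdfaltoDir(Path grobidHome) {
        return grobidHome.resolve("pdf2xml").resolve("win-64").resolve("pdfalto");
    }

    /** True if the directory exists and holds at least one regular file. */
    static boolean hasBinary(Path pdfaltoDir) {
        // listFiles returns null when the path is missing or not a directory.
        File[] entries = pdfaltoDir.toFile().listFiles(File::isFile);
        return entries != null && entries.length > 0;
    }
}
```

For example, `PdfaltoCheck.hasBinary(PdfaltoCheck.windowsPdfaltoDir(Path.of("D:\\grobid\\grobid-home")))` would return false on a checkout where the Windows binary has not been added yet.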

lfoppiano commented 4 years ago

@siggins any luck? I remember I spent a good week building the last version 😓