kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

grobid-client.py not generating tei.xml files #2

Closed davejavu1969 closed 5 years ago

davejavu1969 commented 5 years ago

Saw previous issue. I do have Grobid Server 0.5.5 up and running, I can call it using the web interface and all runs sweetly. Switching to command line....

Running in a virtual env produces the following output.

(env) [dmp423@servername grobid-client-python]$ python grobid-client.py --input /data-ext/user-data/dmp423/997 --output /data-ext/user-data/dmp423/997_tei processFulltextDocument
7 PDF files to process
/data-ext/user-data/dmp423/997/41157381_2128235180.pdf
/data-ext/user-data/dmp423/997/41160557_1972837948.pdf
/data-ext/user-data/dmp423/997/41160577_1983144794.pdf
/data-ext/user-data/dmp423/997/41160580_2150480952.pdf
/data-ext/user-data/dmp423/997/41160602_2025317023.pdf
/data-ext/user-data/dmp423/997/41160624_2107099243.pdf
/data-ext/user-data/dmp423/997/41160640_2001941328.pdf
runtime: 2.8 seconds
(env) [dmp423@core-dev-ber01 grobid-client-python]$

The input and output folders exist, but the TEI xml files are not created.

All advice greatly appreciated! - Thanks

kermitt2 commented 5 years ago

Hi @davejavu1969 ! Thanks a lot for reporting the problem. I have just pushed a fix. Would it be possible for you to update the master branch and test again? I added a --force parameter to explicitly force processing and overwrite existing TEI result files (which was the issue, I think).
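To illustrate what a --force option of this kind typically controls, here is a minimal sketch. It is not the client's actual code: the function names are hypothetical, and the `.tei.xml` naming is only inferred from the file names shown in the logs of this thread.

```python
import os


def tei_path_for(pdf_path, output_dir):
    """Hypothetical helper: map foo.pdf to <output_dir>/foo.tei.xml."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return os.path.join(output_dir, stem + ".tei.xml")


def should_process(pdf_path, output_dir, force=False):
    """Skip a PDF whose TEI result already exists, unless force is set."""
    return force or not os.path.isfile(tei_path_for(pdf_path, output_dir))
```

With logic like this, a PDF is silently skipped when a result file is already present, which is the situation the new parameter is meant to override.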

davejavu1969 commented 5 years ago

Hi Patrice, Thanks loads for looking at this. Grobid is really critical to what I am doing, and the new version looks like another great improvement. It's still not creating the files, but the status messages have changed:

/data-ext/user-data/dmp423/997/41160577_1983144794.pdf http://localhost:8070/api/processFulltextDocument 

I previously only got the first line for each of the files in the folder; I now also get the second line for each one, which shows the API call. But... still no TEI being written - I'll keep looking here in case it's me being daft! Many thanks, David.

kermitt2 commented 5 years ago

Ah yes I printed the url after the pdf path, just to be sure it was calling the right service.
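As an illustration of how a service URL such as http://localhost:8070/api/processFulltextDocument can be assembled from a server name, a port, and a service name, here is a minimal sketch. The function and parameter names are hypothetical and the logic is an assumption, not the client's actual implementation.

```python
def service_url(server, port, service):
    """Build the URL of a GROBID service from its parts.

    Assumption: the port is only set when `server` has no path component;
    otherwise it would be appended after the path.
    """
    base = server if "://" in server else "http://" + server
    base = base.rstrip("/")
    if port:
        base = f"{base}:{port}"
    return f"{base}/api/{service}"


print(service_url("localhost", "8070", "processFulltextDocument"))
```

Printing the result before sending the request, as done in the fix, is a cheap way to confirm that the client targets the intended server.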

If the TEI files are still not written, maybe the issue is that the output directory path does not exist?

The TEI files are created on my local machine if the output path exists. I will add some more error messages and create the output directory if it does not exist.
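Creating the output directory up front can be sketched as follows. This is a generic standard-library pattern, not necessarily what the fix does; the function name is hypothetical.

```python
import os


def ensure_output_dir(path):
    """Create the output directory (and any missing parents) if needed."""
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    return os.path.abspath(path)
```

Note that `os.makedirs` still raises an `OSError` (for example `PermissionError`) when the directory cannot be created, so the caller should report that error rather than ignore it.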

Otherwise, you could also try the GROBID public demo server to check whether the problem comes from your GROBID server: in the config file, just replace the content with:

{
    "grobid_server": "cloud.science-miner.com/grobid",
    "grobid_port": "",
    "batch_size": 1000,
    "sleep_time": 5
}

kermitt2 commented 5 years ago

With commit 7dceec06872aaeb6a9e4d6588c56b9e12405e6d2, the output directory is now created if it does not exist.

kermitt2 commented 5 years ago

With commit b74a7b5692b87569cf0319857b036574e36ce872, I have added more checks (GROBID server alive, TEI writing error) and more error messages.

The client was initially very basic :) but too basic apparently, because it was producing these silent failures. Now it is a bit more robust and should explain possible failures better.
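A server-alive check of the kind mentioned above could look like the sketch below. It assumes the GROBID service exposes an `/api/isalive` endpoint under the given base URL and uses only the standard library; the function name and the timeout default are hypothetical, and this is not the client's own code.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url, timeout=5.0):
    """Return True if <base_url>/api/isalive answers with HTTP 200."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, HTTP error, or bad URL.
        return False
```

Calling this once before a batch starts turns an unreachable server into an explicit message instead of a run that finishes without producing files.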

davejavu1969 commented 5 years ago

I am out at the moment but will have a look at this very shortly. I’ll try with the online service as suggested in your previous email.

davejavu1969 commented 5 years ago

Have tried on both local and online server, same error. If it's writing successfully on your local machine, that points to write issues at my end. Will keep looking - thanks. :)

7 PDF files to process
/data-ext/user-data/dmp423/997/41157381_2128235180.pdf
/data-ext/user-data/dmp423/997/41160557_1972837948.pdf
/data-ext/user-data/dmp423/997/41160577_1983144794.pdf
/data-ext/user-data/dmp423/997/41160580_2150480952.pdf
/data-ext/user-data/dmp423/997/41160602_2025317023.pdf
/data-ext/user-data/dmp423/997/41160624_2107099243.pdf
/data-ext/user-data/dmp423/997/41160640_2001941328.pdf
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160602_2025317023.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160580_2150480952.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160624_2107099243.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160577_1983144794.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41157381_2128235180.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160640_2001941328.tei.xml failed
Writing resulting TEI XML file /data-ext/user-data/dmp423/997_tei/41160557_1972837948.tei.xml failed

kermitt2 commented 5 years ago

OK that's weird!

Could you give the command you are using? Are you sure you have write permissions on the output path /data-ext/user-data/dmp423/997_tei/?

davejavu1969 commented 5 years ago

Sorted. Definitely server permissions (on Uni server) - works perfectly writing to a different folder (either existing or letting Grobid create the folder, both work)
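A permissions problem like this one can be detected before any PDF is sent to the server by probing the output directory. The sketch below is a generic standard-library check with a hypothetical function name; it is not part of the client.

```python
import os
import tempfile


def check_output_dir(path):
    """Return None if `path` is a writable directory, else a short reason."""
    if not os.path.isdir(path):
        return "not an existing directory"
    try:
        # Actually creating a file is more reliable than os.access(),
        # especially on network or ACL-controlled file systems.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError as exc:
        return f"cannot write: {exc.strerror or exc}"
    return None
```

A non-None result can be printed with the path so that a message such as "Permission denied" points directly at the server-side configuration.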

Final Q - are teiCoordinates available via the client? (I suspect not, as the -teiCoordinates flag isn't valid.)

kermitt2 commented 5 years ago

Commit 7be8dd4861760eef987bc0f0b75dd19d8b39575e added the -teiCoordinates flag. You can restrict the structures for which you want coordinates (out of the 5 currently supported) in the config.json file:

"coordinates": [ "persName", "figure", "ref", "biblStruct", "formula" ]
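As a rough sketch of how such a list might end up in a request, the snippet below turns the configured names into a repeated `teiCoordinates` form field. The function name is hypothetical and the mapping is an assumption about the client, so treat it as an illustration only.

```python
from urllib.parse import urlencode


def build_form_data(config, tei_coordinates=False):
    """Add a multi-valued teiCoordinates field when the flag is set."""
    data = {}
    if tei_coordinates and config.get("coordinates"):
        data["teiCoordinates"] = list(config["coordinates"])
    return data


config = {"coordinates": ["persName", "figure"]}
# doseq=True repeats the key once per list element.
print(urlencode(build_form_data(config, tei_coordinates=True), doseq=True))
```

Removing a name from the list in config.json would then simply drop the corresponding field value from the request.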

davejavu1969 commented 5 years ago

Awesome work - many thanks!

Mayar2009 commented 4 years ago

Excuse me for reopening this issue! Here is what I did:

1) Installed the latest stable release of GROBID, version 0.5.6, from PowerShell on Windows 10:
1.1) C:> wget https://github.com/kermitt2/grobid/archive/0.5.6.zip
1.2) unzip 0.5.6.zip, which gave me the error: unzip: cannot find either 0.5.6.zip or 0.5.6.zip.zip.
I do not know how, but when I went to C:\ I found the grobid folder there. Back in PowerShell, I then ran:
2) C:> cd C:\grobid
3) C:\grobid> ./gradlew run (it has been hanging at 88% for three days now; is that normal?)
The service at http://localhost:8070/ is working as described at https://komax.github.io/blog/text/mining/grobid/ and I got the tei.xml file.
4) I then tried the Python client:
PS C:\grobid> git clone https://github.com/kermitt2/grobid-client-python
PS C:\grobid\grobid-client-python> python3 grobid-client.py --input C:\grobidFiles\input --output C:\grobidFiles\output processFulltextDocument

In the input folder I have just 2 PDFs, but I could not find anything in the output folder. I have also tried:
PS C:\grobid\grobid-client-python> python3 grobid-client.py --input C:\grobidFiles\input\ --output C:\grobidFiles\output\ processFulltextDocument
What is the matter, please?