Closed dtkaczyk closed 9 years ago
Hello Dominika,
Thank you for reporting the problem! It appears that I cannot reproduce it. The TEI file that I obtain for this PDF file from PMC_sample_1943 is well-formed and different:
https://grobid.s3.amazonaws.com/11011_2011_Article_9233.tei.xml
Which version of GROBID are you using?
Normally all the TEI files are well-formed since some changes in March, which made the full text TEI much more robust. For the past two weeks there has been a small validity error when the coordinates are output in the TEI file, because the output does not follow a TEI schema for the moment (I am a bit slow at recreating the RNG schema). Otherwise, the TEI files are normally also valid when the coordinates in the original PDF are not output.
Hi Patrice, Thank you for the answer.
I checked out the latest version today: 0.3.9-SNAPSHOT. I am using the CRF++ library compiled for Linux 32-bit, but this does not seem relevant to the problem.
Here is the exact file I obtained: http://pastebin.com/ReH7snh7
If you think of anything that might cause this, please let me know. I will try to debug it deeper anyway.
I think the problem is the CRF++ library for Linux 32-bit. I removed it from the GROBID distribution one or two years ago because I can't compile and update it anymore (I don't have access to a 32-bit Linux set-up and CRF++ is really hard to build with JNI).
In addition, the models for CRF++ might not have been updated for the full text model (I always forget!). I plan to remove the support of CRF++ from GROBID as soon as we can have Wapiti working on Windows. At this point, I can only guarantee the Mac OS and Linux 64-bit versions!
Thanks for the information. Until now I thought the choice of CRF library affected only the time performance.
Would it be OK to use Wapiti on Linux 32-bit? Or should I also move everything to 64-bit?
I think Wapiti on Linux 32-bit should work well! But you have to build with this fork -> Wapiti which includes JNI and important bug fixes for GROBID. If it works well, I could add Wapiti 32 bits to GROBID!
Everything worked fine, and miraculously the output XMLs seem valid now. Thanks a lot for your help! Would you like me to provide the resulting .so files?
Very good!! Yes, many thanks in advance for the .so files. I will add them so that people on Linux 32-bit can have an "out-of-the-box" solution. :+1:
Hello, I am trying to use Grobid batch mode to extract structured full text from PDFs (-exe processFullText). Unfortunately, the latest code version gives me invalid XML files in the output. I am not sure whether this is caused by problems with my installation or the source code. Here is a fragment of the output XML I obtained processing the example file "Metab_Brain_Dis_2011_Mar_9_26(1)_1-8/11011_2011_Article_9233.pdf" from PMC_sample_1943 dataset:
It seems to lack the opening <body> tag (there is a closing one at the end of the document), as well as a few closing </p> tags. I have similar problems with other PDF files as well. Only the full text part is corrupted, the headers and references seem to be correct.
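For anyone who wants to check a whole batch for this kind of breakage (a missing opening <body> tag or unclosed </p> tags), here is a minimal sketch in Python that only tests well-formedness, not validity against the TEI schema. It is not part of GROBID. The function names are made up for illustration, and the `*.tei.xml` pattern is an assumption based on the file name linked in this thread.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def check_well_formed(xml_text):
    """Return None if xml_text parses as XML, else a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple pointing at the failure.
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


def check_directory(directory):
    """Map each *.tei.xml file name in directory to its parse error, if any.

    The *.tei.xml pattern is an assumption about how the output is named.
    """
    problems = {}
    for path in sorted(Path(directory).glob("*.tei.xml")):
        error = check_well_formed(path.read_bytes())
        if error is not None:
            problems[path.name] = error
    return problems
```

Calling `check_directory` on the batch output directory returns an empty dict when every file parses. A closing tag with no matching opening tag, as described above, shows up as a mismatched-tag error with its line and column.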