kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Invalid XML on the output #73

Closed dtkaczyk closed 9 years ago

dtkaczyk commented 9 years ago

Hello, I am trying to use Grobid batch mode to extract structured full text from PDFs (-exe processFullText). Unfortunately, the latest code version produces invalid XML files. I am not sure whether this is caused by a problem with my installation or with the source code. Here is a fragment of the output XML I obtained when processing the example file "Metab_Brain_Dis_2011_Mar_9_26(1)_1-8/11011_2011_Article_9233.pdf" from the PMC_sample_1943 dataset:

<text xml:lang="en">
                                                <p>the nature of the deficient enzyme, eleven types and subtypes of MPS are already recognized. Partially degraded GAGs accumulate in cells of MPS patients, causing dysfunction of tissues and organs, including the heart, respiratory system, bones, joints and central nervous system (CNS). In MPS types characterized by the storage of heparan sulfate (HS), one of the GAGs, neurological symptoms are among the most severe ones (Neufeld and Muenzer 2001). Although enzyme replacement therapy (ERT) has been developed for various types of MPS, and registered to date for three of them (types I, II and VI), such a treatment is ineffective for neurological symptoms as the enzyme cannot cross the blood-brainbarrier (for reviews, see Beck 2007a, b). The second therapy used in a relatively large fraction of MPS patients, A. Kloska : J. Jakóbkiewicz-Banecka : G. Węgrzyn (*)</p>

                                <p>Department                           <p>e-mail: wegrzyn@biotech.ug.gda.pl                            <p>bone marrow transplantation, has been shown to be efficient to some extent in MPS type I, but not in MPS III (Sanfilippo disease) (Beck 2007a, b), an MPS type characterized by the most severe CNS dysfunctions among mucopolysaccharidoses (for recent discussion, see Węgrzyn et al. 2010).</p>

It seems to lack the opening <body> tag (there is a closing one at the end of the document), as well as a few closing </p> tags. I have similar problems with other PDF files. Only the full text part is corrupted; the headers and references seem to be correct.
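For anyone who wants to confirm this kind of problem, a parser will report the first point where the tags stop matching. Below is a minimal sketch using only the Python standard library. The helper name `check_well_formed` and the two sample strings are made up for illustration; the `broken` string is a shortened stand-in for the fragment above, not the real GROBID output.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column of the first error.
        return False, str(err)
    return True, None


# Illustrative stand-in: nested <p> elements that are never closed.
broken = '<text xml:lang="en"><p>first</p><p>Department <p>e-mail <p>rest</p></text>'
# Same content with matching tags and a <body> wrapper.
fixed = '<text xml:lang="en"><body><p>first</p><p>rest</p></body></text>'

print(check_well_formed(broken))
print(check_well_formed(fixed))
```

This only checks well-formedness (matching tags, a single root element). It does not validate the file against a TEI schema.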

kermitt2 commented 9 years ago

Hello Dominika,

Thank you for reporting the problem! I cannot reproduce it, though. The TEI file that I obtain for this PDF from PMC_sample_1943 is well-formed and different:

https://grobid.s3.amazonaws.com/11011_2011_Article_9233.tei.xml

Which version of GROBID are you using?

Normally all TEI files are well-formed, since some changes in March made the full text TEI much more robust. For the past two weeks there has been a small validity error when the coordinates are output in the TEI file, because that output does not follow a TEI schema for the moment (I am a bit slow at recreating the RNG schema). Otherwise, the TEI files are normally also valid when the coordinates in the original PDF are not output.

dtkaczyk commented 9 years ago

Hi Patrice, Thank you for the answer.

I checked out the latest version today: 0.3.9-SNAPSHOT. I am using CRF++ library compiled for Linux 32-bit, but this does not seem relevant to the problem.

Here is the exact file I obtained: http://pastebin.com/ReH7snh7

If you think of anything that might cause this, please let me know. I will try to debug it deeper anyway.

kermitt2 commented 9 years ago

I think the problem is the CRF++ library for 32-bit Linux. I removed it from the GROBID distribution one or two years ago because I can't compile and update it anymore (I don't have access to a 32-bit Linux set-up, and CRF++ is really hard to build with JNI).

In addition, the CRF++ models might not have been updated for the full text model (I always forget!). I plan to remove CRF++ support from GROBID as soon as we have Wapiti working on Windows. At this point, I can only guarantee the Mac OS and 64-bit Linux versions!

dtkaczyk commented 9 years ago

Thanks for the information. Until now I thought the choice of CRF library affected only the run time.

Would it be OK to use Wapiti on Linux 32-bit? Or should I also move everything to 64-bit?

kermitt2 commented 9 years ago

I think Wapiti on Linux 32-bit should work well! But you have to build it from this fork -> Wapiti, which includes JNI and important bug fixes for GROBID. If it works well, I could add the 32-bit Wapiti build to GROBID!

dtkaczyk commented 9 years ago

Everything worked fine, and miraculously the output XMLs seem valid now. Thanks a lot for your help! Would you like me to provide the resulting .so files?
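In case it is useful to others, a whole batch output directory can be checked with a short script. This is only a sketch under assumptions: the function name `find_malformed` is made up, and the `*.tei.xml` pattern is a guess based on the file name in the link above, so adjust it to whatever the batch actually writes.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_malformed(output_dir, pattern="*.tei.xml"):
    """Return (file name, parser error) pairs for files that fail to parse."""
    failures = []
    for path in sorted(Path(output_dir).glob(pattern)):
        try:
            ET.parse(path)
        except ET.ParseError as err:
            # Keep the parser message, which includes line and column.
            failures.append((path.name, str(err)))
    return failures
```

Calling `find_malformed("/path/to/output")` (a placeholder path) returns an empty list when every matching file parses. Like any plain parse, this checks well-formedness only, not validity against a TEI schema.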

kermitt2 commented 9 years ago

Very good!! Yes, many thanks in advance for the .so files. I will add them so that people on Linux 32-bit can have an "out-of-the-box" solution. :+1: