kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Not able to parse content from tables in PDF. It skips pages #1044

Open sandeepsamant1702 opened 1 year ago

sandeepsamant1702 commented 1 year ago
kermitt2 commented 1 year ago

Hello @sandeepsamant1702 !

Which version of Grobid are you using?

In 0.7.3, it is encoded like this in the result XML:

            <figure
                xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0">
                <head>Table 1</head>
                <label>1</label>
                <figDesc>Summary statistics of welfare aggregates monthly).</figDesc>
                <table>
                    <row>
                        <cell>Variable</cell>
                        <cell>Mean (USD)</cell>
                        <cell>Mean (JD)</cell>
                        <cell>Std. Dev. (JD)</cell>
                        <cell>Min (JD)</cell>
                        <cell>Max (JD)</cell>
                    </row>
                    <row>
                        <cell>Income per capita</cell>
                        <cell>49.63</cell>
                        <cell>34.95</cell>
                        <cell>64.41</cell>
                        <cell>0</cell>
                        <cell>3000</cell>
                    </row>
                    ...
                </table>
            </figure> 
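
To pull the cell values out of a result like the one above, a minimal Python sketch using only the standard library could look like this. It assumes the TEI namespace shown in the snippet and that tables are encoded as figure elements with type="table". The helper name extract_tables and the sample string are illustrative and not part of Grobid.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as it appears in the Grobid output quoted above.
TEI_URI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI_URI}


def extract_tables(tei_xml: str) -> list:
    """Return one dict per <figure type="table"> found in a TEI string.

    Hypothetical helper for illustration; it is not part of Grobid.
    """
    root = ET.fromstring(tei_xml)
    tables = []
    # iter() also visits the root element, so this works for a full TEI
    # document as well as for a single <figure> fragment.
    for figure in root.iter(f"{{{TEI_URI}}}figure"):
        if figure.get("type") != "table":
            continue
        rows = [
            [(cell.text or "").strip() for cell in row.findall("tei:cell", TEI_NS)]
            for row in figure.findall("tei:table/tei:row", TEI_NS)
        ]
        head = figure.findtext("tei:head", default="", namespaces=TEI_NS)
        tables.append({"head": head.strip(), "rows": rows})
    return tables


# Shortened sample in the same shape as the snippet above.
SAMPLE = """<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0">
  <head>Table 1</head>
  <table>
    <row><cell>Variable</cell><cell>Mean (USD)</cell></row>
    <row><cell>Income per capita</cell><cell>49.63</cell></row>
  </table>
</figure>"""

if __name__ == "__main__":
    for table in extract_tables(SAMPLE):
        print(table["head"])
        for row in table["rows"]:
            print(row)
```

If a page with a table is really being skipped, the corresponding figure element would be missing from this list, which makes the check a quick way to confirm what was extracted.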
sandeepsamant1702 commented 1 year ago

The latest version only. I cloned it using 'git clone https://github.com/kermitt2/grobid.git'. I still don't understand your answer. My problem is that the PDF I am parsing contains tables, and whenever a table comes up, Grobid skips the entire page and reads only the first line of the table.

kermitt2 commented 1 year ago

The latest version only

Sorry, the master version is currently a work in progress with respect to tables and figures. You should use the latest stable version, 0.7.3, for example via the Docker image.
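
As a hedged illustration of the Docker route: once a Grobid container is running (the documentation describes a command along the lines of docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.3), the service can be called over HTTP. The Python sketch below builds a POST to /api/processFulltextDocument with the PDF in the multipart field named input. The host, port, file name, and helper names are assumptions for illustration and should be checked against the documentation for the version in use.

```python
import urllib.request
import uuid

# Assumed address of a locally running Grobid service (default port 8070).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def build_fulltext_request(pdf_bytes: bytes, url: str = GROBID_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one PDF.

    Hypothetical helper for illustration; it is not part of Grobid.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="input"; filename="document.pdf"\r\n',
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def process_pdf(path: str) -> str:
    """Send a PDF to the service and return the TEI XML as text."""
    with open(path, "rb") as fh:
        request = build_fulltext_request(fh.read())
    # Full-text processing can be slow on long documents, hence the long timeout.
    with urllib.request.urlopen(request, timeout=300) as response:
        return response.read().decode("utf-8")
```

Calling process_pdf("report.pdf") with a hypothetical local file would return the TEI XML, in which tables appear as figure elements like the one quoted earlier in this thread.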

Could you share this PDF so that I can try to reproduce the error?

sandeepsamant1702 commented 1 year ago

https://www.who.int/publications/i/item/9789241549684

sandeepsamant1702 commented 1 year ago

I am getting an error with version 0.7.3 when running ./gradlew run. Java fails with "undefined symbol: __libc_pthread_init, version GLIBC_PRIVATE". I am using OpenJDK 11. Does it require a different Java version? I also tried JDK 17, which gives the same error.

The exact error is:

/usr/lib/jvm/java-11-openjdk-amd64/bin/java: symbol lookup error: grobid-0.7.3/grobid-home/lib/lin-64/libpthread.so.0: undefined symbol: __libc_pthread_init, version GLIBC_PRIVATE

kermitt2 commented 1 year ago

I am getting an error with version 0.7.3 when running ./gradlew run. Java fails with "undefined symbol: __libc_pthread_init, version GLIBC_PRIVATE". I am using OpenJDK 11. Does it require a different Java version?

Ah, this error comes from your glibc version, see https://github.com/kermitt2/grobid/issues/1019. The fix is to use the master version, where I rebuilt the native library to avoid this error :D
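
To check whether a machine is likely to hit this error before building, a small Python sketch can report the system C library version. The threshold used here is an assumption that is not stated in this thread: glibc 2.34 merged libpthread into libc, after which a separately bundled libpthread.so.0 may fail to resolve private symbols such as __libc_pthread_init. The helper names are illustrative only.

```python
import platform
import re

# Assumption (not from this thread): glibc 2.34 merged libpthread into libc.
MERGED_PTHREAD_SINCE = (2, 34)


def parse_version(text: str) -> tuple:
    """Turn a string such as '2.35' into (2, 35), keeping major and minor only."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:2])


def likely_affected(libc_name: str, libc_version: str) -> bool:
    """Guess whether a bundled libpthread may clash with the system libc."""
    if libc_name != "glibc" or not libc_version:
        return False
    return parse_version(libc_version) >= MERGED_PTHREAD_SINCE


if __name__ == "__main__":
    # platform.libc_ver() returns a (name, version) pair; both may be empty.
    name, version = platform.libc_ver()
    print(f"Detected C library: {name or 'unknown'} {version}".strip())
    if likely_affected(name, version):
        print("The 0.7.3 native libraries may hit the symbol lookup error; "
              "consider the master version or the Docker image.")
```

This is only a heuristic. The error message itself, as quoted above, remains the reliable indicator.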

Anyway, I tried the PDF with 0.7.3 and master, and I get the same result:

Does it help?