kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Full text and much bibliographic information not extracted from a specific PDF #237

Closed lfoppiano closed 6 years ago

lfoppiano commented 7 years ago

In the following PDF:

https://link.springer.com/chapter/10.1007%2F978-3-642-21560-5_33

no full text and no bibliographic information are extracted. Here is the output XML obtained:

<?xml version="1.0" encoding="UTF-8"?>
<TEI
    xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 /home/lopez/grobid/grobid-home/schemas/xsd/Grobid.xsd"
    xmlns:xlink="http://www.w3.org/1999/xlink">
    <teiHeader xml:lang="en">
        <encodingDesc>
            <appInfo>
                <application version="0.4.2-SNAPSHOT" ident="GROBID" when="2017-09-06T15:57+0000">
                    <ref target="https://github.com/kermitt2/grobid">GROBID - A machine learning software for extracting information from scholarly documents</ref>
                </application>
            </appInfo>
        </encodingDesc>
        <fileDesc>
            <titleStmt>
                <title level="a" type="main">PN interference CMN backbone CMN−MC link PN PN CMN CMN CMN CMN CMN CMN MC MC</title>
            </titleStmt>
            <publicationStmt>
                <publisher/>
                <availability status="unknown">
                    <licence/>
                </availability>
            </publicationStmt>
            <sourceDesc>
                <biblStruct>
                    <analytic>
                        <title level="a" type="main">PN interference CMN backbone CMN−MC link PN PN CMN CMN CMN CMN CMN CMN MC MC</title>
                    </analytic>
                    <monogr>
                        <imprint>
                            <date/>
                        </imprint>
                    </monogr>
                    <note>Internet</note>
                </biblStruct>
            </sourceDesc>
        </fileDesc>
        <profileDesc>
            <abstract/>
        </profileDesc>
    </teiHeader>
    <text xml:lang="en">
        <body></body>
        <back>
            <div type="references">
                <listBibl/>
            </div>
        </back>
    </text>
</TEI>
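
A quick way to confirm that a result like the one above is effectively empty is to check the abstract, body and reference list programmatically. The following is only a minimal sketch using the Python standard library; the function name `summarize_tei` and the inline sample are hypothetical and are not part of GROBID.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared on the root element of the output above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def has_content(elem):
    """True if the element exists and has child elements or non-blank text."""
    if elem is None:
        return False
    return len(elem) > 0 or bool((elem.text or "").strip())


def summarize_tei(tei_xml):
    """Report which main parts of a TEI result carry any content (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    list_bibl = root.find(".//tei:listBibl", TEI_NS)
    has_refs = (
        list_bibl is not None
        and list_bibl.find("tei:biblStruct", TEI_NS) is not None
    )
    return {
        "abstract": has_content(abstract),
        "body": has_content(body),
        "references": has_refs,
    }


# Reduced stand-in for the output above (XML declaration and header details omitted).
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract/></profileDesc></teiHeader>
  <text>
    <body></body>
    <back><div type="references"><listBibl/></div></back>
  </text>
</TEI>"""

print(summarize_tei(SAMPLE))
```

For the result above, all three entries come back false, which matches the report: only a (garbled) title was extracted.
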
kermitt2 commented 7 years ago

It's actually a pdf2xml issue: the tokens in the XML file it produces are all empty by the time they reach GROBID.
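
To check an intermediate file for this symptom, one could count how many tokens carry no text. The sketch below is an assumption-laden illustration, not GROBID or pdf2xml code: it assumes the intermediate XML stores each token as a `TOKEN` element whose text content is the token string, and both the `count_tokens` helper and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_tokens(pdf2xml_output):
    """Return (total, empty) counts of TOKEN elements (element name is an assumption)."""
    root = ET.fromstring(pdf2xml_output)
    total = 0
    empty = 0
    for token in root.iter("TOKEN"):
        total += 1
        # A token counts as empty when it has no text or only whitespace.
        if not (token.text or "").strip():
            empty += 1
    return total, empty


# Hypothetical sample: two empty tokens and one normal token.
SAMPLE = """<DOCUMENT>
  <PAGE number="1">
    <TEXT>
      <TOKEN x="10" y="20"></TOKEN>
      <TOKEN x="30" y="20"/>
      <TOKEN x="50" y="20">GROBID</TOKEN>
    </TEXT>
  </PAGE>
</DOCUMENT>"""

total, empty = count_tokens(SAMPLE)
print(f"{empty} of {total} tokens are empty")
```

If every token in a real file turns out to be empty, the problem sits upstream of GROBID, as described above.
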

kermitt2 commented 6 years ago

Moving this issue to pdf2xml... Note: the file is copyrighted, so I have replaced it with a link to the PDF on the publisher's site.