inspirehep / invenio-grobid

Invenio package for integration of the Grobid metadata extraction service
GNU General Public License v2.0

Implement API calls #1

Open: jalavik opened this issue 8 years ago

jalavik commented 8 years ago

The API to interact with Grobid is described here: http://grobid.readthedocs.org/en/latest/Grobid-service/

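We need API calls for:

Parsing a single citation string with /processCitation:

import requests

# Pass the field as a dict so that requests form-encodes the citation text
res = requests.post(
    "http://grobid:8080/processCitation",
    data={"citations": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
print(res.text)

Output:
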
<biblStruct>
        <analytic>
                <title/>
                <author>
                        <persName>
                                <surname>Graff</surname>
                        </persName>
                </author>
        </analytic>
        <monogr>
                <title level="j">Expert. Opin. Ther. Targets</title>
                <imprint>
                        <biblScope unit="volume">6</biblScope>
                        <biblScope unit="issue">1</biblScope>
                        <biblScope unit="page" from="103" to="113" />
                        <date type="published" when="2002" />
                </imprint>
        </monogr>
</biblStruct>
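
The <biblStruct> fragment above could be flattened into a plain dict before being mapped onto a record. What follows is only a minimal sketch: the function name parse_bibl_struct and the dict keys are hypothetical, and it assumes the response has no XML namespace, as in the sample shown here.

```python
import xml.etree.ElementTree as ET


def parse_bibl_struct(tei):
    """Flatten a <biblStruct> fragment into a dict (hypothetical helper).

    Assumes the elements carry no XML namespace, as in the sample response.
    """
    root = ET.fromstring(tei)

    def text_of(path):
        # Return the text of the first match, or None if the element is absent
        element = root.find(path)
        return element.text if element is not None else None

    pages = root.find("monogr/imprint/biblScope[@unit='page']")
    date = root.find("monogr/imprint/date[@type='published']")

    return {
        "authors": [surname.text for surname in root.iter("surname")],
        "journal": text_of("monogr/title[@level='j']"),
        "volume": text_of("monogr/imprint/biblScope[@unit='volume']"),
        "issue": text_of("monogr/imprint/biblScope[@unit='issue']"),
        "pages": (pages.get("from"), pages.get("to")) if pages is not None else None,
        "year": date.get("when") if date is not None else None,
    }
```

For the sample response, parse_bibl_struct(res.text) gives "Graff" as the only author, "Expert. Opin. Ther. Targets" as the journal, volume "6", issue "1", pages ("103", "113") and year "2002".
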
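
Extracting the full text of a PDF with /processFulltextDocument:

# Upload the PDF as the multipart field "input"; the with-block closes the file
with open('1407.7587v1.pdf', 'rb') as pdf:
    r = requests.post(
        "http://grobid/processFulltextDocument",
        files={'input': pdf},
    )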
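
Both calls could be wrapped in small helpers so the rest of the package does not build URLs by hand. This is a rough sketch only, not the package's actual API: the names process_citation and process_fulltext, the default base URL, and the injectable post argument (handy for testing without a running Grobid) are all assumptions.

```python
GROBID_URL = "http://grobid:8080"  # assumption: host and port of the Grobid service


def _resolve_post(post):
    # Import requests lazily so a fake `post` can be injected in tests
    if post is None:
        import requests
        return requests.post
    return post


def process_citation(citation, base_url=GROBID_URL, post=None):
    """Send one raw citation string and return the TEI XML as text."""
    post = _resolve_post(post)
    response = post(base_url + "/processCitation", data={"citations": citation})
    response.raise_for_status()  # surface HTTP errors instead of returning an error page
    return response.text


def process_fulltext(pdf_path, base_url=GROBID_URL, post=None):
    """Upload a PDF and return the TEI XML of the whole document as text."""
    post = _resolve_post(post)
    with open(pdf_path, "rb") as pdf:
        response = post(base_url + "/processFulltextDocument", files={"input": pdf})
    response.raise_for_status()
    return response.text
```

With a reachable service, process_citation("Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113") would be the equivalent of the first request above.
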

kaplun commented 8 years ago

Sorry for the off-topic question: does this mean Grobid can work directly on text? If so, we could bypass the PDF bugs by sending our text version directly.

jacquerie commented 8 years ago

Since we are currently parsing the entire text, and then discarding it, it could make sense to implement the following endpoints:

and then measure whether making these two lighter requests is faster than calling the already implemented /processFulltextDocument.
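
A comparison like this could be timed with a small helper along the following lines. It is only a sketch: time_call and lighter_is_faster are hypothetical names, and the callables passed in stand for whatever functions end up wrapping the full-text request and the two lighter requests.

```python
import time


def time_call(func, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs of func()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def lighter_is_faster(full_call, light_calls, repeat=3):
    """Compare one full request against the summed time of the lighter ones.

    Returns (is_faster, full_seconds, light_seconds).
    """
    full_seconds = time_call(full_call, repeat)
    light_seconds = sum(time_call(call, repeat) for call in light_calls)
    return light_seconds < full_seconds, full_seconds, light_seconds
```

Taking the best of several runs reduces noise from network jitter, though running it against a representative set of PDFs would say more than a single document.
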