kermitt2 / grobid_client_python

Python client for GROBID Web services
Apache License 2.0

Self-promotion: new `grobid_tei_xml` python library #41

Open bnewbold opened 2 years ago

bnewbold commented 2 years ago

Wanted to share this new Python library, `grobid_tei_xml`, for parsing metadata out of GROBID "flavor" TEI-XML.

As mentioned in the README, there are a couple of other libraries that do similar or identical things, including generic TEI parsing libraries that are not specific to GROBID. At scholar.archive.org we needed to extract header and citation metadata in a structured but non-XML format (e.g., JSON or Python objects), so we wrote this. It uses only the Python 3 standard library, includes type annotations, and has decent test coverage. It supports both older (~v0.5 era) GROBID documents and more recent output. We have run tens of millions of PDFs through GROBID and parsed the output with this code.
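To give a rough idea of what "structured but non-XML" means here, below is a minimal sketch that pulls a title and author names out of a GROBID-style TEI header using only the standard library. This is not the `grobid_tei_xml` API: the `HeaderMetadata` class, the `parse_header` function, and the sample document are all hypothetical and made up for illustration, and real GROBID output has many more fields and edge cases.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# TEI namespace used by GROBID output
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily trimmed sample of a GROBID-style TEI header
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Alan</forename>
                <forename type="middle">M</forename>
                <surname>Turing</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


@dataclass
class HeaderMetadata:
    """Hypothetical container for a few header fields (illustration only)."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)


def parse_header(xml_text: str) -> HeaderMetadata:
    """Extract the main title and author names from a TEI header string."""
    root = ET.fromstring(xml_text)

    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()

    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        # Join forename(s) and surname in document order
        parts = [el.text.strip() for el in pers if el.text and el.text.strip()]
        if parts:
            authors.append(" ".join(parts))

    return HeaderMetadata(title=title, authors=authors)


meta = parse_header(SAMPLE_TEI)
# Dataclasses convert cleanly to plain dicts, and from there to JSON
print(json.dumps(asdict(meta), indent=2))
```

Returning a typed dataclass instead of a raw dict keeps the result easy to annotate and check, while `asdict` still gives a one-line path to JSON.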