kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0
3.61k stars, 461 forks

Introducing grobidmonkey: A Python Package for grobid output Parsing #1098

Open com3dian opened 7 months ago

com3dian commented 7 months ago

Last year, I asked the community for a Python solution for extracting and parsing content from Grobid's TEI-XML output. In the original issue, other users expressed the same need, so I developed a Python package named grobidmonkey to address it.

While it's still in its early versions, I believe grobidmonkey can be a valuable tool for the community. I'm eager to hear your thoughts and feedback to make it better.

GitHub Repository: grobidmonkey

The package is currently available only through pip and can be installed with:

pip install grobidmonkey

To use it, you can run:

from grobidmonkey import reader
monkeyReader = reader.MonkeyReader('monkey') # or 'lxml' or 'x2d'

# read paper outline
outline = monkeyReader.readOutline('/path/to/your/paper.pdf.tei.xml')

# read paper content
essay = monkeyReader.readEssay('/path/to/your/paper.pdf.tei.xml')
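
For readers unfamiliar with the TEI-XML format itself, here is a minimal sketch of what reading an outline involves, using only the Python standard library. It is not grobidmonkey's implementation. The function name `read_headings` and the `SAMPLE` document are hypothetical, and the sketch assumes section headings appear as `<head>` elements inside `<div>` elements of the TEI `<body>`.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI-XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def read_headings(tei_xml):
    """Return the section headings found in the body of a TEI document.

    Hypothetical helper for illustration; not part of grobidmonkey.
    """
    root = ET.fromstring(tei_xml)
    headings = []
    # Assumption: each section is a <div> whose title is a direct <head> child.
    for head in root.iterfind(".//tei:body//tei:div/tei:head", TEI_NS):
        text = "".join(head.itertext()).strip()
        if text:
            headings.append(text)
    return headings


# Tiny hand-written sample, not real Grobid output.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="1">Introduction</head><p>Some text.</p></div>
    <div><head n="2">Methods</head><p>More text.</p></div>
  </body></text>
</TEI>"""

print(read_headings(SAMPLE))  # ['Introduction', 'Methods']
```

To run it on a real file, read the file contents into a string and pass that string to `read_headings`.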

lfoppiano commented 6 months ago

@com3dian thanks for your contribution. I have not yet had the opportunity to test it. As soon as I do, I will be sure to send you my feedback.