Grobid, get the page number for the references

kermitt2 / grobid

A machine learning software for extracting information from scholarly documents

Apache License 2.0

Hi @rabia0001 !

1) First, look at the returned XML document (r.content) to check that the @coords attribute is present as expected. If it is, the problem is in your usage of BeautifulSoup.

2) Replace references.find_all('biblstruct') with references.find_all('biblStruct'); XML is case-sensitive.

3) For more robust BeautifulSoup code, you can use has_attr() to check that the attribute exists before reading it, which avoids the error (although there should not be an error!)
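
The three points above could be sketched roughly as follows. This is only an illustration, not the code from the original question: SAMPLE_TEI is a made-up fragment standing in for r.content, reference_coords is a hypothetical helper name, and the sketch assumes the document is parsed with BeautifulSoup's "xml" parser (which needs lxml installed), since that parser keeps tag names case-sensitive.

```python
from bs4 import BeautifulSoup

# Made-up TEI fragment standing in for r.content; real GROBID output is much larger.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0" coords="10,58.2,100.5,240.1,8.0;10,58.2,110.5,120.0,8.0"/>
    <biblStruct xml:id="b1"/>
  </listBibl></div></back></text>
</TEI>"""


def reference_coords(tei_xml):
    """Return one entry per reference, in document order:
    the raw coords string, or None when the attribute is missing."""
    # Assumption: the "xml" parser (backed by lxml) is used so that
    # tag names such as biblStruct keep their original case.
    soup = BeautifulSoup(tei_xml, "xml")
    coords = []
    for bibl in soup.find_all("biblStruct"):
        # has_attr() avoids a KeyError on references without coordinates.
        coords.append(bibl["coords"] if bibl.has_attr("coords") else None)
    return coords


if __name__ == "__main__":
    for index, value in enumerate(reference_coords(SAMPLE_TEI)):
        print(index, value)
```

If every entry comes back as None on a real document, that suggests the coordinates are missing from the XML itself rather than a BeautifulSoup problem, which is what point 1 checks.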

See also the documentation on the coordinate format, which explains how to get the page number: https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
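
As a small sketch of reading the page number from a coordinate string: my understanding of the linked documentation is that the attribute holds one or more bounding boxes separated by ";", each written as "page,x,y,width,height", with pages counted from 1. The helper name pages_from_coords and the sample values are made up for illustration.

```python
def pages_from_coords(coords):
    """Return the sorted, de-duplicated page numbers found in a coords string."""
    pages = set()
    for box in coords.split(";"):
        box = box.strip()
        if not box:
            # Tolerate empty input or a trailing separator.
            continue
        # Assumed box layout: "page,x,y,width,height"; the page is the first field.
        pages.add(int(box.split(",")[0]))
    return sorted(pages)


if __name__ == "__main__":
    # A reference that starts on page 10 and continues on page 11.
    print(pages_from_coords("10,58.2,700.5,240.1,8.0;11,58.2,40.0,120.0,8.0"))
```

A reference can span a page break, which is why the helper returns a list rather than a single number.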

Grobid, get the page number for the references #982