kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Grobid, get the page number for the references #982

Open rabia0001 opened 1 year ago

rabia0001 commented 1 year ago

Hi, I am trying to get the page number for references, both in sections and in citations. I turned on the TEI coordinates in process_fulltext_document, but I am not sure how to get the coordinates using BeautifulSoup.

    parsed_article = BeautifulSoup(r.content, 'lxml')
    if article.find('text') is not None:
        references = article.find('text').find('div', attrs={'type': 'references'})
        references = references.find_all('biblstruct') if references is not None else []
        reference_list = []
        for reference in references:
            print(reference['coords'])

When I try this, I get an error saying that the attribute is not there. Do you know how I can fix it?

kermitt2 commented 1 year ago

Hi @rabia0001 !

1) First, have a look at the returned XML document (r.content) to check that the @coords attribute is present as expected. If it is, the problem is in your usage of BeautifulSoup.

2) Replace references.find_all('biblstruct') with references.find_all('biblStruct'); XML is case-sensitive.

3) For more robust BeautifulSoup code, you can use has_attr() to avoid the error (although there should not be an error!).
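The three points above can be sketched roughly as follows. This is only an illustration, not GROBID's own client code: the TEI fragment and its coordinate values are made up to stand in for r.content, and reference_coords is a hypothetical helper name. It assumes the document is parsed with BeautifulSoup's "xml" parser (which needs lxml installed), because that parser keeps element names such as biblStruct in their original case, whereas BeautifulSoup's HTML parsers (including 'lxml') convert tag names to lowercase.

```python
from bs4 import BeautifulSoup

# Made-up TEI fragment standing in for r.content; real GROBID output is much larger.
SAMPLE_TEI = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct xml:id="b0" coords="12,53.80,194.57,240.10,9.29;12,53.80,206.52,120.00,9.29"/>
          <biblStruct xml:id="b1"/>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def reference_coords(tei_xml):
    """Return the coords value of every biblStruct that carries one (hypothetical helper)."""
    # The "xml" parser preserves case, so the element must be spelled 'biblStruct'.
    soup = BeautifulSoup(tei_xml, "xml")
    references = soup.find("div", attrs={"type": "references"})
    if references is None:
        return []
    # has_attr() skips entries without coordinates instead of raising KeyError.
    return [
        bibl["coords"]
        for bibl in references.find_all("biblStruct")
        if bibl.has_attr("coords")
    ]


print(reference_coords(SAMPLE_TEI))
```

In the sample, the second biblStruct has no coords attribute, so it is skipped rather than raising an error.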

See also the documentation on the coordinate format for how to get the page number: https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
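As a minimal sketch of reading the page number out of a coords value: this assumes, based on my reading of the documentation linked above, that the attribute is a ";"-separated list of bounding boxes and that each box is a comma-separated "page,x,y,width,height" tuple. The function name pages_from_coords and the example string are illustrative only.

```python
def pages_from_coords(coords):
    """Return the sorted, de-duplicated page numbers in a coords attribute value.

    Assumes each ';'-separated box has the form 'page,x,y,width,height'.
    """
    pages = set()
    for box in coords.split(";"):
        box = box.strip()
        if not box:
            continue  # tolerate a trailing ';' or empty segments
        # The first comma-separated field of each box is the page number.
        pages.add(int(box.split(",")[0]))
    return sorted(pages)


# Illustrative value: one box on page 1 and one on page 2.
example = "1,53.80,194.57,58.71,9.29;2,79.52,206.52,33.06,9.29"
print(pages_from_coords(example))  # prints [1, 2]
```

A reference that spans a page break has boxes on more than one page, which is why the helper returns a list rather than a single number.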