Open rabia0001 opened 1 year ago
Hi @rabia0001 !
1) First, have a look at the returned XML document (r.content) to check that the @coords attribute is present as expected. If it is, the problem is in your usage of BeautifulSoup.
2) Replace references.find_all('biblstruct') with references.find_all('biblStruct'): XML is case-sensitive.
3) For more robust BeautifulSoup code, you can use has_attr() to avoid the error (although there should not be an error!).
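Points 2 and 3 could be combined roughly as in the sketch below. It assumes the 'xml' parser (which needs lxml installed) rather than 'lxml', because BeautifulSoup's 'lxml' parser is an HTML parser and lowercases tag names, so 'biblStruct' would never match with it. SAMPLE_TEI and reference_coords are made-up names for illustration, and the sample is a tiny hand-written stand-in for r.content, not real GROBID output.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written TEI fragment standing in for r.content.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <back>
      <div type="references">
        <listBibl>
          <biblStruct coords="10,317.96,79.56,240.24,7.86"/>
          <biblStruct/>
        </listBibl>
      </div>
    </back>
  </text>
</TEI>"""


def reference_coords(tei):
    """Return one entry per biblStruct: its coords string, or None if absent."""
    # The 'xml' parser keeps the original case of tag names.
    soup = BeautifulSoup(tei, "xml")
    div = soup.find("div", attrs={"type": "references"})
    if div is None:
        return []
    results = []
    for ref in div.find_all("biblStruct"):
        # has_attr() avoids a KeyError when a reference has no coordinates.
        results.append(ref["coords"] if ref.has_attr("coords") else None)
    return results


print(reference_coords(SAMPLE_TEI))
```

If every entry comes back as None on a real document, it is probably worth checking that coordinates were actually requested for biblStruct elements when calling the service.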
See also the documentation on the coordinate format for how to get the page number: https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
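As a rough illustration of reading the page number out of a coords value: the sketch below assumes the attribute holds one or more bounding boxes separated by ';', each a comma-separated list whose first field is the page number. Please check this against the documentation linked above. pages_from_coords is a hypothetical helper name and the sample string is made up.

```python
def pages_from_coords(coords):
    """Return the sorted, distinct page numbers found in a coords string.

    Assumes boxes are separated by ';' and that the first comma-separated
    field of each box is the page number.
    """
    pages = set()
    for box in coords.split(";"):
        box = box.strip()
        if not box:
            continue  # tolerate a trailing ';' or stray whitespace
        pages.add(int(box.split(",")[0]))
    return sorted(pages)


# Made-up value: two boxes on page 3 and one on page 4.
example = "3,53.80,194.57,58.71,7.47;3,53.80,204.00,120.10,7.47;4,53.80,60.00,90.00,7.47"
print(pages_from_coords(example))
```

A reference that wraps across a page break would then simply yield more than one page number.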
Hi, I am trying to get the page number for references, both in the bibliography section and for in-text citations. I turned on the TEI coordinates in process_fulltext_document, but I am not sure how to get the coordinates using BeautifulSoup.
from bs4 import BeautifulSoup

parsed_article = BeautifulSoup(r.content, 'lxml')
if parsed_article.find('text') is not None:
    references = parsed_article.find('text').find('div', attrs={'type': 'references'})
    references = references.find_all('biblstruct') if references is not None else []
    reference_list = []
    for reference in references:
        print(reference['coords'])
When I try this, I get an error saying the attribute is not there. Do you know how I can fix it?