kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Page numbers are not shown in TEI files #278

Open dksanyal opened 6 years ago

dksanyal commented 6 years ago

Hi, we are extracting the table of contents from a paper by reading the text between and . But this does not give any page number information. I would be thankful if you could suggest how we can extract page numbers (absolute if possible, relative if page numbers are absent in the PDF). Any pointers regarding which source files to look at would be great! Thanks in advance!

yaojl2006 commented 6 years ago

Hi, I think you can get it from LayoutToken or Page according to your needs.

kermitt2 commented 6 years ago

This is correct, @yaojl2006, thanks.

Pages are not present in the TEI on purpose, because the TEI aims at capturing the logical structure of a document. The pagination is only one possible presentation of a document. It is actually impossible to represent both the logical structure of a document and a presentation rendering at the same time in a single XML document (under a single hierarchy).

As @yaojl2006 mentioned, however, each token in GROBID is synchronized with the source PDF document, so you can access its original pagination information. See the coordinates that can be output in the resulting TEI as attributes for some fields (not all fields are supported in the TEI yet).
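
For illustration, here is a minimal Python sketch of reading page numbers back out of such coordinate attributes. It rests on several assumptions not confirmed in this thread: that the TEI was produced with coordinates enabled for `head` elements, that the attribute is named `coords`, that it holds one or more comma-separated boxes joined by `;`, and that the first field of each box is the page number. The function names and the sample TEI are hypothetical; check an actual GROBID output before relying on this.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def pages_from_coords(coords):
    """Return the sorted page numbers found in a coords attribute value.

    Assumption: boxes are separated by ';' and the first comma-separated
    field of each box is the page number.
    """
    pages = set()
    for box in coords.split(";"):
        box = box.strip()
        if not box:
            continue
        pages.add(int(box.split(",")[0]))
    return sorted(pages)


def head_pages(tei_xml):
    """Return (heading text, page numbers) for every <head> that has coords."""
    root = ET.fromstring(tei_xml)
    result = []
    for head in root.iter(TEI_NS + "head"):
        coords = head.get("coords")
        if coords:  # headings without coordinates are skipped
            title = "".join(head.itertext()).strip()
            result.append((title, pages_from_coords(coords)))
    return result


# Hypothetical sample, not real GROBID output.
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<div><head coords="2,72.0,100.0,200.0,12.0">Introduction</head></div>'
    '<div><head coords="3,72.0,700.0,200.0,12.0;4,72.0,80.0,150.0,12.0">'
    "Methods</head></div>"
    "<div><head>No coordinates</head></div>"
    "</body></text></TEI>"
)

if __name__ == "__main__":
    for title, pages in head_pages(SAMPLE_TEI):
        print(title, pages)
```

Collecting the pages into a set handles a heading whose boxes span a page break, which would otherwise report the same page more than once.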

honzajde commented 4 months ago

Coordinates include page number, see https://github.com/kermitt2/grobid/issues/397

Just to make it clear:)