Coordinates of caption elements

keto33 commented 1 year ago

This may seem unnecessary, but it should be a feasible feature suggestion.

GROBID outputs all coordinates of structures except for text blocks. I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream. In such cases, the bounding box of the figure caption can be helpful in estimating the actual bounding box of the EPS figure.
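
As a rough sketch of that estimation, assuming the caption sits directly below the figure and the figure is about as wide as the caption: the helper below derives a candidate region from a caption box. `Box`, `region_above_caption`, `top_limit`, and `margin` are hypothetical names for illustration, not part of GROBID. The sketch also assumes boxes are given as page, x, y, width, height with the origin at the top-left corner of the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A bounding box (hypothetical helper type, not a GROBID class)."""
    page: int
    x: float  # left edge, measured from the left of the page
    y: float  # top edge, measured from the top of the page (assumption)
    w: float  # width
    h: float  # height


def region_above_caption(caption: Box, top_limit: float = 0.0,
                         margin: float = 2.0) -> Box:
    """Estimate the figure area as the region directly above the caption.

    top_limit is the lowest y at which the figure may start, for example
    the bottom edge of the preceding text element (an assumption of this
    sketch). margin leaves a small gap between figure and caption.
    """
    # Clamp the top edge so it never falls below the caption's top edge.
    top = min(max(top_limit, 0.0), caption.y)
    # The bottom edge stops just above the caption, but never above `top`.
    bottom = max(top, caption.y - margin)
    return Box(caption.page, caption.x, top, caption.w, bottom - top)


if __name__ == "__main__":
    caption = Box(page=3, x=72.0, y=400.0, w=200.0, h=30.0)
    print(region_above_caption(caption, top_limit=150.0))
```

This only gives a starting guess. Figures wider than their caption, or captions placed above the figure, would need a different rule.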

kermitt2 commented 1 year ago

Hi @keto33!

Thanks for the issue.

> GROBID outputs all coordinates of structures except for text blocks.

Yes, text blocks are not part of the TEI XML output because they are presentation/layout elements, not something related to the logical structure of the document (like paragraphs, titles, etc.).

> I am mostly interested in the coordinates of figure captions. When figures are embedded as EPS in vector format rather than raster/bitmap, GROBID does not correctly detect the bounding box of the figure, as drawings and texts are somehow blended into the PDF structure rather than being a distinguishable stream.

Yes, the coordinates of the caption elements are indeed not output currently, and there is no reason not to add them.

Regarding the "graphic part" of a figure, this is more or less implemented in PR #963 (the PR as a whole is not usable at this stage; it is very much a work in progress). There, the vector graphics are further analyzed to detect their boundaries, deal with overlapping text, etc., so that we get reliable "figure graphic" aggregated elements similar to the embedded bitmaps. There are many other things in this PR, and it will take a lot of time to complete!

ClementFrvl commented 3 months ago

Hello!

Is there an ongoing effort or a specific branch where coordinates of text blocks can be extracted as part of the TEI/XML output?

I checked the documentation and saw that p elements are listed as supported by teiCoordinates, so I am running this command:

curl --form input=@./Papers/test.pdf --form teiCoordinates='head' --form teiCoordinates='p' host:8070/api/processFulltextDocument

However, there are no coordinates for the p elements, which are the ones I'm interested in.
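
To confirm which elements actually carry coordinates in a result, a small check along these lines can help. It is a minimal sketch, assuming the TEI output stores boxes in a `coords` attribute as `page,x,y,w,h` groups separated by `;`. `coords_by_tag`, `parse_coords`, and the `SAMPLE` document are made up for illustration and are not GROBID output or API.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def parse_coords(value):
    """Split a coords attribute into (page, x, y, w, h) tuples."""
    boxes = []
    for chunk in value.split(";"):
        parts = chunk.split(",")
        if len(parts) < 5:
            continue  # skip empty or malformed groups
        page = int(parts[0])
        x, y, w, h = (float(p) for p in parts[1:5])
        boxes.append((page, x, y, w, h))
    return boxes


def coords_by_tag(tei_xml, tags=("head", "p")):
    """Return {tag: (elements_with_coords, total_elements)} for each tag."""
    root = ET.fromstring(tei_xml)
    summary = {}
    for tag in tags:
        elements = list(root.iter(TEI_NS + tag))
        with_coords = [e for e in elements if e.get("coords")]
        summary[tag] = (len(with_coords), len(elements))
    return summary


# Hand-written sample mimicking the reported symptom: head has coords, p does not.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>
<head coords="1,53.80,194.57,58.71,9.29">Introduction</head>
<p>No coordinates here.</p>
</div></body></text></TEI>"""

if __name__ == "__main__":
    print(coords_by_tag(SAMPLE))
    print(parse_coords("1,53.80,194.57,58.71,9.29;1,53.80,205.00,120.00,9.29"))
```

Running the same check on the response saved from the curl command above shows at a glance whether the p elements came back without coordinates.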

Please let me know if there is a solution or anything I can do to assist!

lfoppiano commented 3 months ago

Hi @ClementFrvl, which version are you using? This seems to be a problem in GROBID version 0.8.0 that no longer occurs on the current master branch. 🤔

ClementFrvl commented 3 months ago

Hey, I am using 0.8.0, so that may be the reason.

My server is ARM-based, though. I just tried version 0.7.3, but I'm having the same issue.

Is there a newer ARM version available?

lfoppiano commented 3 months ago

We've been working on a new version for a few weeks; hopefully we will be able to release it soon.

lfoppiano commented 3 weeks ago

This should be solved in version 0.8.1.

kermitt2 / grobid

Coordinates of caption elements #1008