kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Extract a figure or table by pymupdf in python, from coordinate. #1118

Closed thejiangcj closed 5 months ago

thejiangcj commented 6 months ago

I have looked at the official documentation on coordinates, which says:

A bounding box is defined by the following attributes:

- p: the number of the page (beware, in the PDF world the first page has index 1!),
- x: the x-axis coordinate of the upper-left point of the bounding box,
- y: the y-axis coordinate of the upper-left point of the bounding box (beware, in the PDF world the y-axis extends downward!),
- h: the height of the bounding box,
- w: the width of the bounding box.

However, pymupdf expects coordinates as (x1, y1, x2, y2), where (x1, y1) is the upper-left corner and (x2, y2) is the bottom-right corner.

So, in order to crop the image, I should compute:

cor2 = cor.split(",")
page = int(cor2[0])
x1 = float(cor2[1])
y1 = float(cor2[2])
x2 = float(cor2[1])+float(cor2[4]) # (x1+w)
y2 = float(cor2[2])+float(cor2[3]) # (y1+h)

However, this gives the wrong result, so I revised the code to:

cor2 = cor.split(",")
page = int(cor2[0])
x1 = float(cor2[1])
y1 = float(cor2[2])
x2 = float(cor2[1])+float(cor2[3]) # (x1+h)
y2 = float(cor2[2])+float(cor2[4]) # (y1+w)

which gives the right image, cropped correctly.

So I wonder why my first computation is wrong. Could someone give me a detailed explanation?

lfoppiano commented 6 months ago

I think you're right, the third coordinate is the width and not the height.

I just found some code where I did this:

{"page": box[0], "x": box[1], "y": box[2], "width": box[3], "height": box[4]}

This is also confirmed by the grobid code:

public String toString() {
        return String.format("%d,%.2f,%.2f,%.2f,%.2f", page, x, y, width, height);
    }
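As a minimal sketch of that ordering, the string produced by `toString()` can be split into named fields in Python. The helper name `parse_boxes` is hypothetical, and the handling of several boxes joined by `;` is an assumption about how multiple boxes are listed in one attribute.

```python
from typing import Dict, List


def parse_boxes(coords: str) -> List[Dict[str, float]]:
    """Parse 'page,x,y,width,height' boxes into dicts.

    Assumption: several boxes may be joined by ';' in a single string.
    """
    boxes = []
    for chunk in coords.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # tolerate a trailing ';'
        parts = chunk.split(",")
        if len(parts) != 5:
            raise ValueError(f"expected 5 values, got {len(parts)}: {chunk!r}")
        page, x, y, width, height = parts
        boxes.append({
            "page": int(page),       # 1-based page number
            "x": float(x),           # upper-left x
            "y": float(y),           # upper-left y
            "width": float(width),   # fourth value in the string
            "height": float(height), # fifth value in the string
        })
    return boxes
```

Naming the fields once at parse time avoids relying on positional indexes like `cor2[3]` and `cor2[4]` later on.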

Actually, the documentation is correct. Looking at the attribute list under "Coordinates in JSON results", you might have assumed that it gives the order of the values in the coordinate string. However, that list follows the order of the fields in the JSON result, where h comes before w. If you look further down, under "Coordinates in TEI/XML results", the examples correctly map the third element to the width and the fourth to the height.
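For cropping with pymupdf, the conversion discussed in this thread could look roughly like the sketch below. The function names `grobid_box_to_rect` and `crop_box` and the `zoom` parameter are hypothetical. It assumes the box values arrive in the order page, x, y, width, height, and that the page is unrotated so both libraries share the same upper-left origin.

```python
def grobid_box_to_rect(page, x, y, width, height):
    """Convert (page, x, y, width, height) to (page_index, (x0, y0, x1, y1)).

    The page number is 1-based in the coordinate string, while pymupdf
    indexes pages from 0, hence the `page - 1`.
    """
    return page - 1, (x, y, x + width, y + height)


def crop_box(pdf_path, box, out_path, zoom=2.0):
    """Render the region described by `box` to an image file."""
    import fitz  # PyMuPDF; imported here so the helper above has no dependency

    page_index, rect = grobid_box_to_rect(*box)
    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        # `clip` restricts rendering to the rectangle; `matrix` scales the output.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*rect))
        pix.save(out_path)
```

A call such as `crop_box("paper.pdf", (1, 53.80, 194.57, 450.50, 8.47), "figure.png")` would then write the cropped region to `figure.png`; the file names and values here are placeholders only.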