kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

How does GROBID currently calculate the coordinates for objects? #397

Open iantbutler01 opened 5 years ago

iantbutler01 commented 5 years ago

My basic problem is that I want to use another library to extract the images (figures, tables, etc.) from my PDF using the coordinates produced by GROBID in the XML TEI encoding.

The issue I am running into is that GROBID and the library (https://github.com/jsvine/pdfplumber/) seem to scale their coordinate systems differently(?). I want to figure out how to convert between the two so that I can extract the images properly, so I was wondering how GROBID currently calculates its coordinates.

(I would also welcome suggestions for an alternative solution for what I am trying to accomplish.)

Thank you in advance; using GROBID has been great so far.

kermitt2 commented 5 years ago

Hello @GrandathePanda !

The explanations about the coordinate format are given there: https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/

The coordinates are given based on the original PDF scale and unit. One needs to pay attention to the direction of the X and Y axes (PDF is not conventional, with the origin at the top left, for instance).

As explained in the doc, there is indeed a scaling factor that needs to be taken into account with PDF applications that display content (and the dimensions of the pages may vary within a PDF). The JavaScript demo contains an example of rescaling to the pixel projection of the PDF, as displayed by PDF.js.
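The rescaling mentioned above amounts to multiplying each coordinate by the ratio between the rendered page size in pixels and the page size in PDF units. Here is a minimal Python sketch, assuming the box is already available as x, y, width, height in PDF units with a top-left origin. The function name, the 612 x 792 page size and the rendered size are illustrative assumptions, not taken from GROBID or PDF.js:

```python
from typing import Tuple


def scale_box_to_pixels(
    box: Tuple[float, float, float, float],
    page_size: Tuple[float, float],
    rendered_size: Tuple[float, float],
) -> Tuple[float, float, float, float]:
    """Project (x, y, w, h) in PDF units onto a rendered page in pixels."""
    x, y, w, h = box
    page_w, page_h = page_size
    px_w, px_h = rendered_size
    # One scale factor per axis; they are equal when the aspect ratio is kept.
    sx = px_w / page_w
    sy = px_h / page_h
    return (x * sx, y * sy, w * sx, h * sy)


# Hypothetical example: a 612 x 792 page rendered at twice its size.
print(scale_box_to_pixels((210.94, 237.07, 194.22, 7.96), (612, 792), (1224, 1584)))
```

Because pages in one PDF may have different dimensions, the page size would need to be looked up per page rather than once per document.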

However, pdfplumber should not rescale, so the x, y, h, w coordinates should be compatible. I think it's important to double-check the origin used (top left of the page with PDF), the direction (y is top-down, x is left to right) and the page number (page numbers in PDF start at 1, not at the conventional 0).
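To make these conventions concrete, here is a minimal Python sketch that parses a `coords` attribute. It assumes each box is written as `page,x,y,width,height` and that several boxes are separated by `;` (worth checking against the coordinates documentation linked above). The class and function names are illustrative and not part of GROBID:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    page: int      # 1-based page number
    x: float       # left edge, measured from the left of the page
    y: float       # top edge, measured from the top of the page
    width: float
    height: float

    @property
    def page_index(self) -> int:
        """0-based index for libraries that count pages from zero."""
        return self.page - 1


def parse_coords(coords: str) -> List[Box]:
    """Parse 'page,x,y,w,h;page,x,y,w,h;...' into a list of boxes."""
    boxes = []
    for chunk in coords.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing semicolon
        page, x, y, w, h = chunk.split(",")
        boxes.append(Box(int(page), float(x), float(y), float(w), float(h)))
    return boxes
```

Keeping the 1-based page number and exposing a separate `page_index` makes the off-by-one conversion explicit at the point where another library is called.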

If I am not mistaken, ResearchGate is cropping figures and tables using GROBID's coordinates with PDFBox. You can see PDFBox annotations in grobid/grobid-core/src/main/java/org/grobid/core/visualization/FigureTableVisualizer.java, which use the coordinates from the TEI directly, without any particular rescaling.

One of the improvements planned in GROBID is to integrate the cropping of identified figure and table areas based on the bounding box coordinates. The problem is that something like PDFBox is too slow and too memory-hungry to scale efficiently (and of course pdfplumber is considerably worse than PDFBox), so I am looking for an efficient way of doing it based on some native library.

iantbutler01 commented 5 years ago

Okay, so I looked into everything, including the scaling code in GROBID's JS example, and the conclusion I'm reaching right now is that maybe my PDF is just not being parsed correctly with respect to the box coordinates in GROBID.

These are the coordinates that pdfplumber comes up with:

{'name': 'Im3', 'x0': Decimal('104.809'), 'y0': Decimal('554.797'), 'x1': Decimal('694.117'), 'y1': Decimal('707.501'), 'width': Decimal('589.308'), 'height': Decimal('152.704'), 'object_type': 'figure', 'page_number': 6, 'top': Decimal('84.499'), 'bottom': Decimal('237.203'), 'doctop': Decimal('4044.499')}

While these are the coordinates that GROBID produces:

xml:id="fig_3" coords="6,210.94,237.07,194.22,7.96"

(It looks like pdfplumber's bottom coordinate is your top coordinate.)

With that being the case, the top of the image seems to be caught reasonably well, but the total height and width in GROBID are off.
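The two sets of numbers can be put on the same footing with a little arithmetic. In the pdfplumber output above, `y0` and `y1` are measured from the bottom of the page while `top` and `bottom` are measured from the top, so `top + y1` gives the page height. Here is a minimal Python sketch, assuming the GROBID string is ordered `page,x,y,width,height` with a top-left origin; the helper name is illustrative:

```python
from decimal import Decimal

# Values copied from the pdfplumber output above.
image = {
    "x0": Decimal("104.809"), "x1": Decimal("694.117"),
    "y0": Decimal("554.797"), "y1": Decimal("707.501"),
    "top": Decimal("84.499"), "bottom": Decimal("237.203"),
}

# top is measured from the top of the page and y1 from the bottom,
# so their sum is the page height.
page_height = image["top"] + image["y1"]


def grobid_to_top_left_bbox(coords: str):
    """Turn 'page,x,y,w,h' into (page, x0, top, x1, bottom)."""
    page, x, y, w, h = coords.split(",")
    x, y, w, h = (Decimal(v) for v in (x, y, w, h))
    return int(page), x, y, x + w, y + h


page, x0, top, x1, bottom = grobid_to_top_left_bbox("6,210.94,237.07,194.22,7.96")

print("page height:", page_height)
print("image spans top..bottom:", image["top"], image["bottom"])
print("GROBID box spans top..bottom:", top, bottom)
# Gap between the bottom of the image and the top of the GROBID box.
print("gap:", top - image["bottom"])
```

Under these assumptions the GROBID box starts almost exactly where the image ends and is only about 8 units tall, which suggests it covers a line of text directly below the image rather than the image itself.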

This is the visualization of the bounding box from pdfplumber: [image: plumberrect]

And this is the visualization of the bounding box GROBID produced: [image: grobidbox]

These come from this PDF: 1901.00085.pdf

kermitt2 commented 5 years ago

GROBID coordinates look valid for this PDF. For this figure, the coordinates match the figure caption (the SVG graphics are not in the annotated areas; I am not sure why, but it might be due to the fact that we just moved to pdfalto last week). With PDFBox we have for coords="6,210.94,237.17,194.22,7.86":

[image: screenshot from 2019-02-20 06-30-21]

Tables are better recognized:

[image: screenshot from 2019-02-20 06-32-31]

These coordinates can be directly used with PDFBox and PDF.js. I've never used pdfplumber, so unfortunately I cannot really suggest anything beyond my first message.

iantbutler01 commented 5 years ago

Ok, thank you for the help, I really appreciate it. For right now, since I definitely need the SVGs, I'm just going to do something simpler: extract all figures with their page numbers and use the extraction order plus page number to match them back to the section they belong to. It might be a little less precise that way, but I think it will generalize well to other papers that might have this issue.
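The fallback described here can be sketched as follows. This is a minimal Python illustration under stated assumptions: the extracted images and the figure references are both available as lists of (page number, identifier) pairs, and both lists are in extraction order. The record shapes, the function name and the identifiers in the example are hypothetical:

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def match_by_page_and_order(
    images: List[Tuple[int, str]],
    figures: List[Tuple[int, str]],
) -> Dict[str, str]:
    """Pair the n-th figure on a page with the n-th image extracted from it.

    Both inputs are lists of (page_number, identifier) in extraction order.
    Figures without a counterpart on the same page are left unmatched.
    """
    images_by_page = defaultdict(list)
    for page, image_id in images:
        images_by_page[page].append(image_id)

    matches = {}
    seen_on_page = defaultdict(int)
    for page, figure_id in figures:
        position = seen_on_page[page]
        seen_on_page[page] += 1
        if position < len(images_by_page[page]):
            matches[figure_id] = images_by_page[page][position]
    return matches


# Hypothetical identifiers, for illustration only.
images = [(6, "Im3"), (7, "Im4"), (7, "Im5")]
figures = [(6, "fig_3"), (7, "fig_4"), (7, "fig_5"), (8, "fig_6")]
print(match_by_page_and_order(images, figures))
```

The pairing is only as reliable as the ordering assumption: if a page holds an image that is not a figure (a logo, for example), every later pairing on that page shifts by one.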

Feel free to close the issue and I can also open a separate issue for the SVG not being captured by the coordinates if you want.

de-code commented 5 years ago

@GrandathePanda is any of your code available somewhere?

iantbutler01 commented 5 years ago

@de-code I can dig up what I did with pdfplumber to compare the bounding boxes and make a gist for you if that works. I can likely get around to that by the weekend.

dennyluan commented 4 years ago

hey @GrandathePanda I'm curious about your workflow for this too!

manuelrech commented 7 months ago

Considering that, for me, article_dict is a dictionary where I have saved the figures with their coordinates:

import io
import os
from typing import Dict

import fitz  # PyMuPDF
from PIL import Image


def parse_figures(
    pdf_path: str,
    article_dict: Dict,
    output_folder: str = "figures",
):
    # Create the output folder if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    figures = article_dict['figures']
    for figure in figures:
        # Coordinates are stored as "page,x,y,w,h"
        page_number, x, y, w, h = map(float, figure['figure_coordinates'].split(','))
        fig_id = figure['figure_id']

        # Select the page (page numbers in the coordinates start at 1)
        page = pdf_document.load_page(int(page_number) - 1)

        # Render the clipped region of the page as an image
        #TODO: fix the coordinates of the figure
        pix = page.get_pixmap(clip=fitz.Rect(x, y, x+w, y+h))

        # Save the rendered image to a bytes object
        img_bytes = pix.tobytes("png")

        # Open the image with PIL for further manipulation or direct saving
        image = Image.open(io.BytesIO(img_bytes))
        image.save(os.path.join(output_folder, f'{fig_id}.png'))  # Save the image

    pdf_document.close()
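One thing the snippet above does not handle is a `coords` value that contains several boxes separated by `;`, in which case `split(',')` no longer yields exactly five values. The following Python sketch merges such boxes into one rectangle per page. It is not a fix for the TODO; the function name is hypothetical, and the `page,x,y,w,h` order is the same assumption as in the snippet above:

```python
from typing import Dict, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), top-left origin


def union_rects_per_page(coords: str) -> Dict[int, Rect]:
    """Merge all 'page,x,y,w,h' boxes in a coords string into one rectangle per page."""
    merged: Dict[int, Rect] = {}
    for chunk in coords.split(";"):
        if not chunk.strip():
            continue  # tolerate a trailing semicolon
        page_str, x, y, w, h = chunk.split(",")
        page = int(float(page_str))
        x0, y0 = float(x), float(y)
        x1, y1 = x0 + float(w), y0 + float(h)
        if page in merged:
            a0, b0, a1, b1 = merged[page]
            merged[page] = (min(a0, x0), min(b0, y0), max(a1, x1), max(b1, y1))
        else:
            merged[page] = (x0, y0, x1, y1)
    return merged
```

Each resulting tuple could then be passed as `fitz.Rect(*rect)` to the `clip` argument of `get_pixmap`. As noted earlier in the thread, the box reported for a figure may cover only its caption, so the crop can still miss the graphic itself.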