Open iantbutler01 opened 5 years ago
Hello @GrandathePanda !
The coordinate format is explained here: https://grobid.readthedocs.io/en/latest/Coordinates-in-PDF/
The coordinates are given based on the original PDF scale and unit. One needs to pay attention to the direction of the X and Y axes (PDF is not conventional, with the origin at the top left, for instance).
As explained in the doc, there is indeed a scaling factor that needs to be taken into account with PDF applications displaying content (and the dimensions of the pages may vary within a PDF). The JavaScript demo contains an example of rescaling with the pixel projection of the PDF, as displayed by PDF.js.
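The rescaling mentioned above amounts to a ratio between the rendered size and the page size in PDF units. Here is a minimal Python sketch, assuming the page dimensions are known and the viewer renders the full page without rotation or cropping. `scale_box` and its parameters are hypothetical names used for illustration only; they are not part of GROBID or PDF.js.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), top-left origin


def scale_box(box: Box, page_size: Tuple[float, float],
              rendered_size: Tuple[float, float]) -> Box:
    """Project a box from PDF units onto a rendered page image in pixels."""
    x, y, w, h = box
    page_w, page_h = page_size
    rendered_w, rendered_h = rendered_size
    # One scale factor per axis: rendered pixels per PDF unit.
    sx = rendered_w / page_w
    sy = rendered_h / page_h
    return (x * sx, y * sy, w * sx, h * sy)


# Illustrative values: a 612 x 792 page rendered at twice its size.
scaled = scale_box((100.0, 50.0, 200.0, 25.0), (612.0, 792.0), (1224.0, 1584.0))
print(scaled)
```

Because page dimensions may vary within a PDF, the scale factors would have to be computed per page rather than once per document.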
However, pdfplumber should not rescale, so the x, y, h, w coordinates should be compatible. I think it's important to double-check the origin used (top left of the page with PDF), the direction (y is top down, x is left to right) and the page number (page numbers in PDF start at 1, not at the conventional 0).
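The checks listed above can be put into a small Python sketch. It assumes the GROBID `coords` attribute holds `page,x,y,width,height` with a top-left origin and 1-based pages, possibly with several boxes joined by `;`. It also assumes pdfplumber expects a `(x0, top, x1, bottom)` bounding box with the same origin. The helper names are hypothetical.

```python
from typing import List, Tuple


def parse_grobid_coords(coords: str) -> List[Tuple[int, float, float, float, float]]:
    """Parse a GROBID coords attribute into (page, x, y, width, height) tuples.

    Several boxes may be joined with ';'. Page numbers start at 1.
    """
    boxes = []
    for part in coords.split(";"):
        if not part.strip():
            continue
        page, x, y, w, h = part.split(",")[:5]
        boxes.append((int(page), float(x), float(y), float(w), float(h)))
    return boxes


def to_pdfplumber_bbox(x: float, y: float, w: float, h: float):
    """Convert (x, y, width, height) to an (x0, top, x1, bottom) box."""
    return (x, y, x + w, y + h)


page, x, y, w, h = parse_grobid_coords("6,210.94,237.07,194.22,7.96")[0]
bbox = to_pdfplumber_bbox(x, y, w, h)
print(page, bbox)

# With pdfplumber installed, the crop would look roughly like this
# ("paper.pdf" is a placeholder; pdf.pages is a 0-based list):
#   with pdfplumber.open("paper.pdf") as pdf:
#       cropped = pdf.pages[page - 1].crop(bbox)
```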
If I am not wrong, ResearchGate is cropping figures and tables using GROBID's coordinates with PDFBox. You can see PDFBox annotations made in grobid/grobid-core/src/main/java/org/grobid/core/visualization/FigureTableVisualizer.java directly with the coordinates in the TEI, without any particular rescaling.
One of the improvements planned in GROBID is to integrate the cropping of identified figure and table areas based on the bounding box coordinates. The problem is that something like PDFBox is too slow and too memory hungry to scale efficiently (and of course pdfplumber is considerably worse than PDFBox), so I am looking for an efficient way of doing it based on some native library.
Okay, so I looked into everything and looked at the scaling example from the JS example in GROBID and the conclusion I'm reaching right now is that maybe my PDF is just not being parsed correctly with respect to the box coordinates in GROBID.
These are the coordinates that pdfplumber comes up with:
{'name': 'Im3', 'x0': Decimal('104.809'), 'y0': Decimal('554.797'), 'x1': Decimal('694.117'), 'y1': Decimal('707.501'), 'width': Decimal('589.308'), 'height': Decimal('152.704'), 'object_type': 'figure', 'page_number': 6, 'top': Decimal('84.499'), 'bottom': Decimal('237.203'), 'doctop': Decimal('4044.499')}
While these are the coordinates that GROBID produces:
xml:id="fig_3" coords="6,210.94,237.07,194.22,7.96"
(It looks like pdfplumber's bottom coordinate is your top coordinate.)
With that being the case, the top of the image seems to be caught reasonably well, but the total height and width in GROBID are off.
This is the visualization of the bounding box from pdfplumber:
And this is the visualization of the bounding box GROBID produced:
These come from this PDF: 1901.00085.pdf
GROBID coordinates look valid for this PDF. For this figure, the coordinates match the figure caption (the SVG graphics are not in the annotated areas; I'm not sure why, it might be because we just moved to pdfalto last week). With PDFBox, we have coords="6,210.94,237.17,194.22,7.86" for this figure.
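The numbers posted earlier in the thread are consistent with the caption explanation. This quick Python check uses only those values and assumes both tools share the same top-left origin and units. It shows that the GROBID box starts almost exactly where the pdfplumber image box ends and is only about 8 units tall, as one would expect from a single caption line directly below the image.

```python
# pdfplumber image box for 'Im3' on page 6 (values quoted in the thread)
plumber_top, plumber_bottom = 84.499, 237.203
plumber_x0, plumber_x1 = 104.809, 694.117

# GROBID box: coords="6,210.94,237.07,194.22,7.96" -> page, x, y, width, height
g_x, g_y, g_w, g_h = 210.94, 237.07, 194.22, 7.96

# Vertical distance between the image bottom and the top of the GROBID box
gap = g_y - plumber_bottom
# Whether the GROBID box lies within the horizontal extent of the image
inside_x = plumber_x0 <= g_x and g_x + g_w <= plumber_x1

print(f"gap between image bottom and GROBID top: {gap:.3f}")
print(f"GROBID box height: {g_h}")
print(f"GROBID box horizontally within the image: {inside_x}")
```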
Tables are better recognized:
These coordinates can be directly used with PDFBox and PDF.js. I've never used pdfplumber, so unfortunately I cannot really suggest anything beyond my first message.
OK, thank you for the help, I really appreciate it. For now, since I definitely need the SVGs, I'm just going to do something simpler: extract all figures with their page numbers and use the extraction order plus page number to match them back to the section they belong to. It might be a little less precise that way, but I think it will generalize well to other papers that might have this issue.
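A rough Python sketch of that fallback, matching by page number and extraction order, could look like this. It is not the code actually used in this thread: `match_by_page_order` is a hypothetical helper, and the sample identifiers are illustrative. It assumes both lists are already in document order and that figures on the same page come out in the same order from both tools.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def match_by_page_order(
    grobid_figures: List[Tuple[str, int]],
    extracted_images: List[Tuple[str, int]],
) -> Dict[str, str]:
    """Pair GROBID figure ids with extracted image names.

    Both inputs are (name, page_number) lists in document order. Items on the
    same page are paired by position; unmatched leftovers are ignored.
    """
    images_by_page = defaultdict(list)
    for name, page in extracted_images:
        images_by_page[page].append(name)

    figures_by_page = defaultdict(list)
    for fig_id, page in grobid_figures:
        figures_by_page[page].append(fig_id)

    matches = {}
    for page, fig_ids in figures_by_page.items():
        for fig_id, image_name in zip(fig_ids, images_by_page.get(page, [])):
            matches[fig_id] = image_name
    return matches


# Illustrative data only
figures = [("fig_2", 5), ("fig_3", 6), ("fig_4", 6)]
images = [("Im2", 5), ("Im3", 6), ("Im4", 6), ("Im5", 7)]
matches = match_by_page_order(figures, images)
print(matches)
```

The pairing breaks down when a page has a different number of detected figures on each side, so a count mismatch per page is worth logging.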
Feel free to close the issue and I can also open a separate issue for the SVG not being captured by the coordinates if you want.
@GrandathePanda is any of your code available somewhere?
@de-code I can dig up what I did with pdfplumber to compare the bounding boxes and make a gist for you if that works. I can likely get around to that by the weekend.
hey @GrandathePanda I'm curious about your workflow for this too!
Considering that, for me, article_dict is a dictionary where I have saved the figures with their coordinates:
import io
import os
from typing import Dict

import fitz  # PyMuPDF
from PIL import Image


def parse_figures(
    pdf_path: str,
    article_dict: Dict,
    output_folder: str = "figures",
):
    # Create the output folder if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    figures = article_dict['figures']
    for figure in figures:
        # GROBID coordinates: page number (1-based), then x, y, width, height
        page_num, x, y, w, h = map(float, figure['figure_coordinates'].split(','))
        fig_id = figure['figure_id']
        # Select the page (PyMuPDF pages are 0-based)
        page = pdf_document.load_page(int(page_num) - 1)
        # Render only the figure area as an image
        # TODO: fix the coordinates of the figure
        pix = page.get_pixmap(clip=fitz.Rect(x, y, x + w, y + h))
        # Save the rendered image to a bytes object
        img_bytes = pix.tobytes("png")
        # Open the image with PIL for further manipulation or direct saving
        image = Image.open(io.BytesIO(img_bytes))
        image.save(f'{output_folder}/{fig_id}.png')  # Save the image
    pdf_document.close()
My basic problem is that I want to use another library to extract the images (figures, tables, etc) from my pdf using the coordinates produced by GROBID in the xml TEI encoding.
The issue I am running into is that it seems between GROBID and the library (https://github.com/jsvine/pdfplumber/) the coordinate systems are scaled differently(?) and I want to figure out how to properly convert between the systems so that I can properly extract the images, so I was wondering how GROBID is currently calculating its coordinates.
(I would also welcome suggestions for an alternative solution for what I am trying to accomplish.)
Thank you in advance and using GROBID has been great so far.