pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Discrepancy with mediabox and cropbox from PDFs #1615

Closed Jdogzz closed 2 years ago

Jdogzz commented 2 years ago

Describe the bug (mandatory)

For certain PDFs I seem to have found a discrepancy between the mediabox and cropbox in pymupdf, and this discrepancy is not shown in other tools. Here is an example of a PDF producing this discrepancy:

mwb_E_202201.pdf

Using the tool pdfinfo on linux I get the following result showing identical mediabox and cropbox output:

$ pdfinfo -box mwb_E_202201.pdf
Title:          mwb22.01-E
Author:         Christian Congregation of Jehovah's Witnesses
Producer:       Acrobat Distiller 15.0 (Windows); modified using iTextSharp™ 5.5.3 ©2000-2014 iText Group NV (AGPL-version)
CreationDate:   Mon Dec 13 12:52:31 2021 PST
ModDate:        Mon Dec 13 12:53:01 2021 PST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          16
Encrypted:      no
Page size:      504 x 648 pts
Page rot:       0
MediaBox:           0.00    17.05   504.00   665.05
CropBox:            0.00    17.05   504.00   665.05
BleedBox:           0.00    17.05   504.00   665.05
TrimBox:            0.00    17.05   504.00   665.05
ArtBox:             0.00    17.05   504.00   665.05
File size:      1951513 bytes
Optimized:      yes
PDF version:    1.4

Using Adobe Acrobat DC I get the following, which perfectly matches the pdfinfo output: [screenshot: 2022-02-25_07-53]

To Reproduce (mandatory)

Minimal set of steps printing out the mediabox and cropbox for the above PDF:

Python 3.7.3 (default, Jan 22 2021, 20:04:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> fitz.__doc__
'\nPyMuPDF 1.19.6: Python bindings for the MuPDF 1.19.0 library.\nVersion date: 2022-03-01 00:00:01.\nBuilt for Python 3.7 on linux (64-bit).\n'
>>> doc=fitz.open("mwb_E_202201.pdf")
>>> page=doc.load_page(1)
>>> page.CropBox
Rect(17.0, 0.0, 521.0, 648.0)
>>> page.mediabox
Rect(17.0, 17.049999237060547, 521.0, 665.0499877929688)

Expected behavior (optional)

The mediabox and cropbox should be reported as identical, and match what other tools like pdfinfo and Adobe Acrobat DC report.

Your configuration (mandatory)

JorjMcKie commented 2 years ago

Please make yourself acquainted with how MuPDF geometry works. The major difference is that PDF (and hence most tools on the market) count page coordinates from the page's bottom-left point upwards. MuPDF, and hence PyMuPDF, do not do this: point (0,0) is the top-left point. The only exception is the MediaBox. Based on this, all the values reported by PyMuPDF are correct.
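To illustrate the difference between the two coordinate systems, here is a minimal sketch that uses plain tuples instead of PyMuPDF objects. The helper name `flip_y` is hypothetical, and the sketch assumes an unrotated page where converting between the two systems is a simple vertical flip against the MediaBox's upper y value.

```python
def flip_y(rect, mediabox_y1):
    """Flip a rectangle vertically between bottom-left-origin (PDF)
    and top-left-origin (MuPDF) coordinates.

    Hypothetical helper for illustration only; assumes an unrotated page.
    """
    x0, y0, x1, y1 = rect
    # The y values swap roles: the old bottom edge becomes the new top edge.
    return (x0, mediabox_y1 - y1, x1, mediabox_y1 - y0)


# Values taken from the report above (exact decimals, not MuPDF's floats).
mediabox_y1 = 665.05
cropbox_top_left = (17.0, 0.0, 521.0, 648.0)  # top-left origin, as PyMuPDF reports it

cropbox_pdf = flip_y(cropbox_top_left, mediabox_y1)  # bottom-left origin
round_trip = flip_y(cropbox_pdf, mediabox_y1)        # back to top-left origin

print(cropbox_pdf)
print(round_trip)
```

Applying the flip twice returns the original rectangle (up to floating-point rounding), so the same formula works in both directions.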

Jdogzz commented 2 years ago

@JorjMcKie Hi, thanks for the very quick response and the pointers. I have now read the related documentation, but I still have an issue that I would appreciate your help with (please let me know if this belongs in a separate issue). In short, I am running into errors with the page.set_cropbox method because one of the checks inside it fails, possibly due to lossy decimal values (rather than incorrect coordinates, as I had previously thought).

According to its source code, page.set_cropbox takes the provided rectangle (in my case I am passing page.CropBox as the argument rect) and applies the following transformation, which matches my understanding of your description of the two coordinate systems:

rect = Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1])

For my example PDF in the initial post, it has the page.cropbox as Rect(17.0, 0.0, 521.0, 648.0) (PyMuPDF coordinates) and the page.mediabox as Rect(17.0, 17.049999237060547, 521.0, 665.0499877929688) (PDF coordinates). After the transformation it gives this as the new value of rect:

Rect(17.0, 17.04998779296875, 521.0, 665.0499877929688)

After this it checks if the new rect is in the mediabox, and this reports false as the y region is slightly larger for the new rect than the mediabox. I have examined the source code of the PDF I included in the initial post and it gives identical values (in PDF coordinates) for the cropbox and mediabox:

  /CropBox [
    0
    17.05
    504
    665.05
  ]
  /MediaBox [
    0
    17.05
    504
    665.05
  ]

so presumably somewhere along the way the exact decimal values are lost in processing and the lossy values then cause the check to fail.
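The numbers above are consistent with the box values having been stored as single-precision floats at some point. The following minimal sketch reproduces them with the standard library only. The float32 round trip is an assumption about where the precision is lost, and `to_float32` is a hypothetical helper, not a PyMuPDF function.

```python
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# MediaBox y values from the PDF source, rounded to single precision.
mb_y0 = to_float32(17.05)
mb_y1 = to_float32(665.05)
print(mb_y0, mb_y1)

# Cropbox in top-left-origin coordinates, as reported by page.cropbox.
crop_y0, crop_y1 = 0.0, 648.0

# Same flip as in set_cropbox: (mb.y1 - rect[3], mb.y1 - rect[1]).
new_y0 = mb_y1 - crop_y1
new_y1 = mb_y1 - crop_y0
print(new_y0, new_y1)

# The flipped lower edge ends up slightly below the mediabox's lower edge,
# so a strict containment test fails.
inside = mb_y0 <= new_y0 and new_y1 <= mb_y1
print(inside)
```

The page height of 648 is exact, but 17.05 and 665.05 round differently in single precision, so subtracting 648 from the rounded 665.05 does not land on the rounded 17.05.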

I repeated the analysis with another example PDF whose mediabox has integer values in the PDF source code. In that case both the transformation and the check succeed: 2002.01247.pdf

Python 3.7.3 (default, Jan 22 2021, 20:04:44) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> doc=fitz.open("/home/mymailuser/Downloads/2002.01247.pdf")
>>> page=doc.load_page(1)
>>> page.set_cropbox(page.cropbox)
[the new rect outputs as Rect(0.0, 0.0, 612.0, 792.0) before the check which is successful]
>>> page.mediabox
Rect(0.0, 0.0, 612.0, 792.0)
>>> page.CropBox
Rect(0.0, 0.0, 612.0, 792.0)

I would appreciate any pointers for handling this situation.

JorjMcKie commented 2 years ago

You did find an issue! Please open a new one and choose a title like "incorrect handling of non-integer PDF rectangle coordinates". The error goes back to the base library MuPDF: the mediabox value (returned directly from MuPDF computations) is already incorrect. Your example shows Rect(17.0, 17.049999237060547, 521.0, 665.0499877929688) instead of Rect(17.0, 17.05, 521, 665.05).