Jdogzz closed this issue 2 years ago
Please make yourself acquainted with how MuPDF geometry works. The major difference is that PDF (and hence most tools on the market) count page coordinates from the page's bottom-left point upwards. MuPDF, and hence PyMuPDF, do not do this: point (0,0) is the top-left point. The only exception is the MediaBox. Based on this, all the values reported by PyMuPDF are correct.
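As a rough illustration of the two conventions, here is a minimal plain-Python sketch (no PyMuPDF required). The helper name `flip_y` and the sample numbers are hypothetical. It assumes the only difference between the two systems is the direction of the y-axis, mirrored against the upper y value of the MediaBox.

```python
def flip_y(rect, mediabox_y1):
    """Convert (x0, y0, x1, y1) between bottom-left and top-left origin conventions.

    Hypothetical helper: y is mirrored against mediabox_y1, x is left untouched.
    """
    x0, y0, x1, y1 = rect
    return (x0, mediabox_y1 - y1, x1, mediabox_y1 - y0)


# Made-up rectangle in PDF (bottom-left origin) coordinates on a 792-point-high page.
pdf_rect = (0.0, 100.0, 612.0, 700.0)
top_left_rect = flip_y(pdf_rect, 792.0)

print(top_left_rect)                              # (0.0, 92.0, 612.0, 692.0)
print(flip_y(top_left_rect, 792.0) == pdf_rect)   # True: the flip is its own inverse
```

Because the flip is symmetric, the same helper converts in both directions.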
@JorjMcKie Hi, thanks for the very quick response and the pointers. I have now read the related documentation, but I'm still having an issue and would appreciate your help sorting it out (please let me know if this belongs in a separate issue). In short, I am running into errors with page.set_cropbox because one of the checks inside it fails, possibly due to lossy decimal values rather than incorrect coordinates as I had previously thought.
According to the source code of page.set_cropbox, it takes the provided rectangle (in my case I pass page.CropBox as the argument rect) and applies this transformation, which matches my understanding of your description of the two coordinate systems:
rect = Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1])
For my example PDF in the initial post, page.cropbox is Rect(17.0, 0.0, 521.0, 648.0) (PyMuPDF coordinates) and page.mediabox is Rect(17.0, 17.049999237060547, 521.0, 665.0499877929688) (PDF coordinates). After the transformation, the new value of rect is:
Rect(17.0, 17.04998779296875, 521.0, 665.0499877929688)
After this, it checks whether the new rect lies inside the mediabox. The check returns false because the new rect extends slightly beyond the mediabox in y: its y0 of 17.04998779296875 is just below the mediabox's y0 of 17.049999237060547. I have examined the source code of the PDF I included in the initial post, and it gives identical values (in PDF coordinates) for the cropbox and mediabox:
/CropBox [
0
17.05
504
665.05
]
/MediaBox [
0
17.05
504
665.05
]
so presumably somewhere along the way the exact decimal values are lost in processing and the lossy values then cause the check to fail.
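One way to reproduce these exact numbers, offered as an assumption rather than a confirmed diagnosis, is to round the values from the PDF source to single precision and then apply the y-part of the transformation quoted above. The sketch below uses only the Python standard library. The helper name `to_float32` is hypothetical, and nothing here calls PyMuPDF.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# MediaBox y-values as written in the PDF source, after single-precision rounding
# (assumption: the values pass through 32-bit floats somewhere in processing).
mb_y0 = to_float32(17.05)
mb_y1 = to_float32(665.05)

# CropBox y-values in top-left-origin coordinates, as reported in this post.
crop_y0, crop_y1 = 0.0, 648.0

# y-part of the transformation: Rect(rect[0], mb.y1 - rect[3], rect[2], mb.y1 - rect[1])
new_y0 = mb_y1 - crop_y1
new_y1 = mb_y1 - crop_y0

print(mb_y0, mb_y1)      # 17.049999237060547 665.0499877929688
print(new_y0, new_y1)    # 17.04998779296875 665.0499877929688
print(new_y0 >= mb_y0)   # False: the new rect starts slightly below the mediabox
```

Under this assumption, the subtraction inherits the coarser rounding step of the larger value 665.05, which would explain why the result lands just below the rounded 17.05.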
Running the same analysis on another example PDF, whose mediabox has integer values in the PDF source code, shows that both the transformation and the check succeed: 2002.01247.pdf
Python 3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fitz
>>> doc=fitz.open("/home/mymailuser/Downloads/2002.01247.pdf")
>>> page=doc.load_page(1)
>>> page.set_cropbox(page.cropbox)
[the new rect outputs as Rect(0.0, 0.0, 612.0, 792.0) before the check which is successful]
>>> page.mediabox
Rect(0.0, 0.0, 612.0, 792.0)
>>> page.CropBox
Rect(0.0, 0.0, 612.0, 792.0)
I would appreciate any pointers for handling this situation.
You did find an issue!
Please open a new one - choose a title like "incorrect handling of non-integer PDF rectangle coordinates".
The error goes back to the base library MuPDF: the mediabox value (returned directly from MuPDF computations) is already incorrect. Your example gives Rect(17.0, 17.049999237060547, 521.0, 665.0499877929688) instead of Rect(17.0, 17.05, 521, 665.05).
Please provide all mandatory information!
Describe the bug (mandatory)
For certain PDFs I seem to have found a discrepancy between the mediabox and cropbox in PyMuPDF, and this discrepancy is not shown in other tools. Here is an example of a PDF producing this discrepancy:
mwb_E_202201.pdf
Using the tool pdfinfo on linux I get the following result showing identical mediabox and cropbox output:
Using Adobe Acrobat DC I get the following which perfectly matches the pdfinfo output:
To Reproduce (mandatory)
Minimal set of steps printing out the mediabox and cropbox for the above PDF:
Expected behavior (optional)
The mediabox and cropbox should be reported as identical, and match what other tools like pdfinfo and Adobe Acrobat DC report.
Your configuration (mandatory)