pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Different scaling of added content after merging pages #3284

Closed S1lander closed 5 months ago

S1lander commented 5 months ago

Description of the bug

I am currently developing an app for a client. Part of its functionality is to merge several PDFs together and then add multiple textboxes to all pages (same text, same location). The text was always added correctly, but recently the client sent me some new PDF files to test on, and now the added text has a different scaling on some pages that came from one of these new files.

When I compare the metadata of the files that have the correct scaling with those that don't, I get these differences:

Difference in /Creator: PDF1 -> AutoCAD 2024 - English 2024 (24.3s (LMS Tech)), PDF2 -> AutoCAD LT 2024 - English 2024 (24.3s (LMS Tech))
Difference in /CreationDate: PDF1 -> D:20240125162128Z, PDF2 -> D:20240201133816Z
Difference in /ModDate: PDF1 -> D:20240220144906-06'00', PDF2 -> D:20240201143940-06'00'
Difference in /Title: PDF1 -> S-4.1 Framing Sections & Details, PDF2 -> B LFE - Blabla 5-5 - Exp C - S-3.0
Difference in /Producer: PDF1 -> pdfplot16.hdi 16.03.061.00000, PDF2 -> pdfplot16.hdi 16.03.152.00000

Both files have the same page size.
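A comparison like the one above can be produced with a small helper. This is only a sketch: `diff_metadata` is a hypothetical name, and the sample dictionaries simply reuse the /Creator and /Producer values quoted in this report. With PyMuPDF, the input dictionaries would presumably come from `Document.metadata`, which uses lowercase keys such as `creator` and `producer` rather than the `/Creator` style shown here.

```python
def diff_metadata(meta_a, meta_b):
    """Return {key: (value_a, value_b)} for every key whose values differ.

    Keys missing from one side are reported with None for that side.
    """
    keys = sorted(set(meta_a) | set(meta_b))
    return {
        key: (meta_a.get(key), meta_b.get(key))
        for key in keys
        if meta_a.get(key) != meta_b.get(key)
    }


# Sample values copied from the report above (not read from real files).
pdf1 = {
    "/Creator": "AutoCAD 2024 - English 2024 (24.3s (LMS Tech))",
    "/Producer": "pdfplot16.hdi 16.03.061.00000",
}
pdf2 = {
    "/Creator": "AutoCAD LT 2024 - English 2024 (24.3s (LMS Tech))",
    "/Producer": "pdfplot16.hdi 16.03.152.00000",
}

for key, (a, b) in diff_metadata(pdf1, pdf2).items():
    print(f"Difference in {key}: PDF1 -> {a}, PDF2 -> {b}")
```

Metadata differences of this kind only hint at which export path produced a file; they do not by themselves explain a change in scaling.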

The main difference, from what I can tell, is that AutoCAD was used in one case and AutoCAD LT in the other. Obviously I don't want the client to have to take that into account when exporting the PDFs. Reportlab does not have this problem, but I'd prefer to use fitz, because reportlab is way slower than fitz. I haven't gone into the code of fitz yet, and thought maybe someone knows why it might scale added content differently based on PDF properties.

Would love to get this sorted! I can provide some of the code if needed :)

How to reproduce the bug

  1. Merge 2 different pdf files from different sources (AutoCAD / AutoCAD LT) using PyPDF2
  2. Add text using fitz's .insert_textbox()
  3. Export PDF

(The order doesn't matter. You can also merge and export the file first, and then add the text and it will still have the wrong scaling.)

Problem: The scaling of the content will be different for some of the pages based on which PDF the page originated from.

Correct: Screenshot 2024-03-20 at 11 52 13

Wrong: Screenshot 2024-03-20 at 11 52 34

You can see a basic table structure on all pages, and on some of the pages the added text content is uniformly scaled down. Origin for the scaling seems to be the bottom right corner of the page.

Expected: The added text should have the same scaling on all of the pages. If I use reportlab to then take that merged and exported pdf, and add even more text to all the pages, it does so using the same scaling on all pages. When I do the same thing using fitz again, it will still have the wrong scaling.

PyMuPDF version

1.23.25

Operating system

MacOS

Python version

3.10

JorjMcKie commented 5 months ago

This looks like using PyMuPDF in a production / commercial environment. Either you or your client should probably own a commercial license - please confirm the license situation with Artifex.

On a technical level, it is hard to understand why PyPDF2 is being used for merging files, vis-a-vis also employing PyMuPDF, which is orders of magnitude faster here.

S1lander commented 5 months ago

Thanks for your comments. All good points. I actually just noticed that I mistyped; I am using fitz to merge the files. Do you have any idea how I could get the scaling to be uniform?