pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

when instance Rect, how use bottom left as corner? #279

Closed seven1122 closed 5 years ago

seven1122 commented 5 years ago

I am new to Python. I know that the example (rect = fitz.Rect(0, 0, 181.875, 189.375)) uses the upper left as the corner. But I don't know how to use the bottom left as the corner when I instantiate a Rect. Please tell me, thanks very much!

JorjMcKie commented 5 years ago

I guess I don't understand: what button do you mean?

seven1122 commented 5 years ago

I'm sorry for using the wrong word. I meant bottom left. But I think I now know how to position my image. Thank you! However, I have another problem: the image is flipped over and its size is far smaller than what I set. This only occurs in some PDF files, or on some pages of one PDF file. Is it related to the PDF file?

JorjMcKie commented 5 years ago

Ah I see, an English language issue :-) Never mind.

The corners of a rectangle rect can be accessed via rect.top_left, rect.top_right, rect.bottom_left, rect.bottom_right. There also exist the abbreviations tl, tr, bl, br (rect.tl, etc.).
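
For the original question, a minimal sketch of building rectangle coordinates from a bottom-left corner may help. It assumes PyMuPDF's convention that a Rect is given as (x0, y0, x1, y1) with y growing downward from the top of the page. The helper name rect_from_bottom_left is hypothetical and not part of PyMuPDF.

```python
def rect_from_bottom_left(x, y, width, height):
    """Return (x0, y0, x1, y1) for a rectangle whose bottom-left corner is (x, y).

    Assumes a top-left origin with y increasing downward, so the top edge
    has the smaller y value.
    """
    return (x, y - height, x + width, y)


# Same rectangle as in the question, described by its bottom-left corner.
coords = rect_from_bottom_left(0, 189.375, 181.875, 189.375)
print(coords)
```

The resulting tuple could then be passed on, for example as fitz.Rect(*coords).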

Your new question is again not clear to me. What are you trying to do, and what did go wrong? Are you trying to insert something on an existing PDF page and it does not appear as expected, or what?

seven1122 commented 5 years ago

Yes, I am trying to insert an image into an existing PDF file, but it does not appear as expected on some pages of one PDF file, or on all pages of one PDF file. The following picture shows what happened on two different pages of one PDF file. [image]

JorjMcKie commented 5 years ago

but sometimes it works fine?

This looks like a problem that many others have reported, too. For an explanation, you need some background on PDF theory:

Like everything else, a page in PDF is an "object" described by a text string in a special format. For example this one:

>>> page = doc[0]  # read first doc page
>>> xref = page.xref  # get object number of the page
>>> print(doc._getXrefString(xref))  # print the page definition
<</Contents 40 0 R/Type/Page/MediaBox[0 0 595.32 841.92]/Rotate 0/Parent 12 0 R
/Resources<</ExtGState<</R7 26 0 R>>
/Font<</R8 27 0 R/R10 21 0 R/R12 24 0 R/R14 15 0 R/R17 4 0 R/R20 30 0 R/R23 7 0 R/R27 20 0 R>>
/ProcSet[/PDF/Text]>>/Annots[55 0 R]>>

Special keywords in this definition point to more information required to compose the page for display. An important example for your problem is /Contents. This object type describes which things should appear where and how (text, images, etc.). In our example, the contents object has number 40 (xref 40 = cross reference number). Using PyMuPDF, we can access this object like this:

>>> print(doc._getXrefString(40))
<</Length 7107/Filter/FlateDecode>>
>>> 

This is not much information (7107 bytes compressed via method FlateDecode). But object 40 is followed by a so-called stream, which contains (compressed) data. These data are formulated in a special mini-language (very similar to PostScript), which describes the page layout. Using PyMuPDF again, we can access those data like this:

>>> contents = doc._getXrefStream(40)  # read the stream of object 40
>>> len(contents)  # the method decompressed it for us:
30950
>>> contents[:30]  # show first 30 bytes of it
b'q\n.1 0 0 .1 0 0 cm\n/R7 gs\nq\nBT'
>>> contents[-30:]  # show last 30 bytes of it
b'9 4.80078 4.80078 re\n0 g\nf*\nQ\n'
>>> # note that contents is a bytes object in Python
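
As a side note on the /Length and /Filter entries: FlateDecode is zlib/deflate compression, so the relationship between the stored and the decompressed stream can be sketched with Python's standard zlib module. The sample stream below is made up for illustration and is not taken from the PDF discussed here.

```python
import zlib

# Hypothetical page description, repeated so that compression pays off.
raw = b"q\n.1 0 0 .1 0 0 cm\nBT\nET\nQ\n" * 100

stored = zlib.compress(raw)         # the kind of data a /FlateDecode stream holds
restored = zlib.decompress(stored)  # what a PDF reader actually interprets

# The stored size corresponds to /Length, the restored size to len(contents).
print(len(stored), len(restored))
```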

What matters for your problem is that contents starts with a q and ends with a Q. Both are commands in the mini-language. They stack (q) and restore (Q) the so-called graphics context. Their effect is to keep "local" all commands in between that change the geometry of the page. In our example, the q command is followed by a matrix, written .1 0 0 .1 0 0 cm in this syntax. This matrix scales down everything that follows by a factor of 10 (see the values .1).
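
The effect of such a cm matrix can be sketched in a few lines. This assumes the usual PDF convention that the six numbers a b c d e f map a point (x, y) to (a*x + c*y + e, b*x + d*y + f). The function name is only illustrative.

```python
def apply_cm(matrix, point):
    """Transform a point with a PDF-style matrix given as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = matrix
    x, y = point
    return (a * x + c * y + e, b * x + d * y + f)


shrink = (0.1, 0, 0, 0.1, 0, 0)  # the ".1 0 0 .1 0 0 cm" from the example
result = apply_cm(shrink, (100, 200))
print(result)  # both coordinates shrink to roughly one tenth
```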

If you now insert something (like text or a rectangle) on this page, it would appear 10 times smaller than you expected ... if the existing contents were not encapsulated by q ... Q.
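
To make this concrete, here is a toy simulation, not a real PDF interpreter. It only understands q, Q and uniformly scaling cm commands, and reports the scale factor that is still in force once the stream ends, which is what anything appended afterwards would inherit. Both sample streams are invented.

```python
def leftover_scale(stream: bytes) -> float:
    """Return the scale factor still active after the last command of the stream."""
    tokens = stream.split()
    scale = 1.0
    saved = []  # stack of scales pushed by q
    for i, tok in enumerate(tokens):
        if tok == b"q":
            saved.append(scale)
        elif tok == b"Q" and saved:
            scale = saved.pop()
        elif tok == b"cm" and i >= 6:
            # first of the six operands is the x-scale (uniform scaling assumed)
            scale *= float(tokens[i - 6].decode("ascii"))
    return scale


wrapped = b"q .1 0 0 .1 0 0 cm BT ET Q"
unwrapped = b".1 0 0 .1 0 0 cm BT ET"
print(leftover_scale(wrapped), leftover_scale(unwrapped))
```

In the wrapped case the scale is back to its initial value at the end, while in the unwrapped case the downscaling remains active for whatever comes next.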

I bet that in your cases the contents stream is not properly encapsulated by q ... Q like this. So your insertions are under the control of whatever geometry changes (scaling, rotations, ...) already exist. Try my commands from above to see if this is true and come back to discuss solutions.
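
A rough way to run this check on the bytes of a contents stream is sketched below. It is a naive heuristic under the assumption that q and Q only occur as standalone operators (string literals or inline image data containing these letters would fool it), and the function name is hypothetical.

```python
def looks_wrapped(stream: bytes) -> bool:
    """Heuristically test whether a contents stream is enclosed by one q ... Q pair."""
    tokens = stream.split()
    if len(tokens) < 2 or tokens[0] != b"q" or tokens[-1] != b"Q":
        return False
    depth = 0
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        if tok == b"q":
            depth += 1
        elif tok == b"Q":
            depth -= 1
            # the outermost pair must not close before the final token
            if depth == 0 and i != last:
                return False
    return depth == 0


# Invented sample streams, not the ones from the files in this thread.
good = b"q\n.1 0 0 .1 0 0 cm\n/R7 gs\nq\nBT ET\nQ\nQ\n"
bad = b"/GSa gs /CSp cs /CSp CS q 1 0 0 1 0 0 cm Q Q\n"
print(looks_wrapped(good), looks_wrapped(bad))
```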

seven1122 commented 5 years ago

After trying, I guess you are right. But there is a small difference from what you said: it does not start with q (e.g., '/GSa gs /CSp cs /CSp CS 0.0600'), but it really does end with Q (e.g., ' 0 SCN\n0 w 2 J 2 j [] 0 d\nQ Q\n'). Is there any solution to this problem? Thank you very much.

JorjMcKie commented 5 years ago

The simplest way is to perform a "clean" of the page before you insert anything. This is a PyMuPDF method which hopefully works. Unfortunately, the underlying MuPDF function is not reliable and may lead to an empty page. If that happens, let's try something else, but let's hope for the best and do this:

page = doc[n]  # read the desired page
page._cleanContents()  # perform a page clean
# now do your inserts

Another option, using the same MuPDF function, is pre-processing your PDF and then using its output for your script. This needs the command line tool from MuPDF, which can be downloaded for Windows, or compiled from MuPDF source:

mutool clean -sc input.pdf output.pdf

Come back if none of this works. We can also use other PyMuPDF logic to wrap the existing contents with q ... Q.

JorjMcKie commented 5 years ago

It may be that a contents stream looks like your case:

/GSa gs /CSp cs /CSp CS ... q ... Q

But the part before the "q" becomes active again after the "Q", so you would again have the original geometry re-established, which causes your problem. What is needed is a wrapping like so:

q /GSa gs /CSp cs /CSp CS ... q ... Q Q
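
On the level of raw bytes, the wrapping described here amounts to very little. The sketch below only builds the new stream; writing it back into the PDF is not shown, and the function name is hypothetical.

```python
def wrap_stream(stream: bytes) -> bytes:
    """Enclose a contents stream in a q ... Q pair."""
    return b"q\n" + stream + b"\nQ\n"


original = b"/GSa gs /CSp cs /CSp CS q 1 0 0 1 0 0 cm Q"  # invented sample
wrapped = wrap_stream(original)
print(wrapped)
```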

seven1122 commented 5 years ago

@JorjMcKie Thank you for your help. I guess I have solved my problem by following your instructions.

BGEray commented 5 years ago

Hi @JorjMcKie: I encountered the same problem of an inserted image having the wrong size. Then I found your reply in this issue, so I used "page._cleanContents()" in my code. The size problem was solved, but a few other problems came up:

1. When I open the new file in the Microsoft Edge browser, it has no content and only shows the annotations' frames. Adobe Acrobat Reader DC reports an error and shows the same thing. Other browsers like Chrome and Firefox open the new file correctly. I also tried "mutool clean -sc input.pdf output.pdf", but the new file has the same problem. My original PDF file was exported by AutoCAD from a DWG file, so it has lots of useless annotations. Why do the annotation frames appear unprompted, and is there a solution that solves both the size problem and the Edge/Adobe Reader problem? Please give me some advice.

2. The new file is much larger than the original one. How can I compress it? Here are my original PDF file and the newly created one. Original.pdf after_clean_and_insert.pdf

JorjMcKie commented 5 years ago

@BGEray - I know what it is:

_cleanContents() not only cleans the contents of the page, but also those of any annotations. And unfortunately, it does not always do this job bug-free.

For this reason (and a few more) I have developed a solution explained here. Choose solution 3 explained there. It will wrap the page's contents (only) with the missing stacking / unstacking commands. It does solve your problem, also in terms of file size and correct display in several viewers, including Adobe, Foxit, and browsers.

The mentioned "homegrown" solution will be available as one new compact method _wrapContents() in the next (1.16.0) version. Part of that version will also be a small utility method which checks beforehand whether such wrapping may be required and thus help prevent running into such issues.

Another comment: The whole issue only occurs when items are inserted in the foreground, on top of the stuff already there. It does not happen if you use overlay=False (which doesn't always make sense, of course).