pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Can't extract image from pdf #192

Closed manish59 closed 6 years ago

manish59 commented 6 years ago

import fitz

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

I used this code to extract images from PDFs. It worked for some PDFs but not for others, where I got errors like "'NoneType' object has no attribute 'n'" and "DeviceCMYK not supported for png". Please let me know how to fix this and extract the images from the PDF.

JorjMcKie commented 6 years ago

Ok, let me sort this out ...

If you get 'NoneType' object has no attribute 'n', then the pixmap has no colorspace - it is either an "SMask" (for transparency data) of another pixmap, or it is a b/w pixmap for things like fax images.

It seems you are using an image extraction script from an earlier version of PyMuPDF. Recent versions make a distinction between pixmaps that have an alpha channel and those that do not. Since that change, the attribute pix.n can no longer safely be interpreted on its own: you must also check for the presence of alpha.

The general formula is pixmap.n = pixmap.colorspace.n + pixmap.alpha, provided the pixmap has a colorspace at all. If the colorspace is None, then pixmap.samples consists of alpha bytes only: the pixmap is nothing but the alpha channel separated from another pixmap in the PDF. With PyMuPDF, you can also convert such extreme cases to "normal" pixmaps - please see the Pixmap chapter of the documentation.
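The rule above can be turned into a small decision helper. The following is only a sketch under stated assumptions, not code from the PyMuPDF demo directory: `classify_pixmap` is a hypothetical name, and it relies solely on the `colorspace`, `n` and `alpha` attributes mentioned in this comment, so plain stand-in objects are used instead of real pixmaps.

```python
from types import SimpleNamespace


def classify_pixmap(pix):
    """Decide how to save a pixmap, using n = colorspace.n + alpha.

    Returns one of:
      "mask"    - no colorspace (e.g. an SMask): skip it or convert it separately
      "direct"  - GRAY or RGB, with or without alpha: can be written as PNG
      "convert" - CMYK or similar: convert to RGB before writing a PNG
    """
    if pix.colorspace is None:
        return "mask"
    color_components = pix.n - pix.alpha  # strip the alpha byte, if present
    return "direct" if color_components < 4 else "convert"


# Stand-in objects (not real fitz.Pixmap instances) to show the three cases.
rgb_alpha = SimpleNamespace(colorspace="DeviceRGB", n=4, alpha=1)
cmyk = SimpleNamespace(colorspace="DeviceCMYK", n=4, alpha=0)
smask = SimpleNamespace(colorspace=None, n=1, alpha=1)

for name, pix in [("rgb_alpha", rgb_alpha), ("cmyk", cmyk), ("smask", smask)]:
    print(name, classify_pixmap(pix))
```

Note that RGB with alpha and CMYK without alpha both have n = 4, which is why a test on pix.n alone (such as pix.n < 5) cannot tell them apart.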

For extracting images from documents covering all the various possible situations, there are now 4 different scripts in the demo directory:

JorjMcKie commented 6 years ago

Could I help you? Any more info needed?

JorjMcKie commented 6 years ago

I should have added a remark about pixmaps without a colorspace: not all PDFs contain images with transparency, or b/w-only images. The above problem occurs only in such cases.

Whenever you encounter a pixmap with colorspace "None", you can either just skip it (which is safe for those that are "SMasks" = transparency info), or you can convert it to a pixmap with valid colorspace via

>>> mask                       # SMasks pixmaps look like this:
fitz.Pixmap(None, fitz.IRect(0, 0, 1168, 823), 1)
>>> pix = fitz.Pixmap(mask.getPNGData()) # convert it to "normal"
>>> pix
fitz.Pixmap(DeviceGRAY, fitz.IRect(0, 0, 1168, 823), 0)
>>> # if required, invert the gray values
>>> pix.invertIRect()

manish59 commented 6 years ago

Your suggestion worked and I extracted the images successfully. Thanks for the help. I have another problem: the extracted image comes out upside down or in landscape, although it looks normal in the actual PDF. When I checked doc.rotation, it gave me 0 degrees. Do you have any suggestions for fixing this, or for identifying the orientation of the page?

JorjMcKie commented 6 years ago

For a PDF, you can use Page.rotation. Should give you some integer multiple of 90°. I am not sure about your overall intention - but maybe an alternative approach for extracting images will help you:

If you are not actually interested in knowing which image appears on which page, you can scan directly through all objects of the PDF. This ignores the pages (and consequently does not need the page tree). If an object is an image, extract it, and:

This is more or less what extract-img2.py does. This script has the following advantages:

manish59 commented 6 years ago

I'm sorry if I didn't explain it correctly. For example, I have a scanned PDF which contains 4 images as 4 pages. When I extract the images, their orientation is wrong, and I don't know why. Can you help me identify the rotation of each page, not of the PDF as a whole?

JorjMcKie commented 6 years ago

Sure I am willing to. Can you send me a problem example - maybe directly via my e-mail. I will have a look at it.

Anyway, if you scanned something like 4 images to a PDF which then has 4 pages, the internal PDF structure is still that you have 4 normal PDF pages, each filled completely by an image. Each of those 4 images is embedded in the PDF, and you can extract them in several different ways, as I explained above.

If you determine that the images are not rotated as you expected or would have liked to, I can think of 2 reasons:

  1. (trivial) you placed the images on the scanner in an awkward way :-)
  2. your scanner software made its own decisions about what would be an adequate orientation.

I assume it is reason 2. I have similar experiences with printer-scanner combi hardware from HP and Epson. If you cannot control the scanner software's behaviour, you have not many options left. The PDF, and also any viewer software (like MuPDF / PyMuPDF) cannot detect this. The only way is to manually decide later what to do with the image.

But again - do send me something, maybe I get another idea when looking at it.

manish59 commented 6 years ago

In the meantime, can you tell me how to increase the resolution of the extracted image? I tried using page.getPixmap() and pix.writePNG(), but the resulting image has very low resolution. Is there any way to increase the resolution? I'm attaching a sample.

JorjMcKie commented 6 years ago

Use a matrix for zooming like so.

zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)

The resulting PNG in this case has 4 times as many pixels per unit area (2 x 2).
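To pick a zoom factor for a desired resolution, a rough sketch follows. It assumes that an unzoomed render corresponds to 72 dpi, i.e. one pixel per PDF point (1/72 inch); `zoom_for_dpi` and `rendered_size` are hypothetical helper names, and the actual pixmap size may differ by a pixel because of rounding.

```python
PDF_POINTS_PER_INCH = 72  # PDF page dimensions are measured in points (1/72 inch)


def zoom_for_dpi(target_dpi):
    """Zoom factor that renders a page at roughly `target_dpi`."""
    return target_dpi / PDF_POINTS_PER_INCH


def rendered_size(page_width_pt, page_height_pt, zoom):
    """Approximate pixel dimensions of the rendered page for a zoom factor."""
    return round(page_width_pt * zoom), round(page_height_pt * zoom)


zoom = zoom_for_dpi(144)              # same as zoom = 2 above
print(zoom)                           # 2.0
print(rendered_size(612, 792, zoom))  # US Letter page: (1224, 1584)
```

The resulting value can be passed to fitz.Matrix(zoom, zoom) exactly as in the snippet above.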

manish59 commented 6 years ago

Can you give me your email address so I can send you a sample?

JorjMcKie commented 6 years ago

just use the one given in the home page jorj.x.mckie@outlook.com

JorjMcKie commented 6 years ago

sorry: outlook.de (I'm German)

JorjMcKie commented 6 years ago

you haven't sent me anything, so I assume your issue is fixed. Come back otherwise.

manish59 commented 6 years ago

Yeah, the issue is fixed. If you are the author of the library, please add a sample that shows how to extract an image at a higher resolution. That would be great.

JorjMcKie commented 6 years ago

I am preparing a documentation update which will contain an extra chapter explaining typical tasks. I can see that the tutorial already contained therein is not always sufficient.

On the other hand, most people don't take the time to read the complete manual (now more than 150 pages) - so something in between would be beneficial ...

manish59 commented 6 years ago

Hey JorjMcKie, can we detect the color of text in a PDF? I need to recognise the color of the text to differentiate between hyperlinks and non-hyperlinks.

JorjMcKie commented 6 years ago

Hey @manish59, Sorry there is no way to do that (that I know of).

But here is what you could do:

  1. there may be hyperlinks whose "hot area" is not encircling text (but an image instead)
  2. there is no convention whatsoever saying that text in hyperlinks has to have a certain color - it could be no different from the other text around it. This is the case e.g. for the hyperlinks in the Adobe manual.
  3. a list of all hyperlinks on a page is provided by page.getLinks(). Each item in that list (a dict) contains the hot area in key "from", a fitz.Rect. You could check for text within that rectangle, e.g. by using a script like this one.

For example:

>>> import fitz
>>> doc = fitz.open("pymupdf.pdf")
>>> page = doc[8]
>>> lnks = page.getLinks()
>>> lnks[0]             # first link on that page
{'kind': 2, 'xref': 639, 'from': fitz.Rect(220.30599975585938, 320.5679931640625, 253.1300048828125, 332.74200439453125), 'uri': 'http://www.mupdf.com/'}
>>> words = page.getTextWords()  # list of all words on that page
>>> rect = lnks[0]["from"]     # rect of the link
>>> for w in words:              # browse through the words
    wrect = fitz.Rect(w[:4])   # make a rect from the word's bbox coords
    if wrect.intersects(rect):  # check if word at least intersects link rect
        print(w[4])                 # print word if yes

MuPDF1
>>> 
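The check from the session above can be wrapped in a small reusable helper. This is a minimal sketch using plain tuples as stand-ins for fitz.Rect and for the word entries; `rects_intersect` and `words_in_rect` are hypothetical names, and the sample coordinates are made up for illustration.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_in_rect(words, rect):
    """Return the text of every word whose bbox intersects `rect`.

    `words` uses the layout from the session above: the first four items
    of each entry are the bbox, the fifth is the word itself.
    """
    return [w[4] for w in words if rects_intersect(w[:4], rect)]


# Made-up sample data standing in for page.getTextWords() and lnks[0]["from"].
words = [
    (100.0, 320.0, 140.0, 333.0, "Visit"),
    (221.0, 321.0, 253.0, 332.0, "MuPDF1"),
    (260.0, 321.0, 300.0, 332.0, "today"),
]
link_rect = (220.3, 320.5, 253.1, 332.7)

print(words_in_rect(words, link_rect))  # ['MuPDF1']
```

Calling the helper once per link gives the text covered by each hot area, which can then be compared with the surrounding words.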

This is an image of the respective part of the page:

[image: grafik]

Hope that helps?

JorjMcKie commented 6 years ago

Counterexample on page 28 of the Adobe PDF manual (the link shows no visual difference):

>>> doc = fitz.open("Adobe PDF Reference 1-7.pdf")
>>> page = doc[27]
>>> lnks = page.getLinks()
>>> lnks[0]
{'kind': 1, 'xref': 4595, 'from': fitz.Rect(210.83999633789062, 120.75799560546875, 264.6600036621094, 134.0780029296875), 'page': 1150, 'to': fitz.Point(196.0, 594.0), 'zoom': 0.0}
#....
>>> for w in words:
    wrect = fitz.Rect(w[:4])
    if wrect.intersects(rect): print(w[4])

Bibliography
>>> 

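[image: grafik] https://user-images.githubusercontent.com/8290722/44314078-1c727680-a3e2-11e8-96b9-43fba85c82b1.png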

manish59 commented 6 years ago

Hey buddy, thanks for the info, but I was looking for a way to detect the color of text regardless of whether it is a link or not.

liamsuma commented 4 years ago

Use a matrix for zooming like so.

zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)

The resulting png in this case has 4 times more pixels per area

How would you determine zoom factor from a PDF file? Is it randomly assigned or is there a way to get the accurate number from PDF files using fitz? I read the manual but couldn't figure out a way to determine this zoom factor. Thanks for your help