Cant extract image from pdf

manish59 commented 6 years ago

I used the following code to extract images from a PDF:

import fitz

doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

It works for some PDFs but fails for others, with errors like:

'NoneType' object has no attribute 'n'
DeviceCMYK not supported for png

Please let me know how to fix this and extract the images from the PDF.

JorjMcKie commented 6 years ago

Ok, let me sort this out ...

If you get 'NoneType' object has no attribute 'n', then the pixmap has no colorspace - it is either an "SMask" (for transparency data) of another pixmap, or it is a b/w pixmap for things like fax images.

It seems you are using an image extraction script from an earlier version of PyMuPDF. Recent versions distinguish between pixmaps that have an alpha channel and those that do not. Since that change, the attribute pix.n can no longer be safely interpreted on its own: you must also check whether alpha is present.

The general formula is pixmap.n = pixmap.colorspace.n + alpha, provided the pixmap has a colorspace at all. If the colorspace is None, then pixmap.samples consists only of alpha bytes, so to speak: the pixmap is nothing but the alpha channel separated off from another pixmap in the PDF. With PyMuPDF, you can also convert such extreme cases to "normal" pixmaps; please see the Pixmap chapter of the documentation.
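As a rough illustration of the formula above, the following sketch classifies a pixmap from its n, alpha and colorspace attributes. The function name classify_pixmap and the labels it returns are made up for this example and are not part of PyMuPDF; the only assumption is an object exposing those three attributes.

```python
def classify_pixmap(pix):
    """Classify a pixmap-like object using n = colorspace.n + alpha.

    `pix` only needs the attributes `n`, `alpha` and `colorspace`
    (as a fitz.Pixmap has). The returned labels are illustrative.
    """
    if pix.colorspace is None:
        # No colorspace: the samples hold only alpha bytes (an SMask),
        # or it is a b/w image such as a fax scan.
        return "mask"
    color_components = pix.n - int(pix.alpha)  # undo the "+ alpha" part
    if color_components == 1:
        return "gray"
    if color_components == 3:
        return "rgb"
    if color_components == 4:
        return "cmyk"  # must be converted to RGB before saving as PNG
    return "other"
```

Note that n == 4 is ambiguous on its own: it can mean RGB with alpha or CMYK without alpha, which is why a test like pix.n < 5 is no longer sufficient.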

For extracting images from documents covering all the various possible situations, there are now 4 different scripts in the demo directory:

extract-img1.py - This demo extracts all images of a PDF as PNG files that are referenced by pages. It will check if the same image is referenced by multiple pages, if an image is just an "SMask", and also converts CMYK images to PNGs.
extract-img2.py - This demo extracts all images of a PDF as PNG files, whether they are referenced by pages or not. This is a fault tolerant version and works for many even heavily damaged PDFs. It also tries to avoid "trivial" pictures (e.g. things like blue rectangles), or pictures that are very small.
extract-img3.py - Similar to extract-img1.py, but it avoids using pixmaps where possible. This means that you will e.g. get JPEG images if the original image was of that format. Also automatically recognizes "SMask" images and avoids extracting them as standalone files.
extract-img4.py - In contrast to the other 3 scripts, this one works for all document types. It extracts images appearing on document pages and stores them with their original file extension where this is possible. In the general case, the same underlying image file can appear in different renderings on different pages, so recognition of duplicates will not safely work.

JorjMcKie commented 6 years ago

Could I help you? Any more info needed?

JorjMcKie commented 6 years ago

I should have added a remark about pixmaps without a colorspace: the problem above only occurs when a PDF contains images with a transparency property, or b/w-only images. Not all PDFs do.

Whenever you encounter a pixmap with colorspace "None", you can either just skip it (which is safe for those that are "SMasks" = transparency info), or you can convert it to a pixmap with valid colorspace via

>>> mask                       # SMasks pixmaps look like this:
fitz.Pixmap(None, fitz.IRect(0, 0, 1168, 823), 1)
>>> pix = fitz.Pixmap(mask.getPNGData()) # convert it to "normal"
>>> pix
fitz.Pixmap(DeviceGRAY, fitz.IRect(0, 0, 1168, 823), 0)
>>> # if required, invert the gray values
>>> pix.invertIRect()
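Putting the advice above together, the original extraction loop could be adapted roughly as follows. This is a sketch, not one of the demo scripts: the names png_name, extract_images and keep_masks are hypothetical, and the PyMuPDF calls are the ones used in this thread (getPageImageList, getPNGData, writePNG). Later PyMuPDF releases renamed these methods (get_page_images, tobytes, save), so the names may need adjusting for a current installation.

```python
import os


def png_name(out_dir, page_no, xref):
    """File name in the same "p<page>-<xref>.png" scheme as the original script."""
    return os.path.join(out_dir, "p%s-%s.png" % (page_no, xref))


def extract_images(pdf_path, out_dir=".", keep_masks=False):
    """Write the images referenced by the pages of a PDF as PNG files."""
    import fitz  # PyMuPDF, imported lazily so png_name() works without it

    doc = fitz.open(pdf_path)
    written = []
    seen = set()  # the same image can be referenced by several pages
    for i in range(len(doc)):
        for img in doc.getPageImageList(i):
            xref = img[0]
            if xref in seen:
                continue
            seen.add(xref)
            pix = fitz.Pixmap(doc, xref)
            if pix.colorspace is None:
                if not keep_masks:
                    continue  # skipping is safe for SMasks
                pix = fitz.Pixmap(pix.getPNGData())  # convert to "normal"
            elif pix.n - pix.alpha >= 4:
                pix = fitz.Pixmap(fitz.csRGB, pix)  # CMYK: convert to RGB first
            name = png_name(out_dir, i, xref)
            pix.writePNG(name)
            written.append(name)
    return written
```

Calling extract_images("file.pdf") would then return the list of PNG files written. Each image is saved once, named after the first page that references it.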

manish59 commented 6 years ago

I successfully extracted the images with your suggestion. Thanks for the help. I have another problem: the extracted image comes out upside down or in landscape orientation, although it looks normal in the actual PDF. When I checked doc.rotation, it gave me 0 degrees. Do you have any suggestions for fixing this, or for identifying the orientation of the page?

JorjMcKie commented 6 years ago

For a PDF, you can use Page.rotation. Should give you some integer multiple of 90°. I am not sure about your overall intention - but maybe an alternative approach for extracting images will help you:

If you are not actually interested in knowing which image appears on which page, you can scan directly through all objects of the PDF. This ignores the pages (and consequently does not need the page tree). If an object is an image, extract it, and:

either directly store it away with its original file extension (jpeg, tiff, whatever), or,
if an SMask is specified, only then convert it to a Pixmap with the SMask applied, and then store it as a PNG.

This is more or less what extract-img2.py does. This script has the following advantages:

it also works for quite severely damaged PDFs (most other tools would refuse to even start working here)
it contains some logic to ignore images that are not interesting (too small, just unicolor rectangles, ridiculous width-height-ratio, whatever)
it only uses SMasks for transparency handling and ignores them otherwise
it stores images in their original format (not all were born as PNG ...), and does not care about page rotation, because it does not even know about pages
it is fast!
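The "not interesting" filter mentioned in the list above could look something like the following sketch. The function name is_interesting and both thresholds are assumptions chosen for illustration; they are not the values used by extract-img2.py, and the check for unicolor rectangles is not covered here.

```python
def is_interesting(width, height, min_side=100, max_ratio=10.0):
    """Return True if an image of the given pixel size is worth extracting.

    min_side:  smallest acceptable edge length in pixels (assumed value)
    max_ratio: largest acceptable long-edge / short-edge ratio (assumed value)
    """
    if width <= 0 or height <= 0:
        return False
    if min(width, height) < min_side:
        return False  # too small
    ratio = max(width, height) / min(width, height)
    return ratio <= max_ratio  # reject extreme width-height ratios
```

An image of 1168 x 823 pixels passes this filter, whereas a 50 x 50 icon or a 5000 x 120 separator line does not.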

manish59 commented 6 years ago

I'm sorry if I didn't explain this correctly. For example, I have a scanned PDF which contains 4 images as 4 pages. When I extract the images, their orientation is distorted. I tried, but I don't know why this happens. Can you help me identify the rotation of the page, not of the PDF?

JorjMcKie commented 6 years ago

Sure, I am willing to. Can you send me an example that shows the problem, maybe directly via e-mail? I will have a look at it.

Anyway, if you scanned something like 4 images into a PDF that then has 4 pages, the internal PDF structure is still that you have 4 normal PDF pages, each filled completely by an image. Each of those 4 images is embedded in the PDF, and you can extract them in several different ways, as I explained above.

If you determine that the images are not rotated as you expected or would have liked to, I can think of 2 reasons:

(trivial) you placed the images on the scanner in an awkward way :-)
your scanner software made its own decisions about what would be an adequate orientation.

I assume it is reason 2. I have had similar experiences with printer-scanner combi hardware from HP and Epson. If you cannot control the scanner software's behaviour, you do not have many options left. The PDF, and also any viewer software (like MuPDF / PyMuPDF), cannot detect this. The only way is to decide manually later what to do with the image.
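To rule out a page-level rotation before blaming the scanner, the Page.rotation values mentioned earlier in this thread can be listed for the whole document. A minimal sketch: rotated_pages is a hypothetical helper name, and doc is assumed to be anything indexable whose items have a rotation attribute, such as a document opened with fitz.open().

```python
def rotated_pages(doc):
    """Return (page_number, rotation) pairs for pages with a non-zero rotation."""
    result = []
    for i in range(len(doc)):
        rotation = doc[i].rotation % 360  # treat 360 the same as 0
        if rotation:
            result.append((i, rotation))
    return result
```

If this returns an empty list, the PDF stores no page-level rotation, which would be consistent with reason 2.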

But again - do send me something, maybe I get another idea when looking at it.

manish59 commented 6 years ago

In the meantime, can you tell me how to increase the resolution of the extracted image? I tried using page.getPixmap() and pix.writePNG(), but the resulting image has a very low resolution. Is there any way to increase it? I am attaching a sample.

JorjMcKie commented 6 years ago

Use a matrix for zooming like so.

zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)

The resulting png in this case has 4 times more pixels per area
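The arithmetic behind the zoom factor can be sketched as follows. It relies on the fact that one PDF point is 1/72 inch, so a zoom of 1 corresponds to 72 dpi; the helper names zoom_for_dpi and rendered_size are made up for this example, and the pixel sizes are approximate because the renderer does its own rounding.

```python
PDF_DPI = 72  # one PDF point is 1/72 inch, so zoom 1 renders at 72 dpi


def zoom_for_dpi(dpi):
    """Zoom factor to pass to fitz.Matrix(zoom, zoom) for a target resolution."""
    return dpi / PDF_DPI


def rendered_size(page_width_pt, page_height_pt, zoom_x, zoom_y=None):
    """Approximate pixel size of a page rendered with the given zoom factors."""
    if zoom_y is None:
        zoom_y = zoom_x  # same zoom in both directions
    return round(page_width_pt * zoom_x), round(page_height_pt * zoom_y)
```

For example, zoom_for_dpi(144) gives 2.0, and an A4 page of 595 x 842 points then renders at roughly 1190 x 1684 pixels, i.e. four times as many pixels as at zoom 1. The zoom factor is therefore not a property stored in the PDF; it is chosen according to the resolution wanted for the output.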

manish59 commented 6 years ago

Can you give me your e-mail address so that I can send you a sample?

JorjMcKie commented 6 years ago

just use the one given in the home page jorj.x.mckie@outlook.com

JorjMcKie commented 6 years ago

sorry: outlook.de (I'm German)

JorjMcKie commented 6 years ago

you haven't sent me anything, so I assume your issue is fixed. Come back otherwise.

manish59 commented 6 years ago

Yeah, the issue is fixed. If you are the author of the library, please add a sample showing how to extract an image at a given resolution. That would be great.

JorjMcKie commented 6 years ago

I am preparing a documentation update which will contain an extra chapter explaining typical tasks. I can see that the tutorial already contained therein is not always sufficient.

On the other hand, most people don't take the time to read the complete manual (now more than 150 pages) - so something in between would be beneficial ...

manish59 commented 6 years ago

Hey JorjMcKie, can we detect the color of text in a PDF? I need to recognise the color of the text to differentiate between hyperlinks and non-hyperlinks.

JorjMcKie commented 6 years ago

Hey @manish59, sorry, there is no way to do that (that I know of).

But here are a few remarks, and what you could do instead:

there may be hyperlinks whose "hot area" does not enclose text (but an image instead)
there is no convention whatsoever saying that text in hyperlinks has to have a certain color - it may be no different from the text around it. This is the case e.g. for the hyperlinks in the Adobe manual.
a list of all hyperlinks on a page is provided by page.getLinks(). Each item in that list (a dict) contains the hot area in key "from", a fitz.Rect. You could check for text within that rectangle, e.g. by using a script like this one.

For example:

>>> import fitz
>>> doc = fitz.open("pymupdf.pdf")
>>> page = doc[8]
>>> lnks = page.getLinks()
>>> lnks[0]             # first link on that page
{'kind': 2, 'xref': 639, 'from': fitz.Rect(220.30599975585938, 320.5679931640625, 253.1300048828125, 332.74200439453125), 'uri': 'http://www.mupdf.com/'}
>>> words = page.getTextWords()  # list of all words on that page
>>> rect = lnks[0]["from"]     # rect of the link
>>> for w in words:              # browse through the words
    wrect = fitz.Rect(w[:4])   # make a rect from the word's bbox coords
    if wrect.intersects(rect):  # check if word at least intersects link rect
        print(w[4])                 # print word if yes

MuPDF1
>>>
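The same word lookup can be written as a small reusable function. The sketch below deliberately uses plain tuples instead of fitz.Rect so that it runs without PyMuPDF; the names rects_intersect and words_in_rect are hypothetical, and the strict-overlap test is an assumption that may differ from fitz.Rect.intersects in edge cases such as rectangles that merely touch.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_in_rect(words, rect):
    """Return the text of all words whose bbox overlaps rect.

    `words` uses the layout shown above: the first four items of each
    entry are the bbox coordinates, the fifth is the word itself.
    """
    return [w[4] for w in words if rects_intersect(w[:4], rect)]
```

Fed with the coordinates of a link's "from" rectangle and the word list of the same page, this returns the text covered by the link.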

This is an image of the respective part of the page:

Hope that helps?

JorjMcKie commented 6 years ago

Counterexample on page 28 of the Adobe PDF manual (the link is not visually different from the surrounding text):

>>> doc = fitz.open("Adobe PDF Reference 1-7.pdf")
>>> page = doc[27]
>>> lnks = page.getLinks()
>>> lnks[0]
{'kind': 1, 'xref': 4595, 'from': fitz.Rect(210.83999633789062, 120.75799560546875, 264.6600036621094, 134.0780029296875), 'page': 1150, 'to': fitz.Point(196.0, 594.0), 'zoom': 0.0}
#....
>>> for w in words:
    wrect = fitz.Rect(w[:4])
    if wrect.intersects(rect): print(w[4])

Bibliography
>>>

manish59 commented 6 years ago

Hey buddy, thanks for the info, but I was looking for a way to detect the color of text regardless of whether it is a link or not.

liamsuma commented 4 years ago

Use a matrix for zooming like so.
zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)
The resulting png in this case has 4 times more pixels per area

How would you determine zoom factor from a PDF file? Is it randomly assigned or is there a way to get the accurate number from PDF files using fitz? I read the manual but couldn't figure out a way to determine this zoom factor. Thanks for your help

pymupdf / PyMuPDF

Cant extract image from pdf #192