Closed manish59 closed 6 years ago
Ok, let me sort this out ...
If you get 'NoneType' object has no attribute 'n', then the pixmap has no colorspace: it is either an "SMask" (transparency data) of another pixmap, or it is a b/w pixmap for things like fax images.
It seems you are using an image extraction script from an earlier version of PyMuPDF.
Recent versions make a distinction between pixmaps that have an alpha channel and those that do not. Since then, the attribute pix.n can no longer safely be interpreted alone: you must also check for the presence of alpha.
The general formula is pixmap.n = pixmap.colorspace.n + alpha (where alpha is 1 if an alpha channel is present and 0 otherwise), if the pixmap has a colorspace at all. If the colorspace is None, then pixmap.samples consist only of alpha bytes, so to speak, and the pixmap is nothing other than the alpha channel separated off from another pixmap in the PDF. With PyMuPDF, you can also convert such extreme cases to "normal" pixmaps - please see the pixmap chapter of the documentation.
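As a minimal sketch of how this formula can be used when deciding what to do with an extracted pixmap: the helper below (its name classify_pixmap and its return labels are illustrative, not part of PyMuPDF) only reads the attributes colorspace, n and alpha discussed above. Stand-in objects are used so that the snippet runs without a PDF; a real fitz.Pixmap is assumed to expose the same attributes.

```python
from types import SimpleNamespace


def classify_pixmap(pix):
    """Return 'mask', 'gray_or_rgb' or 'cmyk' for a pixmap-like object.

    Only the attributes colorspace, n and alpha are used.
    """
    if pix.colorspace is None:
        # No colorspace: the samples hold mask / alpha bytes only.
        return "mask"
    # Subtract the alpha byte (0 or 1) to get the number of color components.
    color_components = pix.n - int(pix.alpha)
    if color_components < 4:
        return "gray_or_rgb"  # 1 = gray, 3 = RGB
    return "cmyk"  # 4 = CMYK: convert to RGB before writing a PNG


# Stand-ins for real pixmaps (hypothetical values, for illustration only).
rgb_with_alpha = SimpleNamespace(colorspace="DeviceRGB", n=4, alpha=1)
cmyk_no_alpha = SimpleNamespace(colorspace="DeviceCMYK", n=4, alpha=0)
smask = SimpleNamespace(colorspace=None, n=1, alpha=1)

print(classify_pixmap(rgb_with_alpha))
print(classify_pixmap(cmyk_no_alpha))
print(classify_pixmap(smask))
```

Note that RGB with alpha and CMYK without alpha both have n = 4, which is why pix.n alone is ambiguous.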
For extracting images from documents covering all the various possible situations, there are now 4 different scripts in the demo directory:
- extract-img1.py: This demo extracts all images of a PDF that are referenced by pages as PNG files. It checks whether the same image is referenced by multiple pages and whether an image is just an "SMask", and it also converts CMYK images to PNGs.
- extract-img2.py: This demo extracts all images of a PDF as PNG files, whether they are referenced by pages or not. This is a fault-tolerant version and works for many PDFs, even heavily damaged ones. It also tries to avoid "trivial" pictures (e.g. things like blue rectangles) and pictures that are very small.
- extract-img3.py: Similar to extract-img1.py, but it avoids using pixmaps where possible. This means that you will e.g. get JPEG images if the original image was of that format. It also automatically recognizes "SMask" images and avoids extracting them as standalone files.
- extract-img4.py: In contrast to the other 3 scripts, this one works for all document types. It extracts images appearing on document pages and stores them with their original file extension where this is possible. In the general case, the same underlying image file can appear in different renderings on different pages, so recognition of duplicates will not work reliably.
Could I help you? Any more info needed?
I should have added a remark on pixmaps without a colorspace: not all PDFs contain images with a transparency property or b/w-only images. The above problem occurs only in such cases.
Whenever you encounter a pixmap with colorspace "None", you can either just skip it (which is safe for those that are "SMasks" = transparency info), or you can convert it to a pixmap with valid colorspace via
>>> mask # SMasks pixmaps look like this:
fitz.Pixmap(None, fitz.IRect(0, 0, 1168, 823), 1)
>>> pix = fitz.Pixmap(mask.getPNGData()) # convert it to "normal"
>>> pix
fitz.Pixmap(DeviceGRAY, fitz.IRect(0, 0, 1168, 823), 0)
>>> # if required, invert the gray values
>>> pix.invertIRect()
I extracted the images successfully with your suggestion. Thanks for the help. I have another problem: the extracted image comes out upside down or in landscape orientation, although it looks normal in the actual PDF. When I checked doc.rotation, it gave me 0 degrees. Do you have any suggestions to fix this problem, or to identify the orientation of the page?
For a PDF, you can use Page.rotation. It should give you some integer multiple of 90°.
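A minimal sketch of checking the rotation page by page. The helper name is made up for illustration; it relies only on each page object having a rotation attribute, which is why simple stand-in objects are used here instead of a real document.

```python
from types import SimpleNamespace


def rotated_pages(pages):
    """Return {page index: rotation} for all pages whose rotation is not 0."""
    result = {}
    for index, page in enumerate(pages):
        rotation = page.rotation % 360  # normalise, e.g. 360 -> 0
        if rotation != 0:
            result[index] = rotation
    return result


# Stand-in pages; with PyMuPDF one would pass the pages loaded from a document.
pages = [SimpleNamespace(rotation=r) for r in (0, 90, 0, 270)]
print(rotated_pages(pages))
```

This reports only the rotation stored in the PDF page itself. It says nothing about how the embedded image was oriented when it was created.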
I am not sure about your overall intention - but maybe an alternative approach for extracting images will help you:
If you are actually not interested in knowing which image appears on which page, you can directly scan through all objects of the PDF. This ignores the pages (and consequently does not need the page tree). If an object is an image, extract it, and:
This is more or less what extract-img2.py does. This script has the following advantages:
I'm sorry if I didn't explain it correctly. For example, I have a scanned PDF which contains 4 images as 4 pages. When I extract the images, the orientation of the extracted images is distorted, and I don't know why. Can you help me identify the rotation of each page, not of the PDF as a whole?
Sure I am willing to. Can you send me a problem example - maybe directly via my e-mail. I will have a look at it.
Anyway, if you scanned something like 4 images to a PDF that then has 4 pages, the internal PDF structure is still that you have 4 normal PDF pages which are filled completely by images. Each of those 4 images is embedded in the PDF, and you can extract them in several different ways, as I explained above.
If you determine that the images are not rotated as you expected or would have liked to, I can think of 2 reasons:
I assume it is reason 2. I have similar experiences with printer-scanner combi hardware from HP and Epson. If you cannot control the scanner software's behaviour, you have not many options left. The PDF, and also any viewer software (like MuPDF / PyMuPDF) cannot detect this. The only way is to manually decide later what to do with the image.
But again - do send me something, maybe I get another idea when looking at it.
In the meantime, can you tell me how to increase the resolution of the extracted image? I tried using page.getPixmap() and pix.writePNG(), but I am getting a very low-resolution image. Is there any way to increase the resolution? I'm attaching a sample.
Use a matrix for zooming like so.
zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)
The resulting png in this case has 4 times more pixels per area
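To pick a zoom factor for a target resolution, a small calculation helps. This sketch assumes that an unzoomed pixmap corresponds to 72 dpi (one pixel per PDF point); the function names are illustrative only and not part of PyMuPDF.

```python
BASE_DPI = 72  # assumption: unzoomed rendering gives one pixel per point (1/72 inch)


def zoom_for_dpi(target_dpi):
    """Zoom factor needed to render at target_dpi instead of BASE_DPI."""
    return target_dpi / BASE_DPI


def pixel_count_factor(zoom_x, zoom_y=None):
    """How many times more pixels the zoomed image has than the unzoomed one."""
    if zoom_y is None:
        zoom_y = zoom_x  # same zoom in both directions
    return zoom_x * zoom_y


print(zoom_for_dpi(144))      # 2.0
print(pixel_count_factor(2))  # 4
print(zoom_for_dpi(300))
```

The resulting value would then be passed to fitz.Matrix(zoom, zoom) as in the snippet above. The zoom factor is not stored in the PDF; it is chosen by the caller according to the resolution wanted.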
can you give me your email id to send you a sample one
just use the one given in the home page jorj.x.mckie@outlook.com
sorry: outlook.de (I'm German)
you haven't sent me anything, so I assume your issue is fixed. Come back otherwise.
Yeah, the issue is fixed. If you are the author of the library, please add a sample that shows how to extract an image at a given resolution. That would be great.
I am preparing a documentation update which will contain an extra chapter explaining typical tasks. I can see that the tutorial already contained therein is not always sufficient.
On the other hand, most people don't take the time to read the complete manual (now more than 150 pages) - so something in between would be beneficial ...
Hey JorjMckie, can we detect the color of text in a PDF? I need to recognise the color of the text to differentiate between hyperlinks and non-hyperlinks.
Hey @manish59, Sorry there is no way to do that (that I know of).
But here is what you could do:
Get the list of links on the page via page.getLinks(). Each item in that list (a dict) contains the hot area in key "from", a fitz.Rect. You could check for text within that rectangle, e.g. by using a script like this one. For example:
>>> import fitz
>>> doc = fitz.open("pymupdf.pdf")
>>> page = doc[8]
>>> lnks = page.getLinks()
>>> lnks[0] # first link on that page
{'kind': 2, 'xref': 639, 'from': fitz.Rect(220.30599975585938, 320.5679931640625, 253.1300048828125, 332.74200439453125), 'uri': 'http://www.mupdf.com/'}
>>> words = page.getTextWords() # list of all words on that page
>>> rect = lnks[0]["from"] # rect of the link
>>> for w in words: # browse through the words
        wrect = fitz.Rect(w[:4]) # make a rect from the word's bbox coords
        if wrect.intersects(rect): # check if word at least intersects link rect
            print(w[4]) # print word if yes

MuPDF1
>>>
This is an image of the respective part of the page:
Hope that helps?
A counterexample on page 28 of the Adobe PDF manual (the link has no optical difference):
>>> doc = fitz.open("Adobe PDF Reference 1-7.pdf")
>>> page = doc[27]
>>> lnks = page.getLinks()
>>> lnks[0]
{'kind': 1, 'xref': 4595, 'from': fitz.Rect(210.83999633789062, 120.75799560546875, 264.6600036621094, 134.0780029296875), 'page': 1150, 'to': fitz.Point(196.0, 594.0), 'zoom': 0.0}
#....
>>> for w in words:
        wrect = fitz.Rect(w[:4])
        if wrect.intersects(rect): print(w[4])

Bibliography
>>>
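The same check can be wrapped in a small function. The sketch below uses plain coordinate tuples instead of fitz.Rect so that it runs on its own. It assumes each word is a tuple whose first four items are the bbox (x0, y0, x1, y1) and whose fifth item is the text, as in the sessions above. The function names and the sample coordinates are made up for illustration, and the strict overlap test is a simplification that may treat touching edges differently from fitz.Rect.intersects.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, given as (x0, y0, x1, y1), overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def words_in_rect(words, rect):
    """Return the text of every word whose bbox overlaps rect."""
    return [w[4] for w in words if rects_intersect(w[:4], rect)]


# Made-up sample data: three words on one line, and a link covering the middle one.
words = [
    (100.0, 100.0, 140.0, 112.0, "See"),
    (145.0, 100.0, 190.0, 112.0, "MuPDF"),
    (195.0, 100.0, 230.0, 112.0, "docs"),
]
link_rect = (144.0, 99.0, 191.0, 113.0)

print(words_in_rect(words, link_rect))
```

Running this over every item of page.getLinks() would give the text covered by each link, which identifies link text even when it has no visual difference from its surroundings.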
Hey buddy, thanks for the info, but I was looking for a way to detect the color of the text irrespective of whether it is a link or not.
Use a matrix for zooming like so.
zoom = 2 # zoom factor
mat = fitz.Matrix(zoom, zoom) # x and y direction could be zoomed independently
pix = page.getPixmap(matrix = mat, alpha = False)
The resulting png in this case has 4 times more pixels per area
How would you determine zoom factor from a PDF file? Is it randomly assigned or is there a way to get the accurate number from PDF files using fitz? I read the manual but couldn't figure out a way to determine this zoom factor. Thanks for your help
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5: # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else: # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None
I used this code to extract images from the PDF. It worked for some PDFs but not for others. I was getting errors like "'NoneType' object has no attribute 'n'" and "DeviceCMYK not supported for png". Please let me know how to fix this and extract images from the PDF.