Closed MikeTheWatchGuy closed 6 years ago
I will look into it. In the meantime please let me know your configuration (OS, PyMuPDF version, bitness, Python version etc.). Thanks!
Hm - it seems you have detected an error in the underlying C library MuPDF. Of course they normally catch broken documents and are able to recover from more situations than many other products. But the damage to your document is unfortunate enough to exhibit an uncovered situation there. I will file a bug in their system. Is there anything in your file that would prohibit sending it to them (data protection)? Depending on your urgency, I could develop a quick and dirty circumvention in the following way:
```python
try:
    b = open("file.pdf", "rb").read()
    doc = fitz.open("pdf", b)
    # do something with doc
except:
    # handle invalid document here
```
This would give me some control over the logic flow.
You could of course do something similar:
Your document's tail is cut off - it looks just fine until it prematurely ends. It does not end with the characters `%%EOF<LF>`, which is mandatory. So this is easily detectable: `if not b.endswith(b"%%EOF\n"): -> error`.
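The check above can be sketched as a small standalone helper. One caveat (my assumption, not from the discussion): real-world writers sometimes end the file with `%%EOF` followed by `\r\n`, or with no newline at all, so a tolerant version strips trailing whitespace first. The name `looks_like_pdf_tail` is invented here for illustration:

```python
def looks_like_pdf_tail(data: bytes) -> bool:
    """Return True if the byte content ends with the mandatory %%EOF marker.

    Trailing whitespace (\\n, \\r\\n, spaces) is tolerated, since PDF
    writers differ in how they terminate the file.
    """
    return data.rstrip().endswith(b"%%EOF")

# usage sketch:
# with open("file.pdf", "rb") as f:
#     b = f.read()
# if not looks_like_pdf_tail(b):
#     ...  # treat as damaged: skip the file or attempt repair
```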
Please let me know your reaction.
Another investigation result: I found the place, where MuPDF fails to catch an exception. After correcting this in their C code, re-generating MuPDF and PyMuPDF, opening your error document now leads to a proper exception in Python.
Again my question: what is your configuration?
Wow you JUMPED on this but quick!! I'm mega impressed. Thank you for the IMMEDIATE fix to my own code, the ability to check for an illegal file so I don't make the call.
A larger view of my call to open is this:
```python
try:
    doc = fitz.open(input_path + '\\' + file)
except:
    continue
```
Sorry I'm slow in comparison...
My setup: PyCharm, Anaconda distribution of Python 3.6.2 (Sept 19 2017). `pip list` tells me that I have 'fitz 1.11.1'. Windows 10, Core i9 CPU, 64 GB RAM (48 free).
I'll add the check for EOF, as that's such a simple thing and will likely solve my issue. I'll also put in any fixes that other people add and see if I am then able to catch the exceptions instead of crashing. Thank you! -mb (BTW... having some issues installing packages using pip on Windows. It was working fine; now I have to download and run setup instead of pip install.)
My code now looks like this and is humming along.... thanks to your help!
```python
with open(input_path + '\\' + file, "rb") as f:     # see if the file is a legal PDF file
    b = f.read()
if not b.endswith(b"%%EOF\n"):                      # quick PDF file validation using end-of-file marker
    print('* file does not end in EOF * skipping file {}'.format(file))
    continue
# =================== Open the PDF file ====================#
try:
    doc = fitz.open(input_path + '\\' + file)
except:
    continue
```
I'm sure this could be written in a better or clearer way. I'm only a couple months with Python.
Thanks for the flattering compliments :-) I have to admit, I am a seasoned guy living in Venezuela after my retirement ... so I do have enough time. After all I can't be in the swimming pool or harvest coconuts all day long.
Concerning installation options:
You do not have to use pip. The repo https://github.com/JorjMcKie/PyMuPDF-Optional-Material contains alternative ways: download the zip file fitting your config (which should obviously be https://github.com/JorjMcKie/PyMuPDF-Optional-Material/blob/1.11.2/binary_setups/pymupdf-1.11.2-py36-x64.zip), unzip it to e.g. your Desktop, and open a command prompt in the sub-directory containing `setup.py`. Then perform `py setup.py install`.
As you can see, v1.11.2 is not the master branch in that repo yet (but will be shortly). You can use it without concern, however.
If you choose that path, be aware of the following:
When you open your cracked PDF, nothing bad will happen - not even an exception. However, the document will appear to have zero pages, and displaying the attributes `doc.openErrCode` and `doc.openErrMsg` will show `(2, 'cannot find startxref')`. So you can deduce this PDF has severe problems.
The good side of this is: you can still extract images from the (at least somehow opened) PDF! We have the utility extract_img2.py, which uses no pages for this. I let it run on my system and it created over 100 PNGs from the broken PDF ...
I have submitted a bug at MuPDF's site. From past experience I have to say though, that a prompt response is not to be expected.
Another comment for using PyMuPDF:
In your intermediate solution, when you `fitz.open` the PDF (after checking for a valid file EOF), you can pass in the bytes object `b` directly, because we support memory-resident documents: `doc = fitz.open("pdf", b)`. This should save you a few milliseconds, because MuPDF itself won't need to access the disk (again).
Thanks! I made the change :-) 👍
You've also demonstrated something that I'm thoroughly enjoying about Python, even on my Pi..... It's so different from the world I have lived in, where reading an entire file into RAM was considered a bad thing, or inefficient.
With Python and today's hardware, it's possible to work in a more natural feeling way. I've got 33 GB of RAM free at the moment. I don't think reading an entire PDF file into RAM is going to dent that in any way.
I will, however, take that efficiency you just gave me so that the file is only read once :-) I appreciate the tutorial. I based my code from the extract_img1.py code. I will try to use the other one you posted since it's more flexible in handling corrupt PDFs. For now I'm enjoying speeds that are BLISTERING compared to all of the "PDF Image Extraction" programs I've tried. 20 to 40 times faster at least.
Thanks again for the feedback!
Another difference between extract_img1 and extract_img2: The latter will encounter every image of the PDF only once. With extract_img1 it could be multiple times when multiple pages show the same picture (e.g. as a watermark or copyright stamp or something) ...
I'm really liking this Python thing! SO quick and easy to work with.
I dropped your extract_img2 code into my program and it's running through a test case now of 1,000
PDF files that resulted in about 430 kept images. I'm filtering out small images.
It's amazing how quickly it did this. 158 seconds in total to process all 1,000 PDFs.
It's QUICKER to learn Python, get your code, run on a data set....
than it is to buy a Windows program that extracts.
I would be waiting MANY HOURS to get through 1,000 PDFs.
Thanks again for the help, and thank you for creating this package to begin with!
I would like to suggest putting your two examples into functions and adding a `__main__` check at the bottom. You've written enough code in those examples that they can be called directly to extract images. I think people likely use your code in this way, but what do I know. Hmmmm... maybe an opportunity for me to CONTRIBUTE to GitHub??
Indeed - so far I could only persuade one of our users to contribute. He created this Wiki page, which has since helped others to install. So, you are very welcome to suggest a change or something new via a pull request. And I am more than willing to discuss design decisions beforehand or whatever.
In the end, this is how the package will evolve. Just at the beginning of the week a Chinese user complained about missing support when creating new PDF pages with Chinese text on it ... that gave me an itch to really try accomplishing this ... that kept me busy until today, but it's there now.
OK! I'll sign up to making the changes to the two examples so that they are encapsulated and utilize the main module name check if you think that's a worthwhile thing to do. I know it's not much, but I don't have a ton of time to spend on this project. I'm working on my own stuff :-) But, I do want to give back. You enabled me to do some amazing stuff with very little effort on my part, saving me perhaps weeks or months of time given the 500,000 PDFs I'm processing.
Wow - half a million PDF's??!! What the ... are you working at?
Never mind if you only have time for a first sketch. Don't be shy about just sending me half-baked stuff via e-mail if you are hesitant to publish it right away. I am willing to join in polishing it up.
Please let me know how I can contact you directly via email....
The PDFs with problems have graduated. This next batch passes the awesome EOF check you gave me. It looks like I'll need some kind of lower-level patch to really get past this problem. Ideally the lower layers will raise an exception that ripples back to me.
I'm attaching one of the new batch of failures. MelanO Twisted - Aufsatz 6 mm Smaragd Grün Edelstahl GELBGOLD vergoldet.pdf
My e-mail is also mentioned in the repo readme: jorj.x.mckie@outlook.de.
I have had a look at that PDF in the meantime - again quite a crippled guy. I wonder what type of software created them in the first place.
Anyway, I modified extract_img2.py and PyMuPDF a bit again (both are currently uploading). The script now extracts 93 images and runs through without exception, but of course with numerous error messages. These error messages result from `try / except` clauses wherever it failed before. The new script output looks like this:
```
error: expected object number
warning: repairing PDF document
error: invalid key in dict
warning: undefined link destination
file: MelanO Twisted - Aufsatz 6 mm Smaragd Grün Edelstahl GELBGOLD vergoldet.pdf, pages: 4, objects: 467
warning: ... repeated 8 times ...
error: invalid key in dict
error: broken PDF: xref is not a stream
error: broken PDF: xref is not a stream
run time 3.11
extracted images 93
```
You may not be aware of the following:
Images embedded in PDFs may be accompanied by a second quasi-image, which serves as a container for transparency attributes. If that's the case, the main image contains a pointer to that quasi-image via an `/SMask` entry in its object definition. A true recreation of the original then requires combining the main image with its SMask.
PyMuPDF supports this case by taking the respective pixmaps and using the SMask one as an alpha-attribute modifier for the main one. This logic is contained in extract_img1.py, but not in extract_img2.py so far. To heal that, another regular-expression search for `/SMask` would be necessary; if found, take the following integer as an xref and create the mask pixmap from it ...
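The regular-expression step just described could look like the following pure-string sketch (the sample object string and the helper name `find_smask_xref` are illustrative, not from extract_img2.py; combining the two pixmaps afterwards would use `fitz.Pixmap` and `setAlpha` as in extract_img1.py):

```python
import re

def find_smask_xref(objstring: str) -> int:
    """Return the xref number following an /SMask key in a PDF object
    definition string, or 0 if the object has no /SMask entry.

    An /SMask value is an indirect reference of the form '/SMask 25 0 R'.
    """
    m = re.search(r"/SMask\s+(\d+)\s+\d+\s+R", objstring)
    return int(m.group(1)) if m else 0

# example object definition (made up for illustration):
obj = "<</Type/XObject/Subtype/Image/Width 600/SMask 25 0 R>>"
print(find_smask_xref(obj))   # -> 25
```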
I have just received a reaction from MuPDF because of my bug report. They concluded it must be a broken PDF and are asking for the example PDF. Any issue for you with submitting 1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf to them?
And thanks for the background of the work you are doing ... I felt somehow reminded of one of William Gibson's novels of the IDORU trilogy (a favorite writer of mine).
As per my comment on supporting masked images: I couldn't find rest until I implemented `/SMask` support in extract_img2.py as well. I am going to upload it in a moment. The pictures of the watches in the two (damaged) files now look much better with their masks!
OK to close the issue for now, @MikeTheWatchGuy ?
@MikeTheWatchGuy - fyi only: I have to ask MuPDF for pardon: your original problem goes back to an error inside PyMuPDF! After their response I had another, much deeper look at our code and found a place where we failed to catch a MuPDF exception. The bugfix will be published with release v1.11.2 - the circumvention currently implemented in the Windows binaries you are using does, in essence, the same thing, so there is no effect on you.
Hi again! Sorry I haven't answered the 'close' question... glad you went ahead and closed :-)
I finally figured out my PIP problem. https://stackoverflow.com/questions/46499808/pip-throws-typeerror-parse-got-an-unexpected-keyword-argument-transport-enco
TensorFlow! I just needed to overwrite a bunch of files to get pip operational again.
Then I did a pip install --upgrade on PyMuPDF.
I was testing out my code again and found that I'm getting different results using the two different algorithms you supplied in the sample code.
I've integrated two of the solutions into my code. One loops through pages, the other uses the version that has the 'recover' function.
I was surprised to see that the one that uses pages has a great deal more pictures than the other. Upon investigation I found a number of the PDFs generated duplicate images. I'm attaching one of my PDFs. Can I bother you to try your version of the library and your test app to see if you get similar results when running the two techniques?
My page-based one finds 6, the other 12.
Thank you so much for my first real GitHub interaction on a project that's active. It's been such a great experience.
112655604373 VINTAGE GRUEN - PRECISION - 10K RGP White Gold Rectangular Oval Mens Watch.pdf
I'm SORRY!!
DOH!
I got myself mixed up. I think the OPPOSITE is happening.
The number of extracted files went down using the non-paged version. It's the paged version that returned duplicates. I also need to check to make sure my paged version has been updated to the latest. I put in the EOF check when we first started talking but did not continue to update that paged version, even though the program on top of all this continued to use the older version.
I guess I should have switched to the non-paged version sooner!
It went from bad to worse in my post to you! And I had gained some credibility before :-(
My PDF DOES have duplicate images. -sigh-
I'm sorry for wasting your time
No reason at all for any concerns, I assure you! And it's no waste of time either, because your feedback will help to harden PyMuPDF ...
I have done some checking on the last submitted PDF.
img1 script extracts 128 images, of which several are duplicates: same image referenced by multiple pages.
img2 extracts 190 (!) images
That img2 detects more than img1 is explainable in principle: img1 strictly looks at each page and only detects images referenced there. Any other image can never be found that way. In addition, img1 can only work if the PDF is healthy enough to possess a valid page tree (= array of page objects). Your very first PDF was an example where this is not the case, so img1 would totally fail there. img1 delivers results on a smaller set of files than img2.
But why there are so many more with img2 is not clear to me currently. If a PDF's page array is intact, and if the PDF has been garbage-collected: who are those 62 images not referenced by any page? They must be purely technical objects (*), for which I haven't yet found a way to recognize and skip them. Please give me some more time for this.
Overall, img2 is the more stable and reliable script for your purpose - it shouldn't forget any image as long as a PDF is at least partly readable.
(*) A "technical" image object refers to images serving as transparency masks (e.g. the ones referenced via `/SMask`) for the actual image of interest. Maybe img2 is not yet clever enough to detect all sorts of those ...
I've got tons of PDF files if you want them.
I'll put the old problem ones into a folder along with a bunch of new ones.
This will allow regression testing for those old boundary condition type failures.
I've put some checks in my code to filter out images that seem 'too large' or are 'too small'. Looking for those 'just right' ones :-)
I'm running another program when I finish extracting all the files that deletes the duplicates. Someday I'll write a little bit of Python code to rip through a folder and do a hash check on the files so I can delete them.
This project of mine has gone on too long for me to now put that in there too. I'll simply run a de-duplicator every week or two.
Hashing is an interesting thought! MuPDF internally creates a MD5 code for each image that is going to be stored in a PDF. An image with the same MD5 is treated as a duplicate and only a new reference is recorded ...
I am certainly interested in PDF test files (not half a million though :-) ...). But a decent mixture of OK and problem files in the range of 20 or so would be great! You could use my e-mail for sending me a zip. Many thanks!
Oh hey! I didn't realize an MD5 was already being done on the file, so maybe I can use the same library you are.
I need to do it on the images that are extracted rather than on the PDF.
Because my function processes an entire folder at a time, it's conceivable that I can keep track of all the hash codes as I process and write each image to disk. Before writing an image, I can simply check my table to see if another image like it was previously written. I don't even need to know which filename clashed, just that it happened :-)
Exactly. And that's where my earlier mentioned pixmap method `getPNGData()` comes in handy: calculate a hash of this bytearray (or bytes?) and only write it to a PNG if it is a new hash.
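That check-before-write idea can be sketched with a plain set of digests (the names here are illustrative; in the real script the bytes would come from `pix.getPNGData()`):

```python
import hashlib

def write_if_new(png_bytes, fname, seen):
    """Write png_bytes to fname only if its SHA-256 digest is new.

    'seen' is a set of digests of images already written.
    Returns True if the file was written, False if it was a duplicate.
    """
    digest = hashlib.sha256(png_bytes).digest()
    if digest in seen:
        return False            # same image content seen before: skip it
    seen.add(digest)
    with open(fname, "wb") as out:
        out.write(png_bytes)
    return True

# usage sketch inside the extraction loop:
# seen = set()
# write_if_new(pix.getPNGData(), "img-%i.png" % xref, seen)
```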
Just as I suspected: my filter for "technical" images was insufficient. This modified extract_img2.py keeps track of all encountered `/SMask`s and tries to delete them at the end of the script.
Now only 117 images are extracted (or left over) instead of 164 ...
The 128 images of img1 are a true subset of those 117, if you kick out the multiples from several pages.
```python
#! python
'''
This demo extracts all images of a PDF as PNG files, whether they are
referenced by pages or not.
It scans through all objects and selects /Type/XObject with /Subtype/Image.
So runtime is determined by number of objects and image volume.
Usage:
extract_img2.py input.pdf
'''
from __future__ import print_function
import fitz
import os, sys, time, re

def recoverpix(doc, item):
    x = item[0]                          # xref of PDF image
    s = item[1]                          # xref of its /SMask
    try:
        pix1 = fitz.Pixmap(doc, x)       # make pixmap from image
    except:
        print("xref %i " % x + doc._getGCTXerrmsg())
        return None                      # skip if error
    if s == 0:                           # has no /SMask
        return pix1                      # no special handling
    try:
        pix2 = fitz.Pixmap(doc, s)       # create pixmap of /SMask entry
    except:
        print("cannot create mask %i for image xref %i" % (s, x))
        return pix1
    # check that we are safe
    if not (pix1.irect == pix2.irect and \
            pix1.alpha == pix2.alpha == 0 and \
            pix2.n == 1):
        print("unexpected /SMask situation: pix1", pix1, "pix2", pix2)
        return pix1
    pix = fitz.Pixmap(pix1)              # copy of pix1, alpha channel added
    pix.setAlpha(pix2.samples)           # treat pix2.samples as alpha values
    pix1 = pix2 = None                   # free temp pixmaps
    return pix

checkXO = r"/Type(?= */XObject)"         # finds "/Type/XObject"
checkIM = r"/Subtype(?= */Image)"        # finds "/Subtype/Image"

assert len(sys.argv) == 2, 'Usage: %s <input file>' % sys.argv[0]
t0 = time.clock()
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()           # number of objects - do not use entry 0!

# display some file info
print(__file__, "PDF: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF - 1))

smasks = []                              # list of /SMask image xrefs
for i in range(1, lenXREF):              # scan through all objects
    try:
        text = doc._getObjectString(i)   # PDF object definition string
    except:
        print("xref %i " % i + doc._getGCTXerrmsg())
        continue                         # skip if error
    isXObject = re.search(checkXO, text) # tests for XObject
    isImage = re.search(checkIM, text)   # tests for Image
    if not isXObject or not isImage:     # not an image object if not both True
        continue
    txt = text.split("/SMask")
    if len(txt) > 1:
        y = txt[1].split()
        mxref = int(y[0])
        smasks.append(mxref)             # never mind duplicate appending
    else:
        mxref = 0
    pix = recoverpix(doc, (i, mxref))
    if not pix:
        continue
    if not pix.colorspace:               # not an error, just a mask!
        continue
    imgcount += 1
    if pix.colorspace.n < 4:             # can be saved as PNG
        pix.writePNG("img-%i.png" % (i,))
    else:                                # CMYK: must convert it first
        pix0 = fitz.Pixmap(fitz.csRGB, pix)
        pix0.writePNG("img-%i.png" % (i,))
        pix0 = None                      # free Pixmap resources
    pix = None                           # free Pixmap resources

# now delete any /SMask files not filtered out before
removed = 0
for xref in smasks:
    fn = "img-%i.png" % xref
    if os.path.exists(fn):
        os.remove(fn)
        removed += 1

t1 = time.clock()
print("run time", round(t1 - t0, 2))
print("extracted images", (imgcount - removed))
```
With this script all duplicate or small files of a directory of PNGs are removed in blinding speed. Only 34 images from the above PDF survived this process.
```python
import hashlib
import os, sys

pngdir = sys.argv[1]                 # where the PNGs live
pngfiles = os.listdir(pngdir)
shatab = []
dups = 0
total = 0
small = 2048                         # file size limit
for f in pngfiles:
    if not f.endswith(".png"):
        continue
    total += 1
    fname = os.path.join(pngdir, f)
    x = open(fname, "rb").read()
    m = hashlib.sha256()
    m.update(x)
    f_sha = m.digest()
    if f_sha in shatab or len(x) <= small:
        os.remove(fname)
        dups += 1
    else:
        shatab.append(f_sha)
print("Removed %i duplicate or small files from a total of %i." % (dups, total))
```
I think we can close this issue again now ...
You are SO amazingly AWESOME Jorj!!!
Damn, I was going to write one of these.
You need to release this on GitHub! Maybe clean it up for release 😊 but it’s very powerful and should be available to simply copy, paste, and use.
This Python trend of solving a problem by searching for it has been extraordinary.
I need to understand, for example more about exceptions. I know already my question has been asked and that someone has posted a VERY elegant solution that I can learn from AND immediately use in my code.
For a seasoned programmer, the boost in productivity is something I never saw coming.
However, I also see people throwing together this code without understanding a single thing about it. Which is comforting in many ways as I am fortunate enough to have a solid education in design and programming… just as YOU do too! LOL
Thanks much! I’ll be stealing this from you too, thank you
-mike
Thank you again, you are very flattering, indeed!
I wanted to come back and discuss your problem concerning how to recognize irrelevant PNGs, i.e. images not showing photos of watches, but arbitrary graphical artifacts (eBay marketing messages, screen control buttons and more stuff like that).
A very useful and very simple filter is that the picture dimensions should be large enough to accept it as a photo. I set that to 100 pixels, which is probably on the low side.
The other recent updates to extract-img.py follow the implicit argument that photos should be distinguishable from other graphics by an adequately defined "complexity" - with reasonable certainty (i.e. what is more harmful: omitting a valuable photo or retaining a useless image ...). So I implemented the hypothesis that a real-world photo should show more resistance against compression methods like that of the PNG file format. I am therefore taking the quotient of PNG-size vs. plain-pixel-size as my complexity measure. What I found looking at your (beautiful!) watch images: this quotient is never below 0.20 for any watch photo. In most cases it is a lot higher (0.3 to 0.4). Therefore, using this as a filter criterion should be all: fast (performance 😊!), easy to implement and correct. I tried several thresholds between 0.01 and 0.05 (corresponding to compression ratios of 100:1 to 20:1, respectively). The value 0.05 seems to be (in my / your 17 PDF cases) high enough and safe enough.
The implementation is very simple; PyMuPDF has it all: it is just the value `len(pix.getPNGData()) / len(pix.samples)`. If a picture passes this test, then `pix.getPNGData()` can be directly used as the content of a file opened as binary, via `ofile.write()`, to create the PNG file.
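The underlying intuition - that photos resist lossless compression much more than flat artwork - can be demonstrated without PyMuPDF, using zlib (the compression PNG itself builds on) as a stand-in; the 0.05 threshold is the one quoted above, and the sample "pixel" buffers are fabricated for illustration:

```python
import os
import zlib

def complexity(raw: bytes) -> float:
    """Ratio of compressed size to raw size - a rough 'photo-likeness' score."""
    return len(zlib.compress(raw)) / len(raw)

noisy = os.urandom(64 * 64 * 3)   # noise-like pixels, as in a photo: barely compressible
flat = bytes(64 * 64 * 3)         # uniform pixels, as in flat artwork: highly compressible

print(complexity(noisy))          # close to 1.0, well above the 0.05 threshold
print(complexity(flat))           # far below the 0.05 threshold
```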
Wow you went all-in
I’m finding that the image size is the best discriminator and I’m not getting a bunch of trash files… thanks to your awesome code!
However, I will look at your suggestion. I need to also incorporate that delete duplicates code. I’m excited to have that!
I understand why other packages would want to use your code for PDF files. It’s FAST, accurate, and really does a great job with no headaches. You should be proud 😊
From: Jorj X. McKie [mailto:notifications@github.com] Sent: Sunday, November 26, 2017 3:12 PM To: rk700/PyMuPDF PyMuPDF@noreply.github.com Cc: MikeTheWatchGuy mike_barnett@hotmail.com; Mention mention@noreply.github.com Subject: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)
Thank you again, you are very flattering, indeed!
I wanted to come back and discuss your problem concerning how to recognize irrelevant PNGs, i.e. images not showing photos of watches, but arbitrary graphical artifacts (EBAY marketing messages, screen control buttons and more stuff like that).
A very useful and very simple filter is, that the picture dimensions should be large enough to accept it as a photo. I set that to 100 pixels, which is probably on the low side.
The other recent updates to extract-img.py follow the implicit argument, that photos should be distinguishable from other graphics by an adequately defined “complexity” – with reasonable certainty (i.e. what is more harmful: omitting a valuable photo or retaining a useless image …). So I implemented the hypothesis, that a real-world photo should show more resistance against compression methods like that of the PNG file format. I am therefore taking the quotient of PNG-size vs. plain-pixel-size as my complexity measure. What I found looking at your (beautiful!) watch images: this quotient is never below 0.20 for any watch photo. In most cases it is a lot higher (0.3 to 0.4). Therefore, using this as a filter criterion should be all: fast (performance 😊!), easy to implement and corect. I tried several thresholds between 0.01 and 0.05 (corresponding to compression ratios of 100:1 to 20:1, respectively). The value 0.05 seems to be (in my / your 17 PDF cases) high enough and safe enough.
The implementaion is very simple, PyMuPDF has it all: it just is the value len(pix.getPNGData()) / len(pix.samples). If a picture passes this test, then pix.getPNGData() can be directly used as the content of a file opened as binary with ofile.write() to create the PNG file.
Von: MikeTheWatchGuy [mailto:notifications@github.com] Gesendet: Sonntag, 26. November 2017 15:14 An: rk700/PyMuPDF PyMuPDF@noreply.github.com<mailto:PyMuPDF@noreply.github.com> Cc: Jorj X. McKie jorj.x.mckie@outlook.de<mailto:jorj.x.mckie@outlook.de>; State change state_change@noreply.github.com<mailto:state_change@noreply.github.com> Betreff: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)
You are SO amazingly AWESOME Jorj!!!
Damn, I was going to write one of these.
You need to release this on GitHub! Maybe clean it up for release 😊 but it’s very powerful and should be available to simply copy, paste, and use.
This Python trend of solving a problem by searching for it has been extraordinary.
I need to understand, for example more about exceptions. I know already my question has been asked and that someone has posted a VERY elegant solution that I can learn from AND immediately use in my code.
For a seasoned programmer, the boost in productivity is something I never saw coming.
However, I also see people throwing together this code without understanding a single thing about it. Which is comforting in many ways as I am fortunate enough to have a solid education in design and programming… just as YOU do too! LOL
Thanks much! I’ll be stealing this from you too, thank you
-mike
From: Jorj X. McKie [mailto:notifications@github.com] Sent: Sunday, November 26, 2017 5:18 AM To: rk700/PyMuPDF PyMuPDF@noreply.github.com Cc: MikeTheWatchGuy mike_barnett@hotmail.com; Mention mention@noreply.github.com Subject: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)
With this script all duplicate or small files of a directory of PNGs are removed in blinding speed. Only 34 images from the above PDF survived this process.
import hashlib
import os, sys

pngdir = sys.argv[1]            # where the PNGs live
pngfiles = os.listdir(pngdir)
shatab = []
dups = 0
total = 0
small = 2048                    # file size limit

for f in pngfiles:
    if not f.endswith(".png"):
        continue
    total += 1
    fname = os.path.join(pngdir, f)
    x = open(fname, "rb").read()
    m = hashlib.sha256()
    m.update(x)
    f_sha = m.digest()
    if f_sha in shatab or len(x) <= small:
        os.remove(fname)
        dups += 1
    else:
        shatab.append(f_sha)

print("Removed %i duplicate or small files from a total of %i." % (dups, total))
My open call is quite simply: doc = fitz.open(input_path + '\\' + file)
I'm experiencing crashes when corrupt PDF files are encountered. I would be happy to pre-screen them if I knew what to look for. I assumed that fitz.open would raise an exception that's passed back to me, but instead it's crashing with this output: error: cannot find startxref warning: trying to repair broken xref warning: repairing PDF document warning: object missing 'endobj' token error: non-page object in page tree uncaught exception: non-page object in page tree
I'm attaching my PDF file that is causing the trouble. I'm using this code to extract images from a large number of PDF files that I've generated using WKHTMLTOPDF. I'm unsure why a few of them are corrupt. I'm working on that end of things.
Is there a different way I can call open so that the exception is passed back to me, letting me skip the file and move on to the next?
Thank you for your time. 1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf