pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Fitz.open is not passing back exception #105

Closed MikeTheWatchGuy closed 6 years ago

MikeTheWatchGuy commented 6 years ago

My open call is quite simply: doc = fitz.open(input_path + '\\' + file)

I'm experiencing crashes when corrupt PDF files are encountered. I would be happy to pre-screen them if I knew what to look for. I assumed that fitz.open would raise an exception that's passed back to me, but instead it's crashing with this information:

error: cannot find startxref
warning: trying to repair broken xref
warning: repairing PDF document
warning: object missing 'endobj' token
error: non-page object in page tree
uncaught exception: non-page object in page tree

I'm attaching my PDF file that is causing the trouble. I'm using this code to extract images from a large number of PDF files that I've generated using WKHTMLTOPDF. I'm unsure why a few of them are corrupt. I'm working on that end of things.

Is there a different way I can call the open that will cause the exception to be passed back to me so that I can skip the file and move on to the next?

Thank you for your time.

1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf

JorjMcKie commented 6 years ago

I will look into it. In the meantime please let me know your configuration (OS, PyMuPDF version, bitness, Python version etc.). Thanks!

JorjMcKie commented 6 years ago

Hm - it seems you have detected an error in the underlying C library MuPDF. Of course they normally catch broken documents and are able to recover from more situations than many other products. But the damage to your document is unfortunate enough to exhibit an uncovered situation there. I will file a bug in their system. Is there anything in your file that would prohibit sending it to them (data protection)? Depending on your urgency, I could develop a quick and dirty circumvention in the following way:

import fitz

try:
    with open("file.pdf", "rb") as f:
        b = f.read()
    doc = fitz.open("pdf", b)
    # do something with doc
except Exception:
    # handle invalid document here
    pass

This would give me some control over the logic flow. You could of course do something similar: your document's tail is cut off. It looks just fine until it prematurely ends. It does not end with the characters %%EOF<LF>, which is mandatory. So this is easily detectable: if not b.endswith(b"%%EOF\n"): -> error.
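A minimal sketch of that pre-screening check, assuming it is acceptable to look for the marker anywhere in the last kilobyte instead of demanding an exact `%%EOF\n` ending (some PDF writers finish with `\r\n` or with no newline at all). The name `has_eof_marker` and the 1024-byte window are illustrative choices, not part of PyMuPDF.

```python
def has_eof_marker(data: bytes, window: int = 1024) -> bool:
    """Return True if the PDF end-of-file marker appears near the end of data.

    data   -- the complete file content as bytes
    window -- how many trailing bytes to inspect (an arbitrary choice here)
    """
    return b"%%EOF" in data[-window:]
```

Passing this check only rules out files whose tail is cut off; a file can contain the marker and still be damaged in other ways.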

Please let me know your reaction.

JorjMcKie commented 6 years ago

Another investigation result: I found the place, where MuPDF fails to catch an exception. After correcting this in their C code, re-generating MuPDF and PyMuPDF, opening your error document now leads to a proper exception in Python.

Again my question: what is your configuration?

MikeTheWatchGuy commented 6 years ago

Wow you JUMPED on this but quick!! I'm mega impressed. Thank you for the IMMEDIATE fix to my own code, the ability to check for an illegal file so I don't make the call.

A larger view of my call to open is this:

        try:
            doc = fitz.open(input_path + '\\' + file)
        except:
            continue

Sorry I'm slow in comparison...

My setup:
Pycharm Anaconda distribution of Python 3.6.2 (Sept 19 2017)
Pip list tells me that I have 'fitz 1.11.1'
Windows 10
Core i9 CPU
64 GB RAM (48 free)

I'll add the checking for EOF as that's such a simple thing and will likely solve my issue. I'll also put in any fixes that other people add and see if I am able then to catch the exceptions instead of crash. Thank you! -mb (BTW... having some issues installing packages using pip on windows. was working fine. Now have to download and run setup instead of pip install)

MikeTheWatchGuy commented 6 years ago

My code now looks like this and is humming along.... thanks to your help!

        with open(input_path + '\\' + file, "rb") as f:             # see if the file is a legal PDF file
            b = f.read()
        if not b.endswith(b"%%EOF\n"):                              # quick PDF file validation using end-of-file marker
            print('* file does not end in EOF *   skipping file {}'.format(file))
            continue
        # =================== Open the PDF file ====================#
        try:
            doc = fitz.open(input_path + '\\' + file)
        except Exception:                                           # skip files that cannot be opened
            continue

I'm sure this could be written in a better or clearer way. I'm only a couple months with Python.

JorjMcKie commented 6 years ago

Thanks for the flattering compliments :-) I have to admit, I am a seasoned guy living in Venezuela after my retirement ... so I do have enough time. After all I can't be in the swimming pool or harvest coconuts all day long.

Concerning installation options:

You do not have to use pip. The repo https://github.com/JorjMcKie/PyMuPDF-Optional-Material contains alternative ways: download the zip file fitting your config (which should obviously be https://github.com/JorjMcKie/PyMuPDF-Optional-Material/blob/1.11.2/binary_setups/pymupdf-1.11.2-py36-x64.zip), unzip it to e.g. your Desktop and open a command prompt in the sub directory containing setup.py. Then perform py setup.py install.

As you can see, v1.11.2 is not the master branch in that repo yet (but will be shortly so). You can use it with no concern, however.

If you choose that path, be aware of the following: When you open your damaged PDF, nothing bad will happen - not even an exception. However, the document will appear to have zero pages, and displaying the attributes doc.openErrCode, doc.openErrMsg will show (2, 'cannot find startxref'). So you can deduce that this PDF has severe problems.
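A small sketch of how a script might act on those two signals. It assumes the opened document exposes `openErrCode` and `openErrMsg` as described above for the v1.11.2 binaries; other PyMuPDF versions may not have them, which is why they are read with `getattr`. The helper name `open_problem` is made up for this illustration.

```python
def open_problem(doc):
    """Return (code, message) if the opened document looks damaged, else None.

    doc is expected to support len() (page count) and, optionally, the
    attributes openErrCode / openErrMsg mentioned in the comment above.
    """
    code = getattr(doc, "openErrCode", 0)
    msg = getattr(doc, "openErrMsg", "")
    if code or len(doc) == 0:
        return (code, msg or "document has no pages")
    return None
```

A caller could log the returned tuple and then decide whether to skip the file or still try an object-based extraction such as extract_img2.py.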

The good side of this is: you can still extract images from the (at least somehow opened) PDF! We have the utility extract_img2.py, which uses no pages for this. I let it run on my system and it created over 100 PNGs from the broken PDF ...

I have submitted a bug at MuPDF's site. From past experience I have to say though, that a prompt response is not to be expected.

JorjMcKie commented 6 years ago

Another comment for using PyMuPDF: In your intermediate solution, when you fitz.open the PDF (after checking for a valid file EOF), you can pass in the bytes object b directly, because we support memory resident documents: doc = fitz.open("pdf", b). This should save you a few milliseconds, because MuPDF itself won't need to access the disk (again).
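The read-once idea could be sketched roughly like this. To keep the example runnable without PyMuPDF installed, the opener is passed in as a parameter; in real use it would be `fitz.open`, called as `fitz.open("pdf", data)` as shown above. The function name `open_from_memory` and the choice to swallow every exception are assumptions made for the illustration.

```python
def open_from_memory(path, opener):
    """Read path once, pre-screen the bytes, and hand them to opener.

    opener is expected to behave like fitz.open("pdf", data).
    Returns the opened document, or None if the file should be skipped.
    """
    with open(path, "rb") as f:
        data = f.read()                  # the only disk access
    if b"%%EOF" not in data[-1024:]:     # truncated file: skip it
        return None
    try:
        return opener("pdf", data)       # open from the bytes already in memory
    except Exception:
        return None                      # treat any open failure as "skip"
```

Typical use would be `doc = open_from_memory(path, fitz.open)` followed by `if doc is None: continue` inside the processing loop.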

MikeTheWatchGuy commented 6 years ago

Thanks! I made the change :-) 👍

You've also demonstrated something that I'm thoroughly enjoying about Python, even on my Pi..... It's so different than the world I have lived in where reading an entire file into RAM was considered a bad thing, or inefficient.

With Python and today's hardware, it's possible to work in a more natural feeling way. I've got 33 GB of RAM free at the moment. I don't think reading an entire PDF file into RAM is going to dent that in any way.

I will, however, take that efficiency you just gave me so that the file is only read once :-) I appreciate the tutorial. I based my code from the extract_img1.py code. I will try to use the other one you posted since it's more flexible in handling corrupt PDFs. For now I'm enjoying speeds that are BLISTERING compared to all of the "PDF Image Extraction" programs I've tried. 20 to 40 times faster at least.

JorjMcKie commented 6 years ago

Thanks again for the feedback!

Another difference between extract_img1 and extract_img2: The latter will encounter every image of the PDF only once. With extract_img1 it could be multiple times when multiple pages show the same picture (e.g. as a watermark or copyright stamp or something) ...

MikeTheWatchGuy commented 6 years ago

I'm really liking this Python thing! SO quick and easy to work with.

I dropped your extract_img2 code into my program and it's running through a test case now of 1,000 PDF files that resulted in about 430 kept images. I'm filtering out small images.

It's amazing how quickly it did this. 158 seconds in total to process all 1,000 PDFs.

It's QUICKER to learn Python, get your code, run on a data set.... than it is to buy a Windows program that extracts.
I would be waiting MANY HOURS to get through 1,000 PDFs.

Thanks again for the help, and thank you for creating this package to begin with!

I would like to suggest putting your two examples into a function and adding check for main at the bottom. You've written enough code in those examples that they can be called directly to extract images. I think people likely use your code in this way, but what do I know. Hmmmm...maybe an opportunity for me to CONTRIBUTE to github??

JorjMcKie commented 6 years ago

Indeed - so far I could only persuade one of our users to contribute. He created this Wiki page, which has since helped others to install. So, you are very welcome to suggest a change or something new via a pull request. And I am more than willing to discuss design decisions beforehand or whatever.

In the end, this is how the package will evolve. Just at the beginning of the week a Chinese user complained about missing support when creating new PDF pages with Chinese text on it ... that gave me an itch to really try accomplishing this ... that kept me busy until today, but it's there now.

MikeTheWatchGuy commented 6 years ago

OK! I'll sign up to making the changes to the two examples so that they are encapsulated and utilize the main module name check if you think that's a worthwhile thing to do. I know it's not much, but I don't have a ton of time to spend on this project. I'm working on my own stuff :-) But, I do want to give back. You enabled me to do some amazing stuff with very little effort on my part, saving me perhaps weeks or months of time given the 500,000 PDFs I'm processing.

JorjMcKie commented 6 years ago

Wow - half a million PDF's??!! What the ... are you working at?

Never mind if you only have time for a first sketch. Don't be shy about just sending me half-baked stuff via e-mail if you are hesitant to publish it right away. I am willing to join in polishing it up.

MikeTheWatchGuy commented 6 years ago

Please let me know how I can contact you directly via email....

The PDFs with problems have graduated. This next batch passes the awesome EOF check you gave me. It looks like I'll need some kind of lower-level patch to really get past this problem. Ideally the lower layers will raise an exception that ripples back to me.

I'm attaching one of the new batch of failures.

MelanO Twisted - Aufsatz 6 mm Smaragd Grün Edelstahl GELBGOLD vergoldet.pdf

JorjMcKie commented 6 years ago

My e-mail is also mentioned in the repo readme: jorj.x.mckie@outlook.de.

I have had a look at that PDF in the meantime - again quite a crippled guy. I wonder what type of software created them in the first place.

Anyway, I modified extract_img2.py and PyMuPDF a bit again (both are currently uploading). The script now extracts 93 images and runs thru without exception, but of course with numerous error messages. These error messages result from

  1. MuPDF during open, which is recovering from them in this PDF case
  2. the extract script. I have inserted try / except clauses wherever it failed before.

The new script output looks like this:

error: expected object number
warning: repairing PDF document
error: invalid key in dict
warning: undefined link destination
file: MelanO Twisted - Aufsatz 6 mm Smaragd Grün Edelstahl GELBGOLD vergoldet.pdf, pages: 4, objects: 467
warning: ... repeated 8 times ...
error: invalid key in dict
error: broken PDF: xref is not a stream
error: broken PDF: xref is not a stream
run time 3.11
extracted images 93

JorjMcKie commented 6 years ago

You may not be aware of the following:

Images embedded in PDFs may be accompanied by a second quasi-image, which serves as a container for transparency attributes. If that's the case, the main image contains a pointer to that quasi-image via an /SMask entry in its object definition. A true recreation of the original then requires combining the main image with its SMask.

PyMuPDF supports this case by taking the respective pixmaps and using the SMask one as an alpha attribute modifier for the main one. This logic is contained in extract_img1.py, but not in extract_img2.py so far. To fix that, another regular expression search for /SMask would be necessary: if it is found, take the following integer as an xref and create the mask pixmap from it ...
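The regular expression search described here might look like the following sketch. It assumes the object definition string writes the mask as an indirect reference of the usual form `/SMask 12 0 R`; the pattern and the helper name `smask_xref` are illustrative and not taken from extract_img2.py.

```python
import re

# "/SMask", whitespace, object number, generation number, "R"
SMASK_RE = re.compile(r"/SMask\s+(\d+)\s+\d+\s+R")

def smask_xref(obj_str):
    """Return the xref number of the /SMask entry in a PDF object string, or 0."""
    m = SMASK_RE.search(obj_str)
    return int(m.group(1)) if m else 0
```

Returning 0 for "no mask" works because xref entry 0 is never a usable object, so the value can double as a flag.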

JorjMcKie commented 6 years ago

I have just received a reaction from MuPDF regarding my bug report. They concluded it must be a broken PDF and are asking for the example PDF. Any issue for you submitting 1960s GRUEN Airflight Vintage Pilot Aviators Military Time Jump Hour Watch.pdf to them?

JorjMcKie commented 6 years ago

And thanks for the background of the work you are doing ... I felt somehow reminded of one of William Gibson's novels of the IDORU trilogy (a favorite writer of mine).

JorjMcKie commented 6 years ago

As per my comment on supporting masked images:

I couldn't rest until I implemented /SMask support into extract_img2.py as well. I am going to upload it in a moment. The pictures of the watches in the two (damaged) files now look much better with their masks!

JorjMcKie commented 6 years ago

OK to close the issue for now, @MikeTheWatchGuy ?

JorjMcKie commented 6 years ago

@MikeTheWatchGuy - fyi only: I have to ask MuPDF for pardon: your original problem goes back to an error inside PyMuPDF! After their response I had another, much deeper look at our code and found a place where we failed to catch a MuPDF exception. The bugfix will be published with release v1.11.2 - the circumvention currently implemented in the Windows binaries, which you are using, in essence does the same thing, so there is no effect on you.

MikeTheWatchGuy commented 6 years ago

Hi again! Sorry I haven't answered the 'close' question... glad you went ahead and closed :-)

I finally figured out my PIP problem. https://stackoverflow.com/questions/46499808/pip-throws-typeerror-parse-got-an-unexpected-keyword-argument-transport-enco

Tensorflow! Just needed to overwrite a bunch of files to get PIP operational again.

I just did a pip install --upgrade on pymu.

I was testing out my code again and found that I'm getting different results using the two different algorithms you supplied in the sample code.

I've integrated two of the solutions into my code. One loops through pages, the other uses the version that has the 'recover' function.

I was surprised to see that the one that uses pages has a great deal more pictures than the other. Upon investigation I found a number of the PDFs generated duplicate images. I'm attaching one of my PDFs. Can I bother you to try your version of the library and your test app to see if you get similar results when running the 2 techniques.

My paged based one finds 6, the other 12.

Thank you so much for my first real GitHub interaction on a project that's active. It's been such a great experience.

112655604373 VINTAGE GRUEN - PRECISION - 10K RGP White Gold Rectangular Oval Mens Watch.pdf

MikeTheWatchGuy commented 6 years ago

I'm SORRY!!

DOH!

I got myself mixed up. I think the OPPOSITE is happening.

The number of extracted files went down using the non-paged version. It's the paged version that returned duplicates. I also need to check that my paged version has been updated to the latest. I put in the EOF check when we first started talking but did not continue to update that paged version, even though the program that sits on top of all this continued to use the older version.

I guess I should have switched to the non-paged version sooner!

MikeTheWatchGuy commented 6 years ago

It went from bad to worse in my post to you! And I had gained some credibility before :-(

My PDF DOES have duplicate images. -sigh-

I'm sorry for wasting your time

JorjMcKie commented 6 years ago

No reason at all for any concerns, I assure you! And it's no waste of time either, because your feedback will help to harden PyMuPDF ...

I have done some checking on the last submitted PDF.

That img2 detects more than img1 is explainable in principle: img1 strictly looks at each page and only detects images referenced there. Any other image can never be found that way. In addition, img1 can only work if the PDF is healthy enough to possess a valid page tree (= array of page objects). Your very first PDF was an example where this is not the case, so img1 would totally fail there. img1 delivers results on a smaller set of files than img2.

But why there are so many more with img2 is not clear to me currently. If a PDF's page array is intact, and if the PDF has been garbage-collected: what are those 62 images not referenced by any page? They must be purely technical objects (*), which I haven't yet found a way to recognize and skip. Please give me some more time for this.

Overall, img2 is the more stable and reliable script for your purpose - it shouldn't forget any image as long as a PDF is at least partly readable.


(*) "Technical" image object refers to images serving as transparency masks (e.g. the ones referenced via /SMask) for the actual image of interest. Maybe img2 is not clever enough yet to detect all sorts of those ...

MikeTheWatchGuy commented 6 years ago

I've got tons of PDF files if you want them.

I'll put the old problem ones into a folder along with a bunch of new ones.

This will allow regression testing for those old boundary condition type failures.

I've put some checks in my code to filter out images that seem 'too large' or are 'too small'. Looking for those 'just right' ones :-)

I'm running another program when I finish extracting all the files that deletes the duplicates. Someday I'll write a little bit of Python code to rip through a folder and do a hash check on the files so I can delete them.

This project of mine has gone on too long for me to now put that in there too. I'll simply run a de-duplicator every week or two.

JorjMcKie commented 6 years ago

Hashing is an interesting thought! MuPDF internally creates a MD5 code for each image that is going to be stored in a PDF. An image with the same MD5 is treated as a duplicate and only a new reference is recorded ...

I am certainly interested in PDF test files (not half a million though, :-)) ...). But a decent mixture of OK and problem files in range of 20 or so would be great! You could use my e-mail for sending me a zip. Many thanks!

MikeTheWatchGuy commented 6 years ago

Oh hey! I didn't realize an MD5 hash was already being done on the images, so maybe I can use the same library you are.

I need to do it on the images that are extracted rather than on the PDF.

Because my function processes an entire folder at a time, it's conceivable that I can keep track of all of the hash codes as I process and write each image to disk. Before writing an image, I can simply check my table to see if another image like it was previously written. I don't even need to know which filename clashed, just that it happened :-)

JorjMcKie commented 6 years ago

Exactly. And that's where my earlier mentioned pixmap method getPNGdata() comes in handy: calculate a hash of this bytearray (or bytes?) and only write it to a png if it is a new hash.
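A sketch of that check-before-write step, assuming the PNG bytes have already been produced (for example by the pixmap method mentioned above). The helper name `write_if_new` and the use of SHA-256 are choices made for this example; any stable hash would serve.

```python
import hashlib

def write_if_new(data, path, seen):
    """Write data to path unless identical bytes were written before.

    seen is a set of SHA-256 digests shared across calls.
    Returns True if the file was written, False if it was a duplicate.
    """
    digest = hashlib.sha256(data).digest()
    if digest in seen:
        return False
    seen.add(digest)
    with open(path, "wb") as f:
        f.write(data)
    return True
```

Create one `seen = set()` per folder run and pass it to every call; a set keeps the lookup fast even with many thousands of images.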

JorjMcKie commented 6 years ago

Just as I suspected: My filter for "technical" images was insufficient. This modified extract-img2.py keeps track of all encountered /SMasks and tries to delete them at the end of the script. Now only 117 images are extracted (or left over) instead of 164 ... The 128 images of img1 are a true subset of those 117, if you kick out the multiples from several pages.

#! python
'''
This demo extracts all images of a PDF as PNG files, whether they are
referenced by pages or not.
It scans through all objects and selects /Type/XObject with /Subtype/Image.
So runtime is determined by number of objects and image volume.
Usage:
extract_img2.py input.pdf
'''
from __future__ import print_function
import fitz
import os, sys, time, re

def recoverpix(doc, item):
    x = item[0]  # xref of PDF image
    s = item[1]  # xref of its /SMask

    try:
        pix1 = fitz.Pixmap(doc, x)     # make pixmap from image
    except:
        print("xref %i " % x + doc._getGCTXerrmsg())
        return None                    # skip if error

    if s == 0:                    # has no /SMask
        return pix1               # no special handling

    try:
        pix2 = fitz.Pixmap(doc, s)    # create pixmap of /SMask entry
    except:
        print("cannot create mask %i for image xref %i" % (s,x))
        return pix1

    # check that we are safe
    if not (pix1.irect == pix2.irect and \
            pix1.alpha == pix2.alpha == 0 and \
            pix2.n == 1):
        print("unexpected /SMask situation: pix1", pix1, "pix2", pix2)
        return pix1
    pix = fitz.Pixmap(pix1)       # copy of pix1, alpha channel added
    pix.setAlpha(pix2.samples)    # treat pix2.samples as alpha value
    pix1 = pix2 = None            # free temp pixmaps
    return pix

checkXO = r"/Type(?= */XObject)"       # finds "/Type/XObject"   
checkIM = r"/Subtype(?= */Image)"      # finds "/Subtype/Image"

assert len(sys.argv) == 2, 'Usage: %s <input file>' % sys.argv[0]

t0 = time.clock()
doc = fitz.open(sys.argv[1])
imgcount = 0
lenXREF = doc._getXrefLength()         # number of objects - do not use entry 0!

# display some file info
print(__file__, "PDF: %s, pages: %s, objects: %s" % (sys.argv[1], len(doc), lenXREF-1))

smasks = []   # list of smask image xrefs

for i in range(1, lenXREF):            # scan through all objects
    try:
        text = doc._getObjectString(i) # PDF object definition string
    except:
        print("xref %i " % i + doc._getGCTXerrmsg())
        continue                       # skip if error

    isXObject = re.search(checkXO, text)    # tests for XObject
    isImage   = re.search(checkIM, text)    # tests for Image
    if not isXObject or not isImage:   # not an image object if not both True
        continue

    txt = text.split("/SMask")
    if len(txt) > 1:
        y = txt[1].split()
        mxref = int(y[0])
        smasks.append(mxref)  # never mind duplicate appending
    else:
        mxref = 0

    pix = recoverpix(doc, (i, mxref))

    if not pix:
        continue
    if not pix.colorspace:             # an error or just a mask!
        continue

    imgcount += 1
    if pix.colorspace.n < 4:           # can be saved as PNG
        pix.writePNG("img-%i.png" % (i,))
    else:                              # CMYK: must convert it
        pix0 = fitz.Pixmap(fitz.csRGB, pix)
        pix0.writePNG("img-%i.png" % (i,))
        pix0 = None                    # free Pixmap resources
    pix = None                         # free Pixmap resources

# now delete any /SMask files not filtered out before
removed = 0
for xref in smasks:
    fn = "img-%i.png" % xref
    if os.path.exists(fn):
        os.remove(fn)
        removed += 1

t1 = time.clock()
print("run time", round(t1-t0, 2))
print("extracted images", (imgcount - removed))

JorjMcKie commented 6 years ago

With this script all duplicate or small files in a directory of PNGs are removed at blinding speed. Only 34 images from the above PDF survived this process.

import hashlib
import os, sys

pngdir = sys.argv[1]                 # where the PNGs live
pngfiles = os.listdir(pngdir)
shatab = set()                       # digests of the files kept so far
dups = 0
total = 0
small = 2048                         # file size limit
for f in pngfiles:
    if not f.endswith(".png"):
        continue
    total += 1
    fname = os.path.join(pngdir, f)
    with open(fname, "rb") as fh:
        x = fh.read()
    f_sha = hashlib.sha256(x).digest()
    if f_sha in shatab or len(x) <= small:   # duplicate or too small
        os.remove(fname)
        dups += 1
    else:
        shatab.add(f_sha)

print("Removed %i duplicate or small files from a total of %i." % (dups, total))

JorjMcKie commented 6 years ago

I think we can close this issue again now ...

MikeTheWatchGuy commented 6 years ago

You are SO amazingly AWESOME Jorj!!!

Damn, I was going to write one of these.

You need to release this on GitHub! Maybe clean it up for release 😊 but it’s very powerful and should be available to simply copy, paste, and use.

This Python trend of solving a problem by searching for it has been extraordinary.

I need to understand, for example more about exceptions. I know already my question has been asked and that someone has posted a VERY elegant solution that I can learn from AND immediately use in my code.

For a seasoned programmer, the boost in productivity is something I never saw coming.

However, I also see people throwing together this code without understanding a single thing about it. Which is comforting in many ways as I am fortunate enough to have a solid education in design and programming… just as YOU do too! LOL

Thanks much! I’ll be stealing this from you too, thank you

-mike

From: Jorj X. McKie [mailto:notifications@github.com] Sent: Sunday, November 26, 2017 5:18 AM To: rk700/PyMuPDF PyMuPDF@noreply.github.com Cc: MikeTheWatchGuy mike_barnett@hotmail.com; Mention mention@noreply.github.com Subject: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)

JorjMcKie commented 6 years ago

Thank you again, you are very flattering, indeed!

I wanted to come back and discuss your problem concerning how to recognize irrelevant PNGs, i.e. images not showing photos of watches, but arbitrary graphical artifacts (EBAY marketing messages, screen control buttons and more stuff like that).

A very useful and very simple filter is, that the picture dimensions should be large enough to accept it as a photo. I set that to 100 pixels, which is probably on the low side.

The other recent updates to extract-img.py follow the implicit argument that photos should be distinguishable from other graphics by an adequately defined "complexity", with reasonable certainty (i.e. what is more harmful: omitting a valuable photo or retaining a useless image ...). So I implemented the hypothesis that a real-world photo should show more resistance against compression methods like that of the PNG file format. I am therefore taking the quotient of PNG-size vs. plain-pixel-size as my complexity measure. What I found looking at your (beautiful!) watch images: this quotient is never below 0.20 for any watch photo. In most cases it is a lot higher (0.3 to 0.4). Therefore, using this as a filter criterion should be all of: fast (performance 😊!), easy to implement and correct. I tried several thresholds between 0.01 and 0.05 (corresponding to compression ratios of 100:1 to 20:1, respectively). The value 0.05 seems to be (in my / your 17 PDF cases) high enough and safe enough.

The implementation is very simple, PyMuPDF has it all: it is just the value len(pix.getPNGData()) / len(pix.samples). If a picture passes this test, then pix.getPNGData() can be used directly as the content of a file opened as binary, via ofile.write(), to create the PNG file.
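The two filters from this comment (minimum dimensions and the size quotient) can be sketched as below. The thresholds of 100 pixels and 0.05 come from the text above; applying the 100-pixel limit to both width and height is an assumption, and the function names are made up for the example. With PyMuPDF the arguments would be `pix.getPNGData()`, `pix.samples`, `pix.width` and `pix.height`.

```python
def complexity(encoded_size, raw_size):
    """Quotient of compressed size to plain pixel size (0.0 for empty input)."""
    return encoded_size / raw_size if raw_size else 0.0

def looks_like_photo(encoded, samples, width, height,
                     min_dim=100, min_ratio=0.05):
    """Return True if the image passes both the size and the complexity filter.

    encoded -- the compressed image bytes (PNG data in the thread above)
    samples -- the uncompressed pixel bytes
    """
    if width < min_dim or height < min_dim:
        return False
    return complexity(len(encoded), len(samples)) >= min_ratio
```

The idea can be tried without any PDF by using `zlib.compress` as a stand-in for PNG encoding (PNG also uses deflate): a buffer of identical bytes compresses to a tiny fraction of its size and is rejected, while random bytes barely compress at all and pass.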

MikeTheWatchGuy commented 6 years ago

Wow you went all-in

I’m finding that the image size is the best discriminator and I’m not getting a bunch of trash files… thanks to your awesome code!

However, I will look at your suggestion. I need to also incorporate that delete duplicates code. I’m excited to have that!

I understand why other packages would want to use your code for PDF files. It’s FAST, accurate, and really does a great job with no headaches. You should be proud 😊

From: Jorj X. McKie Sent: Sunday, November 26, 2017 3:12 PM To: rk700/PyMuPDF Subject: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)

Thank you again, you are very flattering, indeed!

I wanted to come back and discuss your problem concerning how to recognize irrelevant PNGs, i.e. images not showing photos of watches, but arbitrary graphical artifacts (EBAY marketing messages, screen control buttons and more stuff like that).

A very useful and very simple filter is that the picture dimensions should be large enough to accept it as a photo. I set the minimum to 100 pixels, which is probably on the low side.

The other recent updates to extract-img.py follow the implicit argument that photos should be distinguishable from other graphics by an adequately defined “complexity”, with reasonable certainty (i.e. what is more harmful: omitting a valuable photo or retaining a useless image …). So I implemented the hypothesis that a real-world photo should show more resistance against compression methods like that of the PNG file format. I am therefore taking the quotient of PNG size vs. plain pixel size as my complexity measure. What I found looking at your (beautiful!) watch images: this quotient is never below 0.20 for any watch photo. In most cases it is a lot higher (0.3 to 0.4). Therefore, using this as a filter criterion should be all of these: fast (performance 😊!), easy to implement and correct. I tried several thresholds between 0.01 and 0.05 (corresponding to compression ratios of 100:1 to 20:1, respectively). The value 0.05 seems to be (in my / your 17 PDF cases) high enough and safe enough.

The implementation is very simple, PyMuPDF has it all: the measure is just the value len(pix.getPNGData()) / len(pix.samples). If a picture passes this test, then pix.getPNGData() can be used directly as the content of the PNG file: open the file in binary mode and pass the data to ofile.write().
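The two filters can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual extract-img.py code: the function names and constants are made up for the example, and zlib compression stands in for pix.getPNGData() so the sketch runs without PyMuPDF (PNG also uses DEFLATE, but the sizes will not match exactly). With a real pixmap you would pass pix.width, pix.height, pix.getPNGData() and pix.samples instead.

```python
import os
import zlib

MIN_DIM = 100      # minimum width/height in pixels (value from the message above)
MIN_RATIO = 0.05   # minimum compressed/raw size quotient (value from the message above)


def complexity(compressed: bytes, raw: bytes) -> float:
    """Quotient of compressed size to raw pixel size; higher means harder to compress."""
    return len(compressed) / len(raw) if raw else 0.0


def looks_like_photo(width, height, compressed, raw,
                     min_dim=MIN_DIM, min_ratio=MIN_RATIO):
    """Apply the dimension filter first, then the complexity filter."""
    if width < min_dim or height < min_dim:
        return False
    return complexity(compressed, raw) >= min_ratio


# Stand-in data (assumption): zlib replaces pix.getPNGData(), raw bytes replace pix.samples.
w = h = 200
flat = bytes(w * h * 3)          # uniform black RGB image, compresses extremely well
noisy = os.urandom(w * h * 3)    # random pixels, practically incompressible
small = noisy[:50 * 50 * 3]      # complex content, but only 50 x 50 pixels

flat_ok = looks_like_photo(w, h, zlib.compress(flat), flat)
noisy_ok = looks_like_photo(w, h, zlib.compress(noisy), noisy)
tiny_ok = looks_like_photo(50, 50, zlib.compress(small), small)
print(flat_ok, noisy_ok, tiny_ok)
```

The flat image is rejected by the complexity test, the small one by the dimension test, and only the large, hard-to-compress image passes. Checking the dimensions first skips the cost of compressing images that would be discarded anyway.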

From: MikeTheWatchGuy Sent: Sunday, November 26, 2017 15:14 To: rk700/PyMuPDF Subject: Re: [rk700/PyMuPDF] Fitz.open is not passing back exception (#105)

You are SO amazingly AWESOME Jorj!!!

Damn, I was going to write one of these.

You need to release this on GitHub! Maybe clean it up for release 😊 but it’s very powerful and should be available to simply copy, paste, and use.

This Python trend of solving a problem by searching for it has been extraordinary.

I need to understand more about exceptions, for example. I already know my question has been asked and that someone has posted a VERY elegant solution that I can learn from AND immediately use in my code.

For a seasoned programmer, the boost in productivity is something I never saw coming.

However, I also see people throwing together this code without understanding a single thing about it. Which is comforting in many ways as I am fortunate enough to have a solid education in design and programming… just as YOU do too! LOL

Thanks much! I’ll be stealing this from you too, thank you

-mike
