virantha / pypdfocr

Python script to do PDF OCR conversion using Tesseract
Apache License 2.0
372 stars 114 forks

Large multi page PDFs increase in processing time exponentially. #37

Closed matt12eagles closed 8 years ago

matt12eagles commented 8 years ago

Hello,

I am using pypdfocr on Windows. It is a great tool when compiled into an easy-to-use EXE. I have passed about 600 images through the tool, recording the start and end time of each. Large PDF images (300+ pages) take quite long, about 3-5 hours, while 200-page images take about 45 minutes. Since some of my images are 1000+ pages, this becomes quite troublesome. My theory is that the issue is caused by the merge piece of pypdf. I will have to debug, but it seems to occur only after Ghostscript and Tesseract complete. This is indicated by the output file size growing and then restarting. Any ideas on how we can make the processing time linear? I will begin investigating the pypdf piece of the process. Thank you again for this tool, Virantha. Its accuracy and simplicity are truly a pleasure to use.

matt12eagles commented 8 years ago

I apologize for the misspelled title... It is supposed to be "multi page", not "multi lateral". Still looking for a way to fix it.

matt12eagles commented 8 years ago

Okay... I think I traced where the time delay is coming from....

It is coming from the def overlay_hocr_pages(self, dpi, hocr_filenames, orig_pdf_filename): function

The issue is with this FOR loop

    for orig_pg, text_pg_filename in zip(self.iter_pdf_page(orig), text_pdf_filenames):
        text_file = open(text_pg_filename, 'rb')
        text_pg = self.iter_pdf_page(text_file).next()
        orig_rotation_angle = int(orig_pg.get('/Rotate', 0))
        if orig_rotation_angle != 0:
            logging.info("Original Rotation: %s" % orig_pg.get("/Rotate", 0))
            self.mergeRotateAroundPointPage(orig_pg, text_pg, orig_rotation_angle, text_pg.mediaBox.getWidth()/2, text_pg.mediaBox.getWidth()/2)
            # None of these commands worked for me:
                #orig_pg.rotateCounterClockwise(orig_rotation_angle)
                #orig_pg.mergeRotatedPage(text_pg,text_rotation_angle)
        else:
            orig_pg.mergePage(text_pg)
        orig_pg.compressContentStreams()
        writer.addPage(orig_pg)
        with open(pdf_filename, 'wb') as f:
            # Flush out this page merge so we can close the text_file
            writer.write(f)
        text_file.close()
    orig.close()

It takes a LONG LONG time to complete for large images.

Watching the compiled PDF, it grows to over 1.5 GB, drops back down to 10 MB, then grows again to 1.52 GB, and so on. It seems to do this for each page of the PDF.

Any ideas how we can speed up the merge??
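
A rough model of why this loop gets slow, with no PDF library involved: `writer.write(f)` sits inside the loop and the output file is reopened with `'wb'`, so each iteration writes out every page added so far. That would also match the file shrinking and regrowing once per page. The sketch below assumes write cost is simply proportional to the amount of data written; the function name and the unit page sizes are made up for illustration.

```python
def bytes_written(page_sizes, write_every_page):
    """Total bytes written while building one output file from the given pages.

    Simplified model: writing the output costs the sum of the sizes of all
    pages added to the writer so far.
    """
    total = 0
    accumulated = 0  # size of everything added to the writer so far
    for size in page_sizes:
        accumulated += size
        if write_every_page:
            # The whole document so far is written again on every iteration
            total += accumulated
    if not write_every_page:
        # A single write after the loop
        total = accumulated
    return total


pages = [1] * 300  # 300 pages of one size unit each (illustrative numbers)
per_page = bytes_written(pages, write_every_page=True)
once = bytes_written(pages, write_every_page=False)
print(per_page, once, per_page / once)
```

Under this model, n equally sized pages cost n(n+1)/2 page writes when the write is inside the loop, against n when it happens once at the end, so the growth is quadratic rather than linear.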

virantha commented 8 years ago

I pushed a fix to the develop branch. Please try it out and let me know if it improves things. I don't have any large pdfs to test this.

Based on info here: http://stackoverflow.com/questions/17104926/pypdf-merging-multiple-pdf-files-into-one-pdf
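
The thread does not show the change that was pushed. One possible shape for such a change, given the comment in the original loop ("Flush out this page merge so we can close the text_file"), is to keep the per-page text files open and write the output a single time after the loop. The sketch below is duck-typed and is not the project's code: `overlay_and_write_once` and `read_first_page` are hypothetical names, `writer` is assumed to expose `addPage` and `write` like the `writer` in the loop quoted earlier, and the rotation branch is left out.

```python
from contextlib import ExitStack


def overlay_and_write_once(orig_pages, text_paths, read_first_page, writer, out_path):
    """Merge each text page onto its original page, then write the output once.

    orig_pages      -- iterable of page objects exposing mergePage() (assumed)
    text_paths      -- one single-page text PDF path per original page
    read_first_page -- hypothetical callable: open binary file -> page object
    writer          -- object exposing addPage(page) and write(fileobj) (assumed)
    out_path        -- path of the merged output file
    """
    with ExitStack() as stack:
        for orig_pg, path in zip(orig_pages, text_paths):
            # Keep every text file open until the final write, in case the
            # page object still reads from it when the output is serialised.
            text_file = stack.enter_context(open(path, "rb"))
            orig_pg.mergePage(read_first_page(text_file))
            writer.addPage(orig_pg)
        # Single write after the loop instead of one write per page
        with open(out_path, "wb") as out:
            writer.write(out)
    # Leaving the ExitStack block closes all text files
```

The trade-off in this sketch is that one file handle per page stays open until the end, which may matter for documents with very many pages.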

matt12eagles commented 8 years ago

Thank you for the update Virantha, I am attempting to run some larger files in the exe now.

I apologize for the delayed response. I just had a son, so I was offline for a little while there.

I'll test and let you know how the results go.

I can attempt to provide a sample of 2-3 PDFs if you wish to test as well.

Thank you,

Matt

matt12eagles commented 8 years ago

Virantha, can you please recompile and re-upload the exe version with the merge PDF change? I compared the file to the one uploaded previously, and the file sizes seem to be exactly the same. I apologize, I am having a difficult time finding where the PyPDF2 change was placed.

virantha commented 8 years ago

@matt12eagles I've updated the package to 0.9.0 and released the .exe (I don't usually build exe's from my non-release branches). Give it a shot now.

matt12eagles commented 8 years ago

Just ran.... and WOW the merge is much much faster. A 64 page document that used to take 1 hour and 10 minutes.... just finished in 12 minutes and 35 seconds. Excellent job and thank you for your great work!!
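
For reference, the speedup implied by the two times reported above can be worked out directly. This only restates the numbers in the comment for that one 64-page document; it says nothing about other files.

```python
before_s = 1 * 3600 + 10 * 60  # 1 hour 10 minutes, as reported above
after_s = 12 * 60 + 35         # 12 minutes 35 seconds, as reported above

speedup = before_s / after_s
print(f"{before_s} s -> {after_s} s, about {speedup:.1f}x faster")
```

That is 4200 seconds against 755 seconds, or roughly 5.6 times faster.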

virantha commented 8 years ago

Great, glad it's better. Marking this issue as closed.

ps: Congrats on the son! (have two myself)