ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Show progress during postprocessing #1313

Open user1823 opened 2 months ago

user1823 commented 2 months ago

For large files, postprocessing takes a lot of time. Showing some progress here would make the UX better.

The main motivation behind this request was that ocrmypdf was stuck on this step (postprocessing) for about 30 minutes.

And now, it is stuck on this step:

jbarlow83 commented 2 months ago

That's when we ask Ghostscript to do PDF/A. Unfortunately, it doesn't give much feedback, so there's not much I can work with. At least I'm not aware of any behavior I can monitor. It's also single-threaded. Color space conversion of large images can be quite expensive in Ghostscript and is often responsible for long delays.

user1823 commented 2 months ago

That's when we ask Ghostscript to do PDF/A.

But, in the above case, I used --output-type pdf. So, there would be no PDF/A conversion.

In the above case, I guess that most of the time was consumed for doing the equivalent of the following (obtained by running with -v1 on a different file):

Postprocessing...                                                                                             ocr.py:145
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
xref 13: treating as an optimization candidate                                                           optimize.py:279
xref 12: treating as an optimization candidate                                                           optimize.py:279
XrefExt(xref=12, ext='.png')                                                                             optimize.py:344
XrefExt(xref=13, ext='.png')                                                                             optimize.py:344
Optimizable images: JPEGs: 0 PNGs: 2                                                                     optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--

Unfortunately, it doesn't give much feedback, so there's not much I can work with. At least I'm not aware of any behavior I can monitor.

If I run this:

gswin64c.exe -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=out.pdf test.pdf

I get:

Processing pages 1 through 2.
Page 1
Page 2

So, you can probably monitor the number of pages processed, which you can use to show the progress.
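As a rough sketch of that idea (not how OCRmyPDF actually invokes Ghostscript), the "Processing pages ..." and "Page N" lines could be parsed from the process output as they arrive. The function names `iter_page_progress` and `run_ghostscript_with_progress` are hypothetical, and the patterns assume the output format shown above, which may differ between Ghostscript versions:

```python
import re
import subprocess
from typing import Iterable, Iterator, Optional, Tuple

# Patterns are based only on the Ghostscript output quoted above (an assumption);
# other versions or devices may print something different.
_RANGE_RE = re.compile(r"^Processing pages (\d+) through (\d+)\.")
_PAGE_RE = re.compile(r"^Page (\d+)$")


def iter_page_progress(lines: Iterable[str]) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield (current, total) each time a 'Page N' line is seen.

    `current` is the position of the page within the announced range and
    `total` is the size of that range, or None if no range line was seen.
    """
    first, last = 1, None
    for raw in lines:
        line = raw.strip()
        m = _RANGE_RE.match(line)
        if m:
            first, last = int(m.group(1)), int(m.group(2))
            continue
        m = _PAGE_RE.match(line)
        if m:
            current = int(m.group(1)) - first + 1
            total = None if last is None else last - first + 1
            yield current, total


def run_ghostscript_with_progress(args, report=print):
    """Hypothetical wrapper: run a Ghostscript command and report each page."""
    with subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for current, total in iter_page_progress(proc.stdout):
            report(f"Page {current} of {total if total is not None else '?'}")
    return proc.returncode
```

For example, `run_ghostscript_with_progress(["gswin64c.exe", "-sDEVICE=pdfwrite", "-dBATCH", "-dNOPAUSE", "-sOutputFile=out.pdf", "test.pdf"])` would print one line per page. A single slow page would still show no movement, so this only gives page-level granularity.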

user1823 commented 2 months ago

I am now using v16.3.0 and it seems that the changes made in https://github.com/ocrmypdf/OCRmyPDF/commit/950c700274299c016a529b0e552bb0b3bda6da66 or https://github.com/ocrmypdf/OCRmyPDF/commit/9a3c5a3f7cc863bbd2fdaebbed375a7a526b8e43 have resulted in a bug.

The progress bar in "OCR" says 1182 out of 591.

Also, the following step takes too much time:

What is ocrmypdf doing at this stage? Can we have a progress bar for this too?

jbarlow83 commented 2 months ago

Thanks for the "OCR" progress bar issue report. It's fixed.

After "Total file size..." nothing is happening except copying the finished file from temporary storage to its final output location. Unless you're dealing with very large PDFs (GBs), this suggests network issues or file system contention. How long is "too much time?"

user1823 commented 2 months ago

except copying the finished file from temporary storage to its final output location.

Probably also cleaning up all the temp files generated (e.g., the images).

When ocrmypdf is at this step, I can see the output file in the target directory (with the correct filesize, which means that it is likely not just a placeholder). So, I think that cleaning the temp files is actually what is taking the time.

How long is "too much time?"

Maybe 2-3 minutes. That is not much compared to the total time taken, but it feels like too much when you don't know what is happening or how long it is going to last. So, adding a progress indicator here would also be nice.
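If temp-file cleanup really is the slow part (that is only my guess, not confirmed), progress could be reported by deleting the files one at a time instead of removing the whole tree in one call. A minimal sketch, where `remove_tree_with_progress` and the `report` callback are hypothetical names and not part of ocrmypdf:

```python
import os
from typing import Callable


def remove_tree_with_progress(root: str, report: Callable[[int, int], None]) -> int:
    """Delete `root` recursively, calling report(removed, total) after each file.

    Assumes the tree contains only regular files and directories (no symlinks).
    """
    # First pass: count files so the callback can show "removed of total".
    total = sum(len(files) for _, _, files in os.walk(root))
    removed = 0
    # Walk bottom-up so each directory is already empty when it is removed.
    for dirpath, _dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.remove(os.path.join(dirpath, name))
            removed += 1
            report(removed, total)
        os.rmdir(dirpath)
    return removed
```

The extra counting pass costs one more directory walk, which should be small compared to the deletions themselves.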