holtwick / pdfify

Issue tracking for PDFify. To upvote features, give a 👍
https://pdfify.app/future?ref=github&kw=start

Text scanning (OCR) not working after scan #67

Closed dieTausendsassa closed 2 years ago

dieTausendsassa commented 3 years ago

Please describe how the error can be reproduced:
1) Scan a document/page from a scanner
2) Go to "Texterkennung" (text recognition)
3) Text scanning (OCR) doesn't work :-(

After saving and reopening the scanned document/page the OCR function works ...

App Version Info: de.holtwick.mac.PDFify@3.3.2+130 WGAWN9BYE50VF9GW45BPVN60WG

luettfite commented 3 years ago

I can confirm this, same issue here.

holtwick commented 2 years ago

Another scanning issue. Internal link: https://1b.replies.io/#182916/threads/1373311

czux commented 2 years ago

Is there an update on this? I found out about the problem after scanning roughly 3000 pages, only to find there's no OCR text in the PDFs. The nasty thing is that after a scan, OCR seems to be working okay, but it's not. The log file reports "OCR done..."

When opening a PDF directly Tesseract will do a proper OCR.
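For anyone who needs to find out which of a large batch of scans ended up without a text layer, here is a minimal sketch in Python. It assumes Poppler's `pdftotext` command-line tool is installed; the function names and the `min_chars` threshold are illustrative choices, not anything PDFify provides.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List


def extract_text(pdf_path: Path) -> str:
    """Return the text layer of a PDF via Poppler's pdftotext ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def pdfs_without_text(
    paths: Iterable[Path],
    extract: Callable[[Path], str] = extract_text,
    min_chars: int = 20,  # assumed threshold; tune it for your documents
) -> List[Path]:
    """List PDFs whose extracted text has fewer than min_chars non-space characters."""
    missing = []
    for path in paths:
        text = extract(path)
        # pdftotext emits form feeds between pages; they count as whitespace here.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            missing.append(path)
    return missing
```

Calling `pdfs_without_text(sorted(Path("scans").glob("*.pdf")))` on a folder of scans (the folder name is just an example) would return the files that probably still need OCR. Pages that are genuinely blank will show up as false positives.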

holtwick commented 2 years ago

This should be fixed in the next release. Please verify by using the beta version: https://pdfify.app/help#beta

luettfite commented 2 years ago

I tested with beta version 3.3.5-144. Unfortunately the issue is still the same. OCR doesn't work directly after scan. I have to save, close and open the document again. After this OCR works. The problem seems to be related to the type of the scan. I have this problem if I use the types "Text" or "Farbe" (Color) in the scanner dialog. If I use type "Schwarzweiß" (B&W), then OCR works on the fly, as expected.

czux commented 2 years ago

I can confirm all of this. I tested with version 3.3.5-142 and have experienced the exact same thing. Selecting "Schwarzweiß" (Greyscale) in the Scanner Dialog delivers an OCR as expected. Selecting "Farbe" (Color) or "Text" (1-bit B&W) leaves me with no OCR at all. I'm using an EPSON WorkForce Pro WF-C5710 DWF. @luettfite what Scanner do you use?

luettfite commented 2 years ago

@czux EPSON Stylus Office BX625FWD with EPSON SO BX620FWD Series driver version 10.85

czux commented 2 years ago

When using EPSON Scan to scan the documents first (b/w, grayscale, color) and then feeding them to PDFify, OCR works as expected.

Only when doing the Scan from PDFify itself, it will fail for 2 of the 3 available color options.

holtwick commented 2 years ago

Thanks for the details, this helps a lot. I tested on a Canon and was not able to reproduce the problem, even when using different modes. Black & white seems to have general issues and is not working for me at all. To me this looks like a macOS bug. However, could you please share your specific scanner options and the OCR engine used? Thanks!

[Screenshot: 20220311-111615-capture-holtwick@2x]

luettfite commented 2 years ago

I use Tesseract OCR German: [screenshot: 1_Tessact_OCR_DE]

My settings in the scanner dialog, where OCR after scan succeeds: [screenshot: 3_Scan_Settings_Grey]

My settings in the scanner dialog, where OCR after scan fails: [screenshots: 2_Scan_Settings_Text, 4_Scan_Settings_Color]

I also tested with other DPI values, but there was no difference in the OCR behavior.

Then I did another test and changed the engine to Apple Vision OCR German: [screenshot: 5_AppleVision_OCR_DE]

Now OCR after scan did work for all three types. So the OCR engine seems to make a difference too.

czux commented 2 years ago

For my setup it is exactly as @luettfite described it. I use the same settings and can confirm all of the observations.

holtwick commented 2 years ago

Thanks for the feedback. I'm still investigating, your comments will help.

czux commented 2 years ago

Using 3.3.5 (145) with Tesseract OCR I can still confirm all of the above.

Assuming that the internal image passing from scan into PDFify is happening in TIFF format, I tried the following. Using only the macOS scan dialog outside PDFify, I produced a color and a grayscale PDF as well as a color and a grayscale TIFF.

These are the results when dragging each of those into PDFify (only one at a time of course):

PDF color: OCR okay
PDF grayscale: OCR okay
TIFF color: no OCR
TIFF grayscale: OCR okay

In the case of the color TIFF, the OCR progress bar only flickers for a fraction of a second. In all other cases it takes a few seconds and shows proper OCR progress.

Feeding the above test files into OCRmyPDF (which also uses Tesseract) produces searchable PDFs in all four cases.

Maybe that is a useful piece of info to solve this issue.
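To compare the failing color TIFF with the working grayscale one, it may help to look at the tags that describe the pixel format. The following is a minimal sketch using only the Python standard library. It reads four baseline TIFF tags from the first image directory; the function name is made up for illustration, and nothing here is known about how PDFify passes images internally.

```python
import struct
from typing import Dict, Tuple

# Baseline TIFF 6.0 tag numbers that describe the pixel format.
TAGS = {
    258: "BitsPerSample",
    259: "Compression",
    262: "PhotometricInterpretation",
    277: "SamplesPerPixel",
}
# struct format code and byte size for the field types these tags use.
TYPE_SIZES = {3: ("H", 2), 4: ("I", 4)}  # 3 = SHORT, 4 = LONG


def read_tiff_tags(data: bytes) -> Dict[str, Tuple[int, ...]]:
    """Return selected tags from the first image directory (IFD) of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")
    (entry_count,) = struct.unpack_from(order + "H", data, ifd_offset)
    found = {}
    for i in range(entry_count):
        pos = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, ftype, count = struct.unpack_from(order + "HHI", data, pos)
        if tag not in TAGS or ftype not in TYPE_SIZES:
            continue
        code, size = TYPE_SIZES[ftype]
        value_pos = pos + 8  # values of up to 4 bytes are stored inline
        if count * size > 4:
            # Larger values live elsewhere; the field holds their file offset.
            (value_pos,) = struct.unpack_from(order + "I", data, value_pos)
        found[TAGS[tag]] = struct.unpack_from(f"{order}{count}{code}", data, value_pos)
    return found
```

Running `read_tiff_tags(open("scan.tiff", "rb").read())` on both files (the file name is a placeholder) shows whether they differ in more than color mode. In the TIFF specification, a PhotometricInterpretation of 1 means grayscale (BlackIsZero) and 2 means RGB; Compression 1 is uncompressed, 5 is LZW and 7 is JPEG.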

holtwick commented 2 years ago

I modified some things in the scanner dialog and think it should work better now. Please try the beta. See https://pdfify.app/en/help#beta

luettfite commented 2 years ago

Sorry for the delay. I tested again with beta Version 3.5b4 (161). Unfortunately there is still the same issue for B&W and Color mode.

czux commented 2 years ago

I get a proper OCR with version 3.4 (153) using my EPSON WF-C5710 printer/scanner which failed to OCR before.