ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

Background removal too aggressive #624

Open thibaultmol opened 3 years ago

thibaultmol commented 3 years ago

I'm giving $20/year on Open Collective.

Describe the bug

On this PDF (I extracted the first page), when I use background removal, it also removes part of the image.

To Reproduce

What command line or API call were you trying to run?

ocrmypdf --clean --clean-final --rotate-pages --deskew --remove-background testfile.pdf testfile-output.pdf 

Run with verbosity -v1 or higher to see more detailed logging. This information may be helpful.

Example file

If your issue is a problem that affects only certain files, we will require an input file (PDF or image) that demonstrates your issue.

Please provide an input file with no personal or confidential information.

https://drive.google.com/file/d/18CWK_01lrd0sCNWqFs26K1aLT2zGuRuf/view?usp=sharing
https://drive.google.com/file/d/1UrOA-_Ex_7Z3llp8VuHjrYO9GuzFig8t/view?usp=sharing

Log:

~/Scanner >>> ocrmypdf --clean --clean-final --rotate-pages --deskew --remove-background -v1 testfile.pdf testfile-output.pdf [2]
ocrmypdf 11.0.1
Running: ['tesseract', '--list-langs']
No language specified; assuming --language eng
Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['tesseract', '-l', 'eng', '--print-parameters', 'pdf']
Running: ['gs', '--version']
Found gs 9.52
pikepdf mmap enabled
os.symlink(testfile.pdf, /tmp/com.github.ocrmypdf.te0_eb9l/origin)
os.symlink(/tmp/com.github.ocrmypdf.te0_eb9l/origin, /tmp/com.github.ocrmypdf.te0_eb9l/origin.pdf)
pikepdf mmap enabled
Scanning contents: 100%|██████████| 1/1 [00:00<00:00, 264.22page/s]
Using Tesseract OpenMP thread limit 3
pikepdf mmap enabled
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.te0_eb9l/origin.pdf']
1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/com.github.ocrmypdf.te0_eb9l/000001_rasterize_preview.jpg', 'stdout']
1 page is facing ⇧, confidence 1.59 - no change
1 Rasterize with png16m
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/com.github.ocrmypdf.te0_eb9l/origin.pdf']
1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/com.github.ocrmypdf.te0_eb9l/000001_pp_deskew.png', '/tmp/tmpqrtkq7h5/output.ppm']
1 None
1 resolution (150, 150)
1 convert
1
imgformat = PNG
1 input dpi = 150 x 150
1 rotation = 0°
1 input colorspace = RGB
1 width x height = 1275px x 1750px
1 read_images() embeds a PNG
1 convert done
1 Running: ['tesseract', '-l', 'eng', '-c', 'textonly_pdf=1', PosixPath('/tmp/com.github.ocrmypdf.te0_eb9l/000001_ocr.png'), '/tmp/com.github.ocrmypdf.te0_eb9l/000001_ocr_tess', 'pdf', 'txt']
Emplacement update
Rotations for page 0: [text, auto, misalign, content] = 0, 0, 0, 0
Grafting OCR: 100%|██████████| 1.0/1.0 [00:03<00:00, 3.34s/page]
os.symlink(/tmp/com.github.ocrmypdf.te0_eb9l/graft_layers.pdf, /tmp/com.github.ocrmypdf.te0_eb9l/fix_docinfo.pdf)
Running: ['gs', '-dQUIET', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=RGB', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/com.github.ocrmypdf.te0_eb9l/fix_docinfo.pdf', '/tmp/com.github.ocrmypdf.te0_eb9l/pdfa.ps']
XrefExt(xref=20, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
JPEGs: 0image [00:00, ?image/s]
Optimizable images: JBIG2 groups: (0,)
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/com.github.ocrmypdf.te0_eb9l/optimize.opt.pdf, /tmp/com.github.ocrmypdf.te0_eb9l/optimize.pdf)
/tmp/com.github.ocrmypdf.te0_eb9l/optimize.pdf -> testfile-output.pdf
Output file is a PDF/A-2B (as expected)

Links to files hosted elsewhere are perfectly acceptable. You could also look in tests/resources and see if any of those files reproduce your issue.

(Issues without example files usually cannot be resolved. It's like reporting an issue against a web browser without providing a URL.)

Expected behavior

A clear and concise description of what you expected to happen.

Have the center image be unaffected by the background removal.

Operating System: Manjaro Linux
KDE Plasma Version: 5.19.4
KDE Frameworks Version: 5.73.0
Qt Version: 5.15.0
Kernel Version: 5.7.17-2-MANJARO
OS Type: 64-bit
Processors: 16 × AMD Ryzen 7 3700X 8-Core Processor
Memory: 31,3 GiB of RAM
Graphics Processor: GeForce GTX 1080/PCIe/SSE2

Installed using Yay

jbarlow83 commented 3 years ago

Thank you for your support!

This is a tricky image given the low contrast and use of the background color inside and outside the foreground. I suppose I could add some tuning parameters to make this adjustable (rather than numbers that work well in my experience), but I don't think there's a way to get this to work automatically for all images.

Or if you don't mind temporarily editing ocrmypdf, you could change src/ocrmypdf/_pipeline.py:459 to:

remove_background(input_file, output_file, tile_size=(150, 220), black_threshold=70, white_threshold=230)

(I found parameters that work for your image.)
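For readers wondering why these thresholds matter on a low-contrast scan, here is a minimal sketch of threshold-based level stretching. It is not OCRmyPDF's actual code. The function stretch_levels is hypothetical, the parameter names are borrowed from the call above, and their meaning here (grayscale 0-255, values at or above white_threshold pushed to pure white) is an assumption for illustration. tile_size is not modeled. The value 190 is a made-up "aggressive" setting chosen only for contrast with 230.

```python
def stretch_levels(pixels, black_threshold=70, white_threshold=230):
    """Hypothetical illustration: remap grayscale values (0-255).

    Values <= black_threshold become 0, values >= white_threshold become 255,
    and everything in between is stretched linearly across 0-255.
    """
    if not 0 <= black_threshold < white_threshold <= 255:
        raise ValueError("need 0 <= black_threshold < white_threshold <= 255")
    span = white_threshold - black_threshold
    result = []
    for value in pixels:
        if value <= black_threshold:
            result.append(0)      # treated as ink
        elif value >= white_threshold:
            result.append(255)    # treated as background and wiped to white
        else:
            result.append(round((value - black_threshold) * 255 / span))
    return result


# One row of a scan: dark text, mid gray, a light part of a picture, paper.
row = [60, 110, 200, 235, 250]

# Assumed "aggressive" setting: the light picture pixel (200) is wiped out.
aggressive = stretch_levels(row, black_threshold=70, white_threshold=190)

# The thresholds from the suggested call keep that pixel below pure white.
gentle = stretch_levels(row, black_threshold=70, white_threshold=230)

print(aggressive)
print(gentle)
```

In this toy model a lower white threshold erases light foreground that shares tones with the paper, which is the kind of trade-off that makes a single automatic setting hard to get right for every image.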