Closed eduardodataeasy closed 2 years ago
This looks like a problem in Windows Python 3.9 that is solved in Python 3.10. I suggest trying 3.10.
https://github.com/python/cpython/pull/24793
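If it helps to check an environment before running a batch, here is a minimal sketch of a guard for the combination discussed here. It assumes, as suggested above, that the fix is available from Python 3.10 onward; whether any later 3.9 patch release also carries it is not stated in this thread. The function name `is_affected_interpreter` is hypothetical and not part of ocrmypdf or the standard library.

```python
import sys


def is_affected_interpreter(version_info=sys.version_info, platform=sys.platform):
    """Return True for Windows interpreters older than Python 3.10.

    Assumption: the temporary-directory cleanup problem is fixed in 3.10,
    as suggested in this thread. This only compares version numbers; it
    does not probe for the bug itself.
    """
    return platform == "win32" and tuple(version_info[:2]) < (3, 10)


if is_affected_interpreter():
    print("Windows with Python < 3.10 detected; consider upgrading to 3.10.")
```

The parameters default to the running interpreter, so the same function can be called with explicit values, for example `is_affected_interpreter((3, 9, 7), "win32")`, to check a different target.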
On Sep 22, 2022, at 06:03, eduardodataeasy @.***> wrote:
When I try to OCR a specific file it shows the following error log:
ocrmypdf --force-ocr --optimize 0 --fast-web-view 0 --output-type pdf -l por -v 1 --deskew --remove-background --clean "D:\applications\dotNet\EasyMidia\TESTE_OCR\IN\PROCESSADO_NUANCE_317311740_1_1.PDF" "D:\applications\dotNet\EasyMidia\TESTE_OCR\OUT\REPROCESSADO_NUANCE_317311740_1_2_teste.PDF" [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado ocrmypdf 12.0.1 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--list-langs'] stdout/stderr = List of available languages (166): afr amh ara asm aze aze_cyrl bel ben bod bos bre bul cat ceb ces chi_sim chi_sim_vert chi_tra chi_tra_vert chr cos cym dan dan_frak deu deu_frak div dzo ell eng enm epo equ est eus fao fas fil fin fra frk frm fry gla gle glg grc guj hat heb hin hrv hun hye iku ind isl ita ita_old jav jpn jpn_vert kan kat kat_old kaz khm kir kmr kor kor_vert lao lat lav lit ltz mal mar mkd mlt mon mri msa mya nep nld nor oci ori osd pan pol por pus que ron rus san script/Arabic script/Armenian script/Bengali script/Canadian_Aboriginal script/Cherokee script/Cyrillic script/Devanagari script/Ethiopic script/Fraktur script/Georgian script/Greek script/Gujarati script/Gurmukhi script/HanS script/HanS_vert script/HanT script/HanT_vert script/Hangul script/Hangul_vert script/Hebrew script/Japanese script/Japanese_vert script/Kannada script/Khmer script/Lao script/Latin script/Malayalam script/Myanmar script/Oriya script/Sinhala script/Syriac script/Tamil script/Telugu script/Thaana script/Thai script/Tibetan script/Vietnamese sin slk slk_frak slv snd spa spa_old sqi srp srp_latn sun swa swe syr tam tat tel tgk tgl tha tir ton tur uig ukr urd uzb uzb_cyrl vie yid yor
Running: ['C:\unpaper\unpaper.EXE', '--version'] Found unpaper 6.2 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--version'] Found tesseract 5.0.0-alpha.20210506 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '--version'] Found gs 9.54.0 Scanning contents: 0%| | 0/6 [00:00<?, ?page/s][WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6.55page/s] Using Tesseract OpenMP thread limit 1 Start processing 6 pages concurrently OCR: 0%| | 0.0/6.0 [00:00<?, ?page/s][WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado 2 Rasterize with pnggray, rotation 0 3 Rasterize with pngmono, rotation 0 1 Rasterize with png16m, rotation 0 4 Rasterize with pngmono, rotation 0 2 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=2', '-dLastPage=2', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 
'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 3 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=3', '-dLastPage=3', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 1 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 4 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=4', '-dLastPage=4', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 5 Rasterize with pngmono, rotation 0 6 Rasterize with pnggray, rotation 0 5 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=5', '-dLastPage=5', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 6 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=6', '-dLastPage=6', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 3 STREAM b'IHDR' 16 13 3 STREAM b'iCCP' 41 2296 3 iCCP profile name b'default_gray.icc' 3 Compression method 0 3 STREAM b'pHYs' 2349 9 3 STREAM b'tEXt' 2370 31 3 STREAM b'IDAT' 2413 8192 3 Rotating output by 0 2 STREAM b'IHDR' 16 13 2 STREAM b'iCCP' 41 2296 2 iCCP profile name b'default_gray.icc' 2 Compression method 0 2 STREAM b'pHYs' 2349 9 
2 STREAM b'tEXt' 2370 31 2 STREAM b'IDAT' 2413 8192 2 Rotating output by 0 4 STREAM b'IHDR' 16 13 4 STREAM b'iCCP' 41 2296 4 iCCP profile name b'default_gray.icc' 4 Compression method 0 4 STREAM b'pHYs' 2349 9 4 STREAM b'tEXt' 2370 31 4 STREAM b'IDAT' 2413 8192 4 Rotating output by 0 6 STREAM b'IHDR' 16 13 6 STREAM b'iCCP' 41 2296 6 iCCP profile name b'default_gray.icc' 6 Compression method 0 6 STREAM b'pHYs' 2349 9 6 STREAM b'tEXt' 2370 31 6 STREAM b'IDAT' 2413 8192 6 Rotating output by 0 5 STREAM b'IHDR' 16 13 5 STREAM b'iCCP' 41 2296 5 iCCP profile name b'default_gray.icc' 5 Compression method 0 5 STREAM b'pHYs' 2349 9 5 STREAM b'tEXt' 2370 31 5 STREAM b'IDAT' 2413 8192 5 Rotating output by 0 3 background removal skipped on mono page 3 background removal skipped on mono page 4 background removal skipped on mono page 5 background removal skipped on mono page 3 STREAM b'IHDR' 16 13 3 STREAM b'pHYs' 41 9 3 STREAM b'IDAT' 62 8192 4 background removal skipped on mono page 3 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpubwxtqst\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpubwxtqst\output.pbm'] 5 background removal skipped on mono page 4 STREAM b'IHDR' 16 13 4 STREAM b'pHYs' 41 9 4 STREAM b'IDAT' 62 8192 6 STREAM b'IHDR' 16 13 6 STREAM b'pHYs' 41 9 6 STREAM b'IDAT' 62 8192 5 STREAM b'IHDR' 16 13 5 STREAM b'pHYs' 41 9 5 STREAM b'IDAT' 62 8192 4 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpl5ohj40q\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpl5ohj40q\output.pbm'] 6 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '99.943004', '--layout', 'none', '--mask-scan-size', 
'100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'] 2 STREAM b'IHDR' 16 13 2 STREAM b'pHYs' 41 9 2 STREAM b'IDAT' 62 8192 5 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpbh3g3se7\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpbh3g3se7\output.pbm'] 2 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '99.943004', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmp4i1yrr38\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmp4i1yrr38\output.pgm'] 6 stdout/stderr = unpaper 6.2 License GPLv2: GNU GPL version 2. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Processing sheet #1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm -> C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm input-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm output-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm sheet size: 826x1169 ... noise-filter ... deleted 107 clusters. blur-filter... deleted 0 pixels. writing output.
OCR: 0%| | 0.0/6.0 [00:02<?, ?page/s]
An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\python39\lib\shutil.py", line 616, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] O arquivo já está sendo usado por outro processo: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\python39\lib\tempfile.py", line 801, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] O arquivo já está sendo usado por outro processo: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\python39\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 189, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 158, in make_intermediate_images
    ocr_image = preprocess(
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 105, in preprocess
    image = preprocess_clean(image, page_context)
  File "c:\python39\lib\site-packages\ocrmypdf\_pipeline.py", line 486, in preprocess_clean
    unpaper.clean(
  File "c:\python39\lib\site-packages\ocrmypdf\_exec\unpaper.py", line 134, in clean
    run(input_file, output_file, dpi=dpi, mode_args=unpaper_args)
  File "c:\python39\lib\site-packages\ocrmypdf\_exec\unpaper.py", line 100, in run
    raise SubprocessOutputError(
  File "c:\python39\lib\tempfile.py", line 826, in __exit__
    self.cleanup()
  File "c:\python39\lib\tempfile.py", line 830, in cleanup
    self._rmtree(self.name)
  File "c:\python39\lib\tempfile.py", line 812, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "c:\python39\lib\shutil.py", line 740, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\python39\lib\shutil.py", line 618, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "c:\python39\lib\tempfile.py", line 804, in onerror
    cls._rmtree(path)
  File "c:\python39\lib\tempfile.py", line 812, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "c:\python39\lib\shutil.py", line 740, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\python39\lib\shutil.py", line 599, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "c:\python39\lib\shutil.py", line 596, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] O nome do diretório é inválido: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 271, in exec_concurrent
    executor(
  File "c:\python39\lib\site-packages\ocrmypdf\_concurrent.py", line 82, in __call__
    self._execute(
  File "c:\python39\lib\site-packages\ocrmypdf\builtin_plugins\concurrency.py", line 132, in _execute
    for result in results:
  File "c:\python39\lib\multiprocessing\pool.py", line 870, in next
    raise value
NotADirectoryError: [Errno 20] O nome do diretório é inválido: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
Test file: 317311740_1_1.PDF
So in theory I just need to update to Python 3.10. I'll try it and let you know the result. I didn't quite understand the part about updating ocrmypdf, but I will try.
Yes, just update to Python 3.10. The version of ocrmypdf shouldn't matter much for this issue.
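For anyone who cannot upgrade right away: the Python 3.9 traceback starts with PermissionError [WinError 32] (the file is being used by another process) on output.pgm, and the final NotADirectoryError [WinError 267] (the directory name is invalid) is raised while tempfile is handling that first error. Below is a rough sketch of a retrying cleanup for temporary directories you manage yourself. `remove_tree_with_retry` is a hypothetical helper, and the attempt count and delay are arbitrary assumptions; this is a general workaround pattern, not what ocrmypdf does internally.

```python
import shutil
import tempfile
import time
from pathlib import Path


def remove_tree_with_retry(path, attempts=5, delay=0.2):
    """Delete a directory tree, retrying while a file in it is still in use.

    Returns True once the tree is gone, False if every attempt failed.
    """
    for _ in range(attempts):
        try:
            shutil.rmtree(path)
            return True
        except FileNotFoundError:
            return True  # nothing left to remove
        except PermissionError:
            # Another process may still hold a handle (WinError 32 on Windows);
            # wait briefly and try again.
            time.sleep(delay)
    return False


# Usage: create a scratch directory with one file in it, then clean it up.
workdir = Path(tempfile.mkdtemp())
(workdir / "output.pgm").write_bytes(b"P5 1 1 255\n\x00")
removed = remove_tree_with_retry(workdir)
```

Retrying only helps if the other process releases the file within the retry window; otherwise the function returns False and the directory is left in place.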
ocrmypdf --force-ocr --optimize 0 --fast-web-view 0 --output-type pdf -l por -v 1 --deskew --remove-background --clean "D:\applications\dotNet\EasyMidia\TESTE_OCR\IN\PROCESSADO_NUANCE_317311740_0_1.pdf" "D:\applications\dotNet\EasyMidia\TESTE_OCR\OUT\REPROCESSADO_NUANCE_317311740_1_3_teste.PDF" ocrmypdf 14.0.0 Running: ['C:\unpaper\unpaper.EXE', '--version'] Found unpaper 6.2 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--version'] Found tesseract 5.0.0-alpha.20210506 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--version'] Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '--version'] Found gs 9.54.0 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '--version'] Running: ['C:\Tesseract-OCR\tesseract.EXE', '--list-langs'] stdout/stderr = List of available languages (166): afr amh ara asm aze aze_cyrl bel ben bod bos bre bul cat ceb ces chi_sim chi_sim_vert chi_tra chi_tra_vert chr cos cym dan dan_frak deu deu_frak div dzo ell eng enm epo equ est eus fao fas fil fin fra frk frm fry gla gle glg grc guj hat heb hin hrv hun hye iku ind isl ita ita_old jav jpn jpn_vert kan kat kat_old kaz khm kir kmr kor kor_vert lao lat lav lit ltz mal mar mkd mlt mon mri msa mya nep nld nor oci ori osd pan pol por pus que ron rus san script/Arabic script/Armenian script/Bengali script/Canadian_Aboriginal script/Cherokee script/Cyrillic script/Devanagari script/Ethiopic script/Fraktur script/Georgian script/Greek script/Gujarati script/Gurmukhi script/HanS script/HanS_vert script/HanT script/HanT_vert script/Hangul script/Hangul_vert script/Hebrew script/Japanese script/Japanese_vert script/Kannada script/Khmer script/Lao script/Latin script/Malayalam script/Myanmar script/Oriya script/Sinhala script/Syriac script/Tamil script/Telugu script/Thaana script/Thai script/Tibetan script/Vietnamese sin slk slk_frak slv snd spa spa_old sqi srp srp_latn sun swa swe syr tam tat tel tgk tgl tha tir ton tur uig ukr urd uzb uzb_cyrl vie yid yor
Opened a file Scanning contents: 100%|██████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 333.30page/s] Using Tesseract OpenMP thread limit 1 Start processing 6 pages concurrently Opened a file 1 Rasterize with png16m, rotation 0 2 Rasterize with pnggray, rotation 0 3 Rasterize with pngmono, rotation 0 4 Rasterize with pngmono, rotation 0 5 Rasterize with pngmono, rotation 0 6 Rasterize with pnggray, rotation 0 1 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 2 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=2', '-dLastPage=2', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 3 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=3', '-dLastPage=3', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 5 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=5', '-dLastPage=5', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 6 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pnggray', '-dFirstPage=6', '-dLastPage=6', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', 
'-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 4 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=4', '-dLastPage=4', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\origin.pdf'] 3 Rotating output by 0 2 Rotating output by 0 6 Rotating output by 0 5 Rotating output by 0 4 Rotating output by 0 3 background removal skipped on mono page 3 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000003_rasterize.png', 'stdout'] 4 background removal skipped on mono page 5 background removal skipped on mono page 4 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000004_rasterize.png', 'stdout'] 5 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000005_rasterize.png', 'stdout'] 3 background removal skipped on mono page 3 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000003_rasterize.png', 'stdout'] 4 background removal skipped on mono page 4 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000004_rasterize.png', 'stdout'] 5 background removal skipped on mono page 5 Running: ['C:\Tesseract-OCR\tesseract.EXE', '-l', 'por', '--psm', '2', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000005_rasterize.png', 'stdout'] 1 Rotating output by 0 3 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', 
'--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000003_pp_deskew.png', 'C:\Users\eduar\AppData\Local\Temp\tmpdhsw8f1x\output.pnm'] 4 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000004_pp_deskew.png', 'C:\Users\eduar\AppData\Local\Temp\tmp4owguc1k\output.pnm'] 5 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000005_pp_deskew.png', 'C:\Users\eduar\AppData\Local\Temp\tmpba2c5vul\output.pnm'] 3 stdout/stderr = unpaper 6.2 License GPLv2: GNU GPL version 2. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Processing sheet #1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000003_pp_deskew.png -> C:\Users\eduar\AppData\Local\Temp\tmpdhsw8f1x\output.pnm input-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000003_pp_deskew.png output-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpdhsw8f1x\output.pnm sheet size: 2480x3509 ... noise-filter ... deleted 0 clusters. blur-filter... deleted 0 pixels. writing output.
3 resolution (299.9994, 299.9994)
3 convert
3 PIL format = PNG
3 imgformat = PNG
3 input dpi = 300 x 300
3 rotation = 0°
3 input colorspace = 1
3 width x height = 2480px x 3509px
3 read_images() embeds a PNG
3 convert done
3 Running: ['C:\\Tesseract-OCR\\tesseract.EXE', '-l', 'por', '-c', 'textonly_pdf=1', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000003_ocr.png', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000003_ocr_tess', 'pdf', 'txt']
4 stdout/stderr = unpaper 6.2
License GPLv2: GNU GPL version 2. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Processing sheet #1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000004_pp_deskew.png -> C:\Users\eduar\AppData\Local\Temp\tmp4owguc1k\output.pnm input-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000004_pp_deskew.png output-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmp4owguc1k\output.pnm sheet size: 2480x3509 ... noise-filter ... deleted 0 clusters. blur-filter... deleted 0 pixels. writing output.
4 resolution (299.9994, 299.9994)
4 convert
4 PIL format = PNG
4 imgformat = PNG
4 input dpi = 300 x 300
4 rotation = 0°
4 input colorspace = 1
4 width x height = 2480px x 3509px
4 read_images() embeds a PNG
4 convert done
4 Running: ['C:\\Tesseract-OCR\\tesseract.EXE', '-l', 'por', '-c', 'textonly_pdf=1', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000004_ocr.png', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000004_ocr_tess', 'pdf', 'txt']
5 stdout/stderr = unpaper 6.2
License GPLv2: GNU GPL version 2. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
Processing sheet #1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000005_pp_deskew.png -> C:\Users\eduar\AppData\Local\Temp\tmpba2c5vul\output.pnm input-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.216t7n56\000005_pp_deskew.png output-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpba2c5vul\output.pnm sheet size: 2480x3509 ... noise-filter ... deleted 0 clusters. blur-filter... deleted 0 pixels. writing output.
5 resolution (299.9994, 299.9994)
5 convert
5 PIL format = PNG
5 imgformat = PNG
5 input dpi = 300 x 300
5 rotation = 0°
5 input colorspace = 1
5 width x height = 2480px x 3509px
5 read_images() embeds a PNG
5 convert done
5 Running: ['C:\\Tesseract-OCR\\tesseract.EXE', '-l', 'por', '-c', 'textonly_pdf=1', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000005_ocr.png', 'C:\\Users\\eduar\\AppData\\Local\\Temp\\ocrmypdf.io.216t7n56\\000005_ocr_tess', 'pdf', 'txt']
OCR: 0%| | 0.0/6.0 [00:09<?, ?page/s]
An exception occurred while executing the pipeline
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\ocrmypdf\_sync.py", line 393, in run_pipeline
    optimize_messages = exec_concurrent(context, executor)
  File "C:\Python310\lib\site-packages\ocrmypdf\_sync.py", line 280, in exec_concurrent
    executor(
  File "C:\Python310\lib\site-packages\ocrmypdf\_concurrent.py", line 87, in __call__
    self._execute(
  File "C:\Python310\lib\site-packages\ocrmypdf\builtin_plugins\concurrency.py", line 141, in _execute
    result = future.result()
  File "C:\Python310\lib\concurrent\futures\_base.py", line 438, in result
    return self.__get_result()
  File "C:\Python310\lib\concurrent\futures\_base.py", line 390, in __get_result
    raise self._exception
  File "C:\Python310\lib\concurrent\futures\thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "C:\Python310\lib\site-packages\ocrmypdf\_sync.py", line 196, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "C:\Python310\lib\site-packages\ocrmypdf\_sync.py", line 139, in make_intermediate_images
    preprocess_out = preprocess(
  File "C:\Python310\lib\site-packages\ocrmypdf\_sync.py", line 108, in preprocess
    image = preprocess_remove_background(image, page_context)
  File "C:\Python310\lib\site-packages\ocrmypdf\_pipeline.py", line 477, in preprocess_remove_background
    raise NotImplementedError("--remove-background is temporarily not implemented")
NotImplementedError: --remove-background is temporarily not implemented
The problem seems to be specific to this file, since other files are OCRed normally. Is it a bug in the file?
I updated to Python 3.10 and also upgraded ocrmypdf from 12 to 14. After that, the --remove-background option stopped working. I removed it and the problem disappeared.
That makes sense. Unfortunately, I still have not had a chance to replace --remove-background.
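For scripts that launch ocrmypdf, a small sketch of building the command line from this thread with --remove-background made optional. The flags are the ones used in the commands above; `build_ocrmypdf_args` and the example file names are hypothetical, and the default of leaving the flag out reflects only what was observed here with ocrmypdf 14.0.0.

```python
def build_ocrmypdf_args(input_pdf, output_pdf, remove_background=False):
    """Build the argument list used in this thread.

    remove_background defaults to False because ocrmypdf 14.0.0 raised
    NotImplementedError for --remove-background in the log above.
    """
    args = [
        "ocrmypdf",
        "--force-ocr",
        "--optimize", "0",
        "--fast-web-view", "0",
        "--output-type", "pdf",
        "-l", "por",
        "-v", "1",
        "--deskew",
        "--clean",
    ]
    if remove_background:
        args.append("--remove-background")
    # Input and output paths go last, as in the original command.
    args += [str(input_pdf), str(output_pdf)]
    return args


# Hypothetical file names, for illustration only.
command = build_ocrmypdf_args("in.pdf", "out.pdf")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where ocrmypdf is installed.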
When I try to OCR a specific file it shows the following error log:
ocrmypdf --force-ocr --optimize 0 --fast-web-view 0 --output-type pdf -l por -v 1 --deskew --remove-background --clean "D:\applications\dotNet\EasyMidia\TESTE_OCR\IN\PROCESSADO_NUANCE_317311740_1_1.PDF" "D:\applications\dotNet\EasyMidia\TESTE_OCR\OUT\REPROCESSADO_NUANCE_317311740_1_2_teste.PDF" [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado ocrmypdf 12.0.1 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--list-langs'] stdout/stderr = List of available languages (166): afr amh ara asm aze aze_cyrl bel ben bod bos bre bul cat ceb ces chi_sim chi_sim_vert chi_tra chi_tra_vert chr cos cym dan dan_frak deu deu_frak div dzo ell eng enm epo equ est eus fao fas fil fin fra frk frm fry gla gle glg grc guj hat heb hin hrv hun hye iku ind isl ita ita_old jav jpn jpn_vert kan kat kat_old kaz khm kir kmr kor kor_vert lao lat lav lit ltz mal mar mkd mlt mon mri msa mya nep nld nor oci ori osd pan pol por pus que ron rus san script/Arabic script/Armenian script/Bengali script/Canadian_Aboriginal script/Cherokee script/Cyrillic script/Devanagari script/Ethiopic script/Fraktur script/Georgian script/Greek script/Gujarati script/Gurmukhi script/HanS script/HanS_vert script/HanT script/HanT_vert script/Hangul script/Hangul_vert script/Hebrew script/Japanese script/Japanese_vert script/Kannada script/Khmer script/Lao script/Latin script/Malayalam script/Myanmar script/Oriya script/Sinhala script/Syriac script/Tamil script/Telugu script/Thaana script/Thai script/Tibetan script/Vietnamese sin slk slk_frak slv snd spa spa_old sqi srp srp_latn sun swa swe syr tam tat tel tgk tgl tha tir ton tur uig ukr urd uzb uzb_cyrl vie yid yor
Running: ['C:\unpaper\unpaper.EXE', '--version'] Found unpaper 6.2 Running: ['C:\Tesseract-OCR\tesseract.EXE', '--version'] Found tesseract 5.0.0-alpha.20210506 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '--version'] Found gs 9.54.0 Scanning contents: 0%| | 0/6 [00:00<?, ?page/s][WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6.55page/s] Using Tesseract OpenMP thread limit 1 Start processing 6 pages concurrently OCR: 0%| | 0.0/6.0 [00:00<?, ?page/s][WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado [WinError 2] O sistema não pode encontrar o arquivo especificado 2 Rasterize with pnggray, rotation 0 3 Rasterize with pngmono, rotation 0 1 Rasterize with png16m, rotation 0 4 Rasterize with pngmono, rotation 0 2 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=2', '-dLastPage=2', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 
'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 3 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=3', '-dLastPage=3', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 1 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=png16m', '-dFirstPage=1', '-dLastPage=1', '-r300.000000x300.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 4 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=4', '-dLastPage=4', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 5 Rasterize with pngmono, rotation 0 6 Rasterize with pnggray, rotation 0 5 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pngmono', '-dFirstPage=5', '-dLastPage=5', '-r300.003562x300.003562', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 6 Running: ['C:\gs9.54.0\bin\gswin64c.EXE', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=6', '-dLastPage=6', '-r99.943004x99.943004', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', 'C:\Users\eduar\AppData\Local\Temp\ocrmypdf.io.ddmqbmck\origin.pdf'] 3 STREAM b'IHDR' 16 13 3 STREAM b'iCCP' 41 2296 3 iCCP profile name b'default_gray.icc' 3 Compression method 0 3 STREAM b'pHYs' 2349 9 3 STREAM b'tEXt' 2370 31 3 STREAM b'IDAT' 2413 8192 3 Rotating output by 0 2 STREAM b'IHDR' 16 13 2 STREAM b'iCCP' 41 2296 2 iCCP profile name b'default_gray.icc' 2 Compression method 0 2 STREAM b'pHYs' 2349 9 
2 STREAM b'tEXt' 2370 31
2 STREAM b'IDAT' 2413 8192
2 Rotating output by 0
4 STREAM b'IHDR' 16 13
4 STREAM b'iCCP' 41 2296
4 iCCP profile name b'default_gray.icc'
4 Compression method 0
4 STREAM b'pHYs' 2349 9
4 STREAM b'tEXt' 2370 31
4 STREAM b'IDAT' 2413 8192
4 Rotating output by 0
6 STREAM b'IHDR' 16 13
6 STREAM b'iCCP' 41 2296
6 iCCP profile name b'default_gray.icc'
6 Compression method 0
6 STREAM b'pHYs' 2349 9
6 STREAM b'tEXt' 2370 31
6 STREAM b'IDAT' 2413 8192
6 Rotating output by 0
5 STREAM b'IHDR' 16 13
5 STREAM b'iCCP' 41 2296
5 iCCP profile name b'default_gray.icc'
5 Compression method 0
5 STREAM b'pHYs' 2349 9
5 STREAM b'tEXt' 2370 31
5 STREAM b'IDAT' 2413 8192
5 Rotating output by 0
3 background removal skipped on mono page
3 background removal skipped on mono page
4 background removal skipped on mono page
5 background removal skipped on mono page
3 STREAM b'IHDR' 16 13
3 STREAM b'pHYs' 41 9
3 STREAM b'IDAT' 62 8192
4 background removal skipped on mono page
3 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpubwxtqst\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpubwxtqst\output.pbm']
5 background removal skipped on mono page
4 STREAM b'IHDR' 16 13
4 STREAM b'pHYs' 41 9
4 STREAM b'IDAT' 62 8192
6 STREAM b'IHDR' 16 13
6 STREAM b'pHYs' 41 9
6 STREAM b'IDAT' 62 8192
5 STREAM b'IHDR' 16 13
5 STREAM b'pHYs' 41 9
5 STREAM b'IDAT' 62 8192
4 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpl5ohj40q\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpl5ohj40q\output.pbm']
6 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '99.943004', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm']
2 STREAM b'IHDR' 16 13
2 STREAM b'pHYs' 41 9
2 STREAM b'IDAT' 62 8192
5 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '300.003562', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmpbh3g3se7\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmpbh3g3se7\output.pbm']
2 Running: ['C:\unpaper\unpaper.EXE', '-v', '--dpi', '99.943004', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', 'C:\Users\eduar\AppData\Local\Temp\tmp4i1yrr38\input.pnm', 'C:\Users\eduar\AppData\Local\Temp\tmp4i1yrr38\output.pgm']
6 stdout/stderr = unpaper 6.2
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Processing sheet #1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm -> C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm
input-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\input.pnm
output-file for sheet 1: C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm
sheet size: 826x1169
...
noise-filter ... deleted 107 clusters.
blur-filter... deleted 0 pixels.
writing output.
OCR: 0%| | 0.0/6.0 [00:02<?, ?page/s]
An exception occurred while executing the pipeline
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "c:\python39\lib\shutil.py", line 616, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] O arquivo já está sendo usado por outro processo: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "c:\python39\lib\tempfile.py", line 801, in onerror
    _os.unlink(path)
PermissionError: [WinError 32] O arquivo já está sendo usado por outro processo: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "c:\python39\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 189, in exec_page_sync
    ocr_image, preprocess_out = make_intermediate_images(
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 158, in make_intermediate_images
    ocr_image = preprocess(
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 105, in preprocess
    image = preprocess_clean(image, page_context)
  File "c:\python39\lib\site-packages\ocrmypdf\_pipeline.py", line 486, in preprocess_clean
    unpaper.clean(
  File "c:\python39\lib\site-packages\ocrmypdf\_exec\unpaper.py", line 134, in clean
    run(input_file, output_file, dpi=dpi, mode_args=unpaper_args)
  File "c:\python39\lib\site-packages\ocrmypdf\_exec\unpaper.py", line 100, in run
    raise SubprocessOutputError(
  File "c:\python39\lib\tempfile.py", line 826, in __exit__
    self.cleanup()
  File "c:\python39\lib\tempfile.py", line 830, in cleanup
    self._rmtree(self.name)
  File "c:\python39\lib\tempfile.py", line 812, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "c:\python39\lib\shutil.py", line 740, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\python39\lib\shutil.py", line 618, in _rmtree_unsafe
    onerror(os.unlink, fullname, sys.exc_info())
  File "c:\python39\lib\tempfile.py", line 804, in onerror
    cls._rmtree(path)
  File "c:\python39\lib\tempfile.py", line 812, in _rmtree
    _shutil.rmtree(name, onerror=onerror)
  File "c:\python39\lib\shutil.py", line 740, in rmtree
    return _rmtree_unsafe(path, onerror)
  File "c:\python39\lib\shutil.py", line 599, in _rmtree_unsafe
    onerror(os.scandir, path, sys.exc_info())
  File "c:\python39\lib\shutil.py", line 596, in _rmtree_unsafe
    with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] O nome do diretório é inválido: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 374, in run_pipeline
    exec_concurrent(context, executor)
  File "c:\python39\lib\site-packages\ocrmypdf\_sync.py", line 271, in exec_concurrent
    executor(
  File "c:\python39\lib\site-packages\ocrmypdf\_concurrent.py", line 82, in __call__
    self._execute(
  File "c:\python39\lib\site-packages\ocrmypdf\builtin_plugins\concurrency.py", line 132, in _execute
    for result in results:
  File "c:\python39\lib\multiprocessing\pool.py", line 870, in next
    raise value
NotADirectoryError: [Errno 20] O nome do diretório é inválido: 'C:\Users\eduar\AppData\Local\Temp\tmpevvec3s4\output.pgm'
Test file: 317311740_1_1.PDF
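Reading the traceback above: the worker appears to be raising SubprocessOutputError from unpaper.py when the temporary directory's cleanup starts. Cleanup then fails with WinError 32 (the file is in use by another process) on output.pgm. The Python 3.9 onerror handler retries by treating that file as a directory, which produces the WinError 267 (invalid directory name) that is finally reported instead of the unpaper error. Upgrading to Python 3.10 is the fix suggested in this thread. For anyone who has to stay on 3.9, the following is only a minimal sketch of the general idea: retry the delete and never let a cleanup failure hide the original error. It is not OCRmyPDF's code. The names rmtree_with_retry and run_in_temp_dir are hypothetical, and it assumes the lock on the file is short-lived.

```python
import shutil
import sys
import tempfile
import time
from pathlib import Path


def rmtree_with_retry(path, attempts=5, delay=0.2, remove=shutil.rmtree):
    """Delete a directory tree, retrying while a file in it is still locked.

    Assumption: the lock (WinError 32) is transient, e.g. a child process
    that has not yet released its output file. Returns the attempt number
    that succeeded; re-raises PermissionError after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            remove(path)
            return attempt
        except PermissionError:
            if attempt == attempts:
                raise
            time.sleep(delay)


def run_in_temp_dir(work):
    """Call work(tmp_path) in a fresh temp directory and clean up afterwards.

    Uses mkdtemp() plus explicit cleanup instead of TemporaryDirectory, so a
    failed cleanup is only reported on stderr and cannot replace an exception
    raised by work().
    """
    tmp = Path(tempfile.mkdtemp())
    try:
        return work(tmp)
    finally:
        try:
            rmtree_with_retry(tmp)
        except OSError as exc:
            print(f"could not remove {tmp}: {exc}", file=sys.stderr)
```

With this shape, an error raised inside work() (for example by a failing external tool) propagates unchanged even if the directory cannot be removed, so the log shows the real cause rather than a secondary cleanup error.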