I set up Paperless following the Docker instructions. After installation it worked fine on a few PDFs until I got to my vehicle registration. The document is entirely in English, but Paperless seems to detect it as Catalan (cat/ca), which is not installed. Is there a setting to force the software to use only English, or to skip OCR instead of failing to process the document?
I see this in the 0.3.3 changelog, but I don't see where to set the default language: "Timezone, items per page, and default language are now all configurable..." I have "PAPERLESS_OCR_LANGUAGES=" [set to blank] in the yml file used to install Paperless.
Here's a snippet of the error. I can gather full logs if that would help, but I think the issue is that it is somehow detecting another language and trying to OCR in that language, even though I haven't configured any OCR language other than English.
Processing sheet #1: /tmp/paperless/paperless-1kv2atz2/convert-0000.pnm -> /tmp/paperless/paperless-1kv2atz2/convert-0000.unpaper.pnm
[pgm_pipe @ 0x558c05dd90c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x558c05ddac40] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x558c05ddac40] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
Parsing for cat
Processing sheet #1: /tmp/paperless/paperless-1kv2atz2/convert-0000.unpaper.pnm -> /tmp/paperless/paperless-1kv2atz2/convert-0000.unpaper.unpaper.pnm
Processing sheet #1: /tmp/paperless/paperless-1kv2atz2/convert-0000.pnm -> /tmp/paperless/paperless-1kv2atz2/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55dd25c170c0] [pgm_pipe @ 0x55ccf30aa0c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55dd25c18c40] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55dd25c18c40] Encoder did not produce proper pts, making some up.
[image2 @ 0x55ccf30abc40] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55ccf30abc40] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
Parsing for cat
PARSE FAILURE for /consume/Registration.pdf: The guessed language (ca) is not available in this instance of Tesseract.
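To be concrete about the behavior I'm hoping for, here is a rough sketch in Python. This is not Paperless's actual code: the function name `choose_ocr_language` and the `ISO_TO_TESSERACT` mapping are hypothetical and only illustrate falling back to English when the guessed language isn't installed, instead of failing the whole document.

```python
# Hypothetical sketch, NOT taken from the Paperless source.
# Maps two-letter language guesses (e.g. "ca") to Tesseract's
# three-letter codes (e.g. "cat"). Only a few entries, for illustration.
ISO_TO_TESSERACT = {"en": "eng", "ca": "cat", "de": "deu"}


def choose_ocr_language(guessed, installed, default="eng"):
    """Return the Tesseract language code to OCR with.

    Falls back to `default` when the guessed language is unknown
    or is not among the installed Tesseract languages.
    """
    candidate = ISO_TO_TESSERACT.get(guessed)
    if candidate in installed:
        return candidate
    return default


# My situation: the guess is "ca" but only "eng" is installed.
print(choose_ocr_language("ca", {"eng"}))
```

With something like this, my registration would simply be OCRed as English rather than producing the PARSE FAILURE above.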