Closed MrEaoP closed 5 years ago
I have the same issue with (only) some documents:
"The guessed language (de) is not available in this instance of Tesseract."
But deu is installed.
Does anyone know why it is detecting a two-letter language code instead of the required three-letter one?
same issue here with fr and fra ...
Okay, got it.
The error message is a little bit misleading because of this line:
https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/src/paperless_tesseract/parsers.py#L208
It's missing a conversion with ISO639[guessed_language], so the message shows the two-letter code instead of the three-letter one that Tesseract uses.
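To illustrate the missing conversion, here is a minimal sketch. It assumes ISO639 is a plain dict from two-letter codes to Tesseract's three-letter codes; the table excerpt and the function name describe_missing_language are hypothetical and not taken from parsers.py.

```python
# Hypothetical excerpt of a two-letter -> three-letter language table.
# The real ISO639 table in paperless may look different.
ISO639 = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "sv": "swe",
}


def describe_missing_language(guessed_language, available_languages):
    """Return None if the guessed language is usable, else an error message.

    guessed_language is the two-letter code from language detection;
    available_languages is a set of three-letter Tesseract codes.
    """
    tesseract_code = ISO639.get(guessed_language)
    if tesseract_code is None:
        return f"The guessed language ({guessed_language}) has no known Tesseract code."
    if tesseract_code in available_languages:
        return None
    # Report the converted code so the message matches what has to be installed.
    return (
        f"The guessed language ({guessed_language}, Tesseract code "
        f"{tesseract_code}) is not available in this instance of Tesseract."
    )
```

With this wording, a missing German pack would be reported as deu rather than de, which points directly at the language data that needs installing.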
It seems it's the same issue as described in #406 .
After checking the consumer container, I figured out that deu is not installed, even though it is configured in docker-compose.env with PAPERLESS_OCR_LANGUAGES=deu.
I thought that was fixed with #413, hmm...
@MrEaoP @ahyear You could try to install your required language by hand, as described in #406, and test again.
I'll try to investigate a little deeper, but I'm not familiar with Python, Linux, or Docker.
After some trial and error, I can say it's not the same issue as #406.
It seems that the problem is this line: https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/scripts/docker-entrypoint.sh#L85
The command apk info produces some warnings, which are then interpreted incorrectly.
I replaced it with the following line, and everything works as it should:
if ! apk --no-cache info "$pkg" > /dev/null 2>&1; then
(Remember to rebuild the Docker images, e.g. with docker-compose up -d --build.)
I don't know if it's the best solution, but it works. Maybe I can get some feedback on it.
I'll give it a try, but can't promise exactly when I will have the time.
I'll try it.
Do I also have to add --no-cache to this line: https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/scripts/docker-entrypoint.sh#L82 ?
paperless/scripts/docker-entrypoint.sh, line 82 in 6e115bf:
if apk info -e "$pkg" > /dev/null 2>&1; then
No, only edit the line I mentioned. Line 82 checks whether the desired package is already installed, which works fine on my machine.
OK, I tried it and got:
tesseract --list-langs
List of available languages (4):
eng
equ
osd
fra
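For anyone who wants to automate this check, here is a rough sketch that compares the languages from PAPERLESS_OCR_LANGUAGES against a listing like the one above. The helper names parse_list_langs and missing_languages are made up for illustration, and the header format is assumed to match the output quoted in this thread.

```python
def parse_list_langs(output):
    """Extract language codes from text printed by `tesseract --list-langs`."""
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.startswith("List of available languages"):
            continue
        codes.add(line)
    return codes


def missing_languages(configured, output):
    """Return the configured codes that do not appear in the listing.

    configured is the space-separated PAPERLESS_OCR_LANGUAGES value,
    e.g. "swe eng".
    """
    available = parse_list_langs(output)
    return [lang for lang in configured.split() if lang not in available]
```

The listing text could be captured with subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True). Depending on the Tesseract version, the list may arrive on stdout or stderr, so checking both is the safer assumption.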
Alright, no more errors in the log, and it seems to work! It also scanned a document in Swedish, without a hitch. Many thanks @schwabelbauch !
Yes, like the title says: I drop a PDF in Swedish, and it can't be added.
docker-compose.env has the line "PAPERLESS_OCR_LANGUAGES=swe eng"
and so does the web server via the file docker-compose.yml
The log says:
OCRing the document
Parsing for eng
Parsing for swe
PARSE FAILURE for /consume/filename.pdf: The guessed language (sv) is not available in this instance of Tesseract.
"sv" = "Swe" = Swedish, in case you were wondering...