the-paperless-project / paperless

Scan, index, and archive all of your paper documents
GNU General Public License v3.0
7.85k stars, 498 forks

Consumer can't OCR Swedish document #528

Closed: MrEaoP closed this issue 5 years ago

MrEaoP commented 5 years ago

Yes, like the title says: I drop a PDF in Swedish into the consume folder, and it can't be added.

docker-compose.env has the line "PAPERLESS_OCR_LANGUAGES=swe eng"

and so does the web server via the file docker-compose.yml

The log says:

OCRing the document
Parsing for eng
Parsing for swe
PARSE FAILURE for /consume/filename.pdf: The guessed language (sv) is not available in this instance of Tesseract.

"sv" = "Swe" = Swedish, in case you were wondering...

schwabelbauch commented 5 years ago

I have the same issue with (only) some documents: "The guessed language (de) is not available in this instance of Tesseract." But deu is installed.

Does anyone know why it is detecting a two-letter language code instead of the required three-letter one?

ahyear commented 5 years ago

same issue here with fr and fra ...

schwabelbauch commented 5 years ago

Okay, got it.

The error message is a little misleading because of this line: https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/src/paperless_tesseract/parsers.py#L208 It's missing a conversion with ISO639[guessed_language], so the message prints the two-letter code instead of the three-letter one.
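To illustrate the missing conversion, here is a minimal sketch, not the project's actual code. The name ISO639 comes from the comment above; the table contents, the helper names to_tesseract_code and unavailable_message, and the message wording for unknown codes are assumptions made for illustration.

```python
# Hypothetical mapping from two-letter ISO 639-1 codes (what a language
# guesser typically returns) to the three-letter codes Tesseract uses.
# The entries are illustrative, not copied from parsers.py.
ISO639 = {"en": "eng", "de": "deu", "fr": "fra", "sv": "swe"}


def to_tesseract_code(guessed_language):
    """Return the three-letter code, or None if the guess is not in the table."""
    return ISO639.get(guessed_language)


def unavailable_message(guessed_language):
    """Build an error message that names the Tesseract code when it is known."""
    code = to_tesseract_code(guessed_language)
    if code is None:
        return "The guessed language ({}) has no known Tesseract code.".format(
            guessed_language
        )
    return (
        "The guessed language ({}, Tesseract code {}) is not available "
        "in this instance of Tesseract.".format(guessed_language, code)
    )


print(unavailable_message("sv"))
```

With a message like this, a reader sees "swe" next to "sv" and can check it directly against the configured PAPERLESS_OCR_LANGUAGES value.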

It seems to be the same issue as described in #406. After checking the consumer container, I found that deu is not installed even though it is configured in docker-compose.env with PAPERLESS_OCR_LANGUAGES=deu. I thought this was fixed by #413, hmm...

@MrEaoP @ahyear You could try installing your required language by hand, as described in #406, and test again.

I'll try to investigate a little deeper, but I'm not familiar with Python, Linux, or Docker.

schwabelbauch commented 5 years ago

After some trial and error, I can say it's not the same issue as #406.

It seems that the problem is this line https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/scripts/docker-entrypoint.sh#L85

The apk info command produces some warnings, which are then misinterpreted. I replaced the line with the following, and everything works as it should:

if ! apk --no-cache info "$pkg" > /dev/null 2>&1; then

(Remember to rebuild the Docker images, e.g. with docker-compose up -d --build.)

I don't know if it's the best solution, but it works. Maybe I can get some feedback on this solution.

MrEaoP commented 5 years ago

I'll give it a try, but can't promise exactly when I will have the time.

ahyear commented 5 years ago

i'll try it,

do I also have to add --no-cache to this line: https://github.com/the-paperless-project/paperless/blob/6e115bf2e6261f37dd92d6027c7e0c87c0d1007c/scripts/docker-entrypoint.sh#L82 ?

schwabelbauch commented 5 years ago

ahyear wrote: "i'll try it, do I also have to add --no-cache to this line?"

paperless/scripts/docker-entrypoint.sh, line 82 in 6e115bf:

if apk info -e "$pkg" > /dev/null 2>&1; then

No, only edit the line I mentioned. Line 82 checks whether the desired package is already installed, which works fine on my machine.

ahyear commented 5 years ago

OK, I tried it and get:

tesseract --list-langs
List of available languages (4):
eng
equ
osd
fra
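A check like the one above can also be scripted. The following is a rough sketch that compares a PAPERLESS_OCR_LANGUAGES value against the text printed by tesseract --list-langs. The function names are hypothetical, and it assumes the output consists of one header line followed by whitespace-separated language codes, as in the output quoted above.

```python
def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumes the first line is a header such as
    "List of available languages (4):" and that every remaining
    whitespace-separated token is a language code.
    """
    lines = output.strip().splitlines()
    codes = []
    for line in lines[1:]:
        codes.extend(line.split())
    return codes


def missing_languages(configured, output):
    """Return the configured codes (space-separated) that Tesseract lacks."""
    available = set(parse_list_langs(output))
    return [code for code in configured.split() if code not in available]


sample = """List of available languages (4):
eng
equ
osd
fra
"""

print(missing_languages("fra eng", sample))  # prints []
print(missing_languages("swe eng", sample))  # prints ['swe']
```

An empty list means every configured language is installed in the container; anything else names the packages that the entrypoint script failed to install.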

MrEaoP commented 5 years ago

Alright, no more errors in the log, and it seems to work! It also scanned a document in Swedish, without a hitch. Many thanks @schwabelbauch !