ocrmypdf / OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
http://ocrmypdf.readthedocs.io/
Mozilla Public License 2.0

How to add languages for tesseract-ocr in the image? #33

Closed xiongyw closed 8 years ago

xiongyw commented 8 years ago

Sorry, I am new to Docker. I just pulled the latest image and want to use the language chi_sim in tesseract, but it seems support for this language is not installed by default, as it complains:

~/work/tmp$ docker run -v "$(pwd):/home/docker" ocrmypdf 31.pdf 31-ocr.pdf -l chi_sim
The installed version of tesseract does not have language data for the following requested languages: chi_sim

It seems the tesseract used by the Docker image is separate from my system's tesseract-ocr package, for which I installed the language pack with "apt-get install tesseract-ocr-chi-sim".

How do I update the Docker image to include the desired language support? And how can I check which languages are supported (like "tesseract --list-langs" on the host system)?

Thanks a lot.

jbarlow83 commented 8 years ago

I'll add more languages next time I update ocrmypdf.

The Dockerfile specifies how the container was built. It provides its own copy of tesseract and will not use the one on your machine, or anything else about your machine. It's like a lightweight virtual machine.

You can jump inside an ocrmypdf container, modify it, and save the changes as your own private image. (A container is an instance of an image.)

In your case it would go something like this (not tested, made up on the spot):

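$ docker run -t -i --entrypoint /bin/bash ocrmypdf
root@container:/# apt-get update
root@container:/# apt-get install -y tesseract-ocr-chi-sim
root@container:/# exit
$ docker ps -l -q    # prints the ID of the container you just exited
$ docker commit -m "Added Chinese simplified" -a "Your Name" <container-id> <new-image-name>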

See here: https://docs.docker.com/engine/userguide/dockerimages/

jbarlow83 commented 8 years ago

I decided to produce a second version of the container which provides all of Tesseract's languages.

You can use this command to download it. Then Chinese (Simplified and Traditional) will be available.

docker pull jbarlow83/ocrmypdf-polyglot
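
The original question also asked how to check which languages an image supports. A rough sketch of one way to do that follows. It assumes that `tesseract` is on the PATH inside the image and that the image's entrypoint can be overridden with `--entrypoint`; neither is confirmed in this thread. The function names `has_lang` and `image_langs` are made up for illustration.

```shell
# Hypothetical helper: succeeds if the language code in $1 appears in
# `tesseract --list-langs` output supplied on stdin.
has_lang() {
    # Drop the "List of available languages" header, then match whole lines only
    # so that "chi" does not match "chi_sim".
    grep -v '^List of' | grep -qx -- "$1"
}

# Hypothetical helper: print the language list from the image named in $1.
# Assumes tesseract is on PATH in the image. 2>&1 is used because some
# tesseract versions print the list to stderr.
image_langs() {
    docker run --rm --entrypoint tesseract "$1" --list-langs 2>&1
}

# Example usage (needs docker and the pulled image):
#   image_langs jbarlow83/ocrmypdf-polyglot | has_lang chi_sim && echo "chi_sim available"
```

If the check succeeds, the command from the original report should work with the image name swapped for jbarlow83/ocrmypdf-polyglot.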