tleyden / docker

Docker files

Using unofficial Tesseract4 PPA instead of compiling it #13

Open minyk opened 7 years ago

minyk commented 7 years ago

Hi, @tleyden @speedfl

Thanks for the Tesseract4 version of Open-OCR! I have a suggestion for the Tesseract4 Dockerfile. As the issue title says, what do you think about using an unofficial PPA for Tesseract4 instead of compiling it? The PPA is https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr. Enable the PPA and then install the packages just as for Tesseract3:

# Enable the unofficial Tesseract4 PPA (-y skips the confirmation prompt)
RUN apt-get update && apt-get install -y software-properties-common && add-apt-repository -y ppa:alex-p/tesseract-ocr && apt-get update

# Get tesseract-ocr packages
RUN apt-get install -y \
  libleptonica-dev \
  libtesseract4 \
  libtesseract-dev \
  tesseract-ocr

# Get language data.
RUN apt-get install -y \
  tesseract-ocr-ara \
  tesseract-ocr-bel \
  tesseract-ocr-ben \
  tesseract-ocr-bul \
  tesseract-ocr-ces \
  tesseract-ocr-dan \
  tesseract-ocr-deu \
  tesseract-ocr-ell \
  tesseract-ocr-fin \
  tesseract-ocr-fra \
  tesseract-ocr-heb \
  tesseract-ocr-hin \
  tesseract-ocr-ind \
  tesseract-ocr-isl \
  tesseract-ocr-ita \
  tesseract-ocr-jpn \
  tesseract-ocr-kor \
  tesseract-ocr-nld \
  tesseract-ocr-nor \
  tesseract-ocr-pol \
  tesseract-ocr-por \
  tesseract-ocr-ron \
  tesseract-ocr-rus \
  tesseract-ocr-spa \
  tesseract-ocr-swe \
  tesseract-ocr-tha \
  tesseract-ocr-tur \
  tesseract-ocr-ukr \
  tesseract-ocr-vie \
  tesseract-ocr-chi-sim \
  tesseract-ocr-chi-tra \
  tesseract-ocr-eng

This way, we don't need to install the dev packages required to compile Tesseract in the Docker image.
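A minimal sketch of that idea, assuming the image only needs the `tesseract` command-line binary at run time and nothing in it is compiled against libtesseract. The `-dev` packages are left out and apt resolves the shared libraries as dependencies. The `ubuntu:16.04` base tag and the single English language pack are assumptions for illustration, not taken from the repository, and whether the PPA still publishes packages for that series is not something this thread confirms.

```dockerfile
# Assumption: a xenial base image, matching the package version reported in this thread.
FROM ubuntu:16.04

# Enable the PPA, install runtime packages only, and drop the apt lists
# in the same layer so they do not end up in the image.
RUN apt-get update \
 && apt-get install -y software-properties-common \
 && add-apt-repository -y ppa:alex-p/tesseract-ocr \
 && apt-get update \
 && apt-get install -y tesseract-ocr tesseract-ocr-eng \
 && rm -rf /var/lib/apt/lists/*
```

If something in the image does link against Tesseract at build time, libleptonica-dev and libtesseract-dev would still be needed, as in the snippet above.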

Thanks.

speedfl commented 7 years ago

Hello,

I first thought about using it. However if you look at the package details: https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr/+packages

You can see that the unofficial package does not build for every Ubuntu series (xenial, trusty, and so on). In addition, I was not able to get it working with ppa:alex-p/tesseract-ocr.

Did you try it?

I am following it closely, and if you have a proposal it is welcome :) because it would reduce the build time significantly.

minyk commented 7 years ago

Hi, @speedfl

Actually, I built my own tesseract4 images in March with this configuration: https://github.com/minyk/open-ocr/blob/feature/tesseract4.00alpha/docker-compose/open-ocr/Dockerfile No problems occurred during docker build at that time.

I rebuilt the image today, and tesseract4 was installed as tesseract - 4.00~git1851-10e04ff-1ppa1~xenial1. Maybe the build of the currently published version is broken, so apt-get installs an older one.
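One way to check which build apt actually selects is to print the candidate version while the image is being built. The following is only a sketch: the `ubuntu:16.04` base tag is an assumption chosen to match the xenial version string above.

```dockerfile
# Assumption: xenial base image.
FROM ubuntu:16.04

RUN apt-get update \
 && apt-get install -y software-properties-common \
 && add-apt-repository -y ppa:alex-p/tesseract-ocr \
 && apt-get update

# Show the installed and candidate versions and the repository each comes from.
RUN apt-cache policy tesseract-ocr libtesseract4

# Install, then print the version of the binary that ended up in the image.
RUN apt-get install -y tesseract-ocr && tesseract --version
```

If a specific build is wanted, apt-get also accepts the `package=version` form, provided that version is still available from the repository.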

speedfl commented 7 years ago

OK, thanks for your help. I will try your Dockerfile this evening, and if it works I will create a PR.

speedfl commented 7 years ago

Hello @minyk

It seems to work.

So you can go ahead with a PR :)

A recap of the Dockerfile:

FROM ubuntu

ENV GOPATH /opt/go

# Get git golang and gcc packages
RUN apt-get update && apt-get install -y \
  software-properties-common \
  git \
  golang \
  gcc

# Enable the unofficial Tesseract4 PPA (-y skips the confirmation prompt)
RUN add-apt-repository -y ppa:alex-p/tesseract-ocr && apt-get update

# Get tesseract-ocr packages
RUN apt-get install -y \
  libleptonica-dev \
  libtesseract4 \
  libtesseract-dev \
  tesseract-ocr

# Get language data.
RUN apt-get install -y \
  tesseract-ocr-ara \
  tesseract-ocr-bel \
  tesseract-ocr-ben \
  tesseract-ocr-bul \
  tesseract-ocr-ces \
  tesseract-ocr-dan \
  tesseract-ocr-deu \
  tesseract-ocr-ell \
  tesseract-ocr-fin \
  tesseract-ocr-fra \
  tesseract-ocr-heb \
  tesseract-ocr-hin \
  tesseract-ocr-ind \
  tesseract-ocr-isl \
  tesseract-ocr-ita \
  tesseract-ocr-jpn \
  tesseract-ocr-kor \
  tesseract-ocr-nld \
  tesseract-ocr-nor \
  tesseract-ocr-pol \
  tesseract-ocr-por \
  tesseract-ocr-ron \
  tesseract-ocr-rus \
  tesseract-ocr-spa \
  tesseract-ocr-swe \
  tesseract-ocr-tha \
  tesseract-ocr-tur \
  tesseract-ocr-ukr \
  tesseract-ocr-vie \
  tesseract-ocr-chi-sim \
  tesseract-ocr-chi-tra \
  tesseract-ocr-eng

RUN mkdir -p $GOPATH

# go get open-ocr
RUN go get -u -v -t github.com/tleyden/open-ocr

# build open-ocr-httpd binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-httpd && go build -v -o open-ocr-httpd && cp open-ocr-httpd /usr/bin

# build open-ocr-worker binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-worker && go build -v -o open-ocr-worker && cp open-ocr-worker /usr/bin