tleyden / open-ocr

Run your own OCR-as-a-Service using Tesseract and Docker
Apache License 2.0
1.33k stars 223 forks

Any plan to update to Tesseract 4.0? #83

Closed (chavenor closed this issue 7 years ago)

chavenor commented 7 years ago

Is there any plan to update to Tesseract 4.0?

tleyden commented 7 years ago

Yes. Do you know if the Tesseract team is providing docker images?

chavenor commented 7 years ago

I do not know if they are going to provide Docker images. I may be wrong on this one, but Tesseract 4.0 reportedly uses about 10x more CPU than the current 3.x version. If that is true, will we see a major slowdown when running it inside a Docker container?

I'm also wondering how the training processes will work inside a container if the hardware changes?

Thoughts?

tleyden commented 7 years ago

It works on pre-trained data, so the training process shouldn't be an issue.

I think the best approach would be to be able to switch between either tesseract 3 or 4 and let the user specify it somehow.

chavenor commented 7 years ago

Agreed. I think that would likely cover all past and foreseeable future use cases.

speedfl commented 7 years ago

Hello.

In https://github.com/tesseract-ocr/tesseract/issues/817, someone proposes a Docker image for 4.0.0:

https://hub.docker.com/r/xlight/docker-tesseract4/~/dockerfile/

Regards

tleyden commented 7 years ago

OK, thanks for the heads-up.

speedfl commented 7 years ago

Just for information, I created an issue on Tesseract, https://github.com/tesseract-ocr/tesseract/issues/893, because I was not able to get it working.

I will let you know once I have an answer.

speedfl commented 7 years ago

I gave it a try with base64 and Tesseract 4. Man, it rocks! It takes a little longer than Tesseract 3, but where I had approximately a 60% success rate before, I got 100% with the new version of Tesseract.

Here is the Dockerfile:

# Pinned to 16.04 because libpng12-dev (installed below) is packaged for that release
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y \
    autoconf \
    automake \
    libtool \
    autoconf-archive \
    pkg-config \
    libpng12-dev \
    libjpeg8-dev \
    libtiff5-dev \
    zlib1g-dev \
    libicu-dev \
    libpango1.0-dev \
    libcairo2-dev \
    git \
    golang \
    gcc \
    curl && \
    rm -rf /var/lib/apt/lists/*

RUN curl http://www.leptonica.org/source/leptonica-1.74.1.tar.gz -o leptonica-1.74.1.tar.gz && \
    tar -zxvf leptonica-1.74.1.tar.gz && \
    cd leptonica-1.74.1 && ./configure && make && make install && \
    cd .. && rm -rf leptonica*

RUN git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git && \
    cd tesseract && \
    ./autogen.sh && \
    ./configure && \
    LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make && \
    make install && \
    ldconfig && \
    make training && \
    make training-install && \
    cd .. && rm -rf tesseract

# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
    mv eng.traineddata /usr/local/share/tessdata/

RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
    mv fra.traineddata /usr/local/share/tessdata/

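# go get open-ocr
# GOPATH must be set for go get and for the build steps below;
# /opt/go is an assumed location, not one stated in this thread
ENV GOPATH /opt/go
RUN go get -u -v -t github.com/tleyden/open-ocr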

# build open-ocr-httpd binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-httpd && go build -v -o open-ocr-httpd && cp open-ocr-httpd /usr/bin

# build open-ocr-worker binary and copy it to /usr/bin
RUN cd $GOPATH/src/github.com/tleyden/open-ocr/cli-worker && go build -v -o open-ocr-worker && cp open-ocr-worker /usr/bin

If we want all the languages, we can replace:

# Get basic traineddata
RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata && \
    mv eng.traineddata /usr/local/share/tessdata/

RUN curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata && \
    mv fra.traineddata /usr/local/share/tessdata/

With:

RUN git clone https://github.com/tesseract-ocr/tessdata && \
    mv -v tessdata/* /usr/local/share/tessdata/ && \
    rm -rf tessdata
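
As a middle ground between hard-coding two languages and cloning the whole tessdata repository, a small loop could fetch only a chosen list of languages. The following is a minimal sketch, not part of the posted Dockerfile: the function names `traineddata_url` and `fetch_traineddata` are hypothetical, and the URL pattern is simply the one used in the `curl` commands above.

```shell
#!/bin/sh
# Build the download URL for one language's traineddata file.
# The URL pattern is copied from the curl commands in the Dockerfile above.
traineddata_url() {
  echo "https://github.com/tesseract-ocr/tessdata/raw/master/$1.traineddata"
}

# Hypothetical helper: download each requested language into a target directory.
# Usage: fetch_traineddata <dest-dir> <lang> [<lang> ...]
fetch_traineddata() {
  dest="$1"
  shift
  for lang in "$@"; do
    # -f makes curl fail on HTTP errors, -L follows GitHub's redirect
    curl -fsSL "$(traineddata_url "$lang")" -o "$dest/$lang.traineddata" || return 1
  done
}

# Example call (language list is illustrative):
# fetch_traineddata /usr/local/share/tessdata eng fra deu
```

Saved as a script and copied into the image, this could be called from a single `RUN` step, which keeps the image smaller than a full clone when only a few languages are needed.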

Now we should find a way to tell docker-compose to use Tesseract 3 or Tesseract 4 based on the user's choice.

You could maybe create a Docker image named tleyden5iwx/open-ocr-4.

Something should change here: https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L25 https://github.com/tleyden/open-ocr/blob/master/docker-compose/docker-compose.yml#L35

With an environment variable like:

 openocrworker:
    image: tleyden5iwx/${OCR_VERSION}
    volumes:
      - ./scripts/:/opt/open-ocr/
    dns: ["8.8.8.8"]
    depends_on:
      - rabbitmq
    command: "/opt/open-ocr/open-ocr-worker -amqp_uri amqp://admin:Phaish9ohbaidei6oole@rabbitmq/"
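
One way the switch could be driven from the command line is sketched below. It assumes the compose variable is spelled `OCR_VERSION` and holds the image name after `tleyden5iwx/`. The helper `ocr_version_for` is hypothetical, and the image names `open-ocr` and `open-ocr-4` are assumptions based on the naming suggested above, not published images confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: map a Tesseract major version (3 or 4) to the value
# that docker-compose would substitute for the OCR_VERSION variable.
# Both image names are assumptions based on the naming proposed in this thread.
ocr_version_for() {
  case "$1" in
    3) echo "open-ocr" ;;
    4) echo "open-ocr-4" ;;
    *) echo "unsupported Tesseract version: $1" >&2; return 1 ;;
  esac
}

# Example: choose Tesseract 4 for this run, then start the stack.
# OCR_VERSION="$(ocr_version_for 4)" docker-compose up
```

Rejecting unknown versions up front avoids a confusing "image not found" error from Docker later on.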

If you want, I can give it a try. However, I don't know how to upload an image to Docker Hub.

tleyden commented 7 years ago

> but where I had approximately a 60% success rate before, I got 100% with the new version of Tesseract.

Wow!!

tleyden commented 7 years ago

> Here is the Dockerfile

Can you open a PR that adds that Dockerfile to this repo? It should be moved to this repo rather than its current location: https://github.com/tleyden/docker/blob/master/open-ocr/Dockerfile

> Now we should find a way to tell docker-compose to use Tesseract 3 or Tesseract 4 based on the user's choice.

Yep, that makes sense.

speedfl commented 7 years ago

I will do some tests and create a PR with everything (the switch, the Dockerfiles, and a script to build the Docker images).

speedfl commented 7 years ago

I am continuing development. However, it seems that I now have an issue with docker-compose. I will let you know.

tleyden commented 7 years ago

This was merged in https://github.com/tleyden/open-ocr/pull/90.