otiai10 / gosseract

Go package for OCR (Optical Character Recognition), by using Tesseract C++ library
https://pkg.go.dev/github.com/otiai10/gosseract
MIT License

Wrong JPEG library version: library is 90, caller expects 80 #205

Open lzw5399 opened 4 years ago

lzw5399 commented 4 years ago

Summary

Hi, I wrote an OCR server based on gosseract (the front-end page is based on https://github.com/otiai10/ocrserver). Everything described below can be found in https://github.com/lzw5399/ocrserver.

I need to run OCR on PDFs that arrive as base64 strings, so I added the https://github.com/gen2brain/go-fitz dependency to convert each PDF to images (PNG) page by page, and then use gosseract to recognize each image. After go-fitz was added, however, the JPEG-related functionality stopped working properly; there seems to be a version conflict. Thanks in advance for any help ^-^

Reproducibility

Reproducibility Frequency

Reproducible Dockerfile

# build stage
FROM golang:1.15 as builder

ENV GO111MODULE=on \
    GOPROXY=https://goproxy.cn,direct

WORKDIR /app

COPY . .

RUN rm -rf /etc/apt/sources.list && \
    echo "deb https://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free" >> /etc/apt/sources.list && \
    apt-get update

RUN apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr

RUN GOOS=linux GOARCH=amd64 go build .

RUN mkdir publish && cp bank-ocr publish && \
    cp -r app publish && mkdir publish/config && \
    cp config/appsettings.yaml publish/config/

FROM ubuntu:20.04

WORKDIR /app

COPY --from=builder /app/publish .

RUN rm -rf /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse'>>/etc/apt/sources.list

RUN apt-get update \
  && apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr \
    mupdf \
    mupdf-tools

RUN apt-get install -y \
  tesseract-ocr-eng \
  tesseract-ocr-chi-sim

ENV GIN_MODE=release \
    PORT=8080

EXPOSE 8080

ENTRYPOINT ["./bank-ocr"]

otiai10 commented 3 years ago

FYI

# Build stage
FROM golang:1.15 as builder

ENV GO111MODULE=on
RUN rm -rf /etc/apt/sources.list && \
    echo "deb https://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free" >> /etc/apt/sources.list && \
    apt-get update -qq

RUN apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr

RUN echo "Tesseract Version in Builder Stage:" >> /tess-versions && tesseract --version >> /tess-versions

# App stage
FROM ubuntu:20.04 as runner

COPY --from=builder /tess-versions /tess-versions

RUN rm -rf /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse'>>/etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse'>>/etc/apt/sources.list

RUN apt-get update \
  && apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr \
    mupdf \
    mupdf-tools

RUN apt-get install -y \
  tesseract-ocr-eng \
  tesseract-ocr-chi-sim

RUN echo "\nTesseract Version in Runner Stage:" >> /tess-versions && tesseract --version >> /tess-versions

CMD ["cat", "/tess-versions"]

│ [issue-205] Tesseract Version in Builder Stage:
│ [issue-205] tesseract 4.0.0
│ [issue-205]  leptonica-1.76.0
│ [issue-205]   libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
│ [issue-205]  Found AVX2
│ [issue-205]  Found AVX
│ [issue-205]  Found SSE
│ [issue-205]
│ [issue-205] Tesseract Version in Runner Stage:
│ [issue-205] tesseract 4.1.1
│ [issue-205]  leptonica-1.79.0
│ [issue-205]   libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
│ [issue-205]  Found AVX2
│ [issue-205]  Found AVX
│ [issue-205]  Found FMA
│ [issue-205]  Found SSE
│ [issue-205]  Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
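
The output above shows that the two stages report different libjpeg versions (6b with libjpeg-turbo 1.5.2 in the builder, 8d with libjpeg-turbo 2.0.3 in the runner). Whether that difference is the root cause of the error is not established in this thread, but it can be checked mechanically. The following is a minimal sketch in Go; the helper name `libjpegVersion` and the regular expression are assumptions made for illustration and are not part of gosseract.

```go
package main

import (
	"fmt"
	"regexp"
)

// libjpegRe captures the token that follows "libjpeg " in the leptonica
// line of `tesseract --version` output, e.g. "6b" or "8d".
// It does not match "libjpeg-turbo", because that name is followed by '-'.
var libjpegRe = regexp.MustCompile(`libjpeg (\S+)`)

// libjpegVersion (hypothetical helper) returns the libjpeg version found
// in the given version output, or "" if none is present.
func libjpegVersion(out string) string {
	m := libjpegRe.FindStringSubmatch(out)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	// Lines copied from the builder and runner output in this thread.
	builder := "libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36"
	runner := "libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37"

	b, r := libjpegVersion(builder), libjpegVersion(runner)
	if b != r {
		fmt.Printf("libjpeg mismatch: builder=%s runner=%s\n", b, r)
	}
}
```

A check like this could run at container start-up to flag a build/runtime mismatch early; it only compares reported versions and does not fix the conflict.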

h4ckitt commented 2 years ago

Hi, I have this exact issue and will be happy to provide any information needed to find a solution.

Like OP, I'm using go-fitz to convert the PDF to an image, then feeding it to gosseract.

Trey2k commented 1 year ago

I am running into this issue as well, and I am also using go-fitz. Would love to see a solution. Edit: the workaround is to encode the image as PNG instead of JPEG with go-fitz.

fpinna commented 1 year ago

Using PNG worked for me too. Thanks, @Trey2k.