Open lzw5399 opened 4 years ago
FYI
# Build stage
FROM golang:1.15 as builder
ENV GO111MODULE=on
RUN rm -rf /etc/apt/sources.list && \
    echo "deb https://mirrors.tuna.tsinghua.edu.cn/debian/ buster main contrib non-free" >> /etc/apt/sources.list && \
    apt-get update -qq
RUN apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr
RUN echo "Tesseract Version in Builder Stage:" >> /tess-versions && tesseract --version >> /tess-versions
# App stage
FROM ubuntu:20.04 as runner
COPY --from=builder /tess-versions /tess-versions
RUN rm -rf /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse' >> /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse' >> /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse' >> /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse' >> /etc/apt/sources.list && \
    echo 'deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse' >> /etc/apt/sources.list
RUN apt-get update \
    && apt-get install -y \
    libleptonica-dev \
    libtesseract-dev \
    tesseract-ocr \
    mupdf \
    mupdf-tools
RUN apt-get install -y \
    tesseract-ocr-eng \
    tesseract-ocr-chi-sim
RUN echo "\nTesseract Version in Runner Stage:" >> /tess-versions && tesseract --version >> /tess-versions
CMD ["cat", "/tess-versions"]
│ [issue-205] Tesseract Version in Builder Stage:
│ [issue-205] tesseract 4.0.0
│ [issue-205] leptonica-1.76.0
│ [issue-205] libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.2) : libpng 1.6.36 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
│ [issue-205] Found AVX2
│ [issue-205] Found AVX
│ [issue-205] Found SSE
│ [issue-205]
│ [issue-205] Tesseract Version in Runner Stage:
│ [issue-205] tesseract 4.1.1
│ [issue-205] leptonica-1.79.0
│ [issue-205] libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 2.0.3) : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.1
│ [issue-205] Found AVX2
│ [issue-205] Found AVX
│ [issue-205] Found FMA
│ [issue-205] Found SSE
│ [issue-205] Found libarchive 3.4.0 zlib/1.2.11 liblzma/5.2.4 bz2lib/1.0.8 liblz4/1.9.2 libzstd/1.4.4
Hi, I have this exact issue and will be happy to provide any information needed to find a solution.
Like OP, I'm using go-fitz to convert the PDF to an image, then feeding it to gosseract.
I am running into this issue as well. I am also using go-fitz. Would love to see a solution. Edit: Workaround is to just encode it to a PNG instead of JPG with go-fitz.
Using PNG worked for me too. Thanks @Trey2k
Summary
Hi, I wrote an ocrserver based on gosseract (the frontend page is based on https://github.com/otiai10/ocrserver). Everything described below can be found in https://github.com/lzw5399/ocrserver.
There is a requirement to OCR a PDF received as a base64 string, so I added the https://github.com/gen2brain/go-fitz dependency to convert the PDF to images (PNG) page by page, and then use gosseract to recognize each image. But after go-fitz was added, I found that the JPEG-related functionality no longer worked well. It seems to be a version conflict. Thanks in advance for any help ^-^
client.SetImageFromBytes(bytes) returns an error.
Inside the container (via docker exec), I ran find -name '*libjpeg*', but I can't find any files related to libjpeg 90. This really confuses me.
Reproducibility
Reproducibility Frequency
Reproducible Dockerfile
Environment