sdcb / PaddleSharp

.NET/C# binding for Baidu paddle inference library and PaddleOCR
Apache License 2.0

shutdown of the container #18

Closed TreF555 closed 2 years ago

TreF555 commented 2 years ago

Good afternoon. We use the following Dockerfile:

`FROM mcr.microsoft.com/dotnet/aspnet:6.0-focal as base

ENV DEBIAN_FRONTEND=noninteractive
ENV OPENCV_VERSION=4.6.0

WORKDIR /

# Install opencv dependencies
RUN apt-get update && apt-get -y install --no-install-recommends \
    apt-transport-https \
    software-properties-common \
    wget \
    unzip \
    ca-certificates \
    build-essential \
    cmake \
    git \
    libtbb-dev \
    libatlas-base-dev \
    libgtk2.0-dev \
    libavcodec-dev \
    libavformat-dev \
    libswscale-dev \
    libdc1394-22-dev \
    libxine2-dev \
    libv4l-dev \
    libtheora-dev \
    libvorbis-dev \
    libxvidcore-dev \
    libopencore-amrnb-dev \
    libopencore-amrwb-dev \
    libavresample-dev \
    x264 \
    libgdiplus \
    tesseract-ocr \
    tesseract-ocr-rus \
    imagemagick \
    libtesseract-dev \
    apt-utils \
    && apt-get -y clean \
    && rm -rf /var/lib/apt/lists/*

# Setup opencv and opencv-contrib source
RUN wget https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip && \
    unzip ${OPENCV_VERSION}.zip && \
    rm ${OPENCV_VERSION}.zip && \
    mv opencv-${OPENCV_VERSION} opencv && \
    wget https://github.com/opencv/opencv_contrib/archive/${OPENCV_VERSION}.zip && \
    unzip ${OPENCV_VERSION}.zip && \
    rm ${OPENCV_VERSION}.zip && \
    mv opencv_contrib-${OPENCV_VERSION} opencv_contrib

# Build OpenCV
RUN cd opencv && mkdir build && cd build && \
    cmake \
    -D OPENCV_EXTRA_MODULES_PATH=/opencv_contrib/modules \
    -D CMAKE_BUILD_TYPE=RELEASE \
    -D BUILD_SHARED_LIBS=OFF \
    -D ENABLE_CXX11=ON \
    -D BUILD_EXAMPLES=OFF \
    -D BUILD_DOCS=OFF \
    -D BUILD_PERF_TESTS=OFF \
    -D BUILD_TESTS=OFF \
    -D BUILD_JAVA=OFF \
    -D BUILD_opencv_app=OFF \
    -D BUILD_opencv_barcode=OFF \
    -D BUILD_opencv_java_bindings_generator=OFF \
    -D BUILD_opencv_js_bindings_generator=OFF \
    -D BUILD_opencv_python_bindings_generator=OFF \
    -D BUILD_opencv_python_tests=OFF \
    -D BUILD_opencv_ts=OFF \
    -D BUILD_opencv_js=OFF \
    -D BUILD_opencv_bioinspired=OFF \
    -D BUILD_opencv_ccalib=OFF \
    -D BUILD_opencv_datasets=OFF \
    -D BUILD_opencv_dnn_objdetect=OFF \
    -D BUILD_opencv_dpm=OFF \
    -D BUILD_opencv_fuzzy=OFF \
    -D BUILD_opencv_gapi=OFF \
    -D BUILD_opencv_intensity_transform=OFF \
    -D BUILD_opencv_mcc=OFF \
    -D BUILD_opencv_objc_bindings_generator=OFF \
    -D BUILD_opencv_rapid=OFF \
    -D BUILD_opencv_reg=OFF \
    -D BUILD_opencv_stereo=OFF \
    -D BUILD_opencv_structured_light=OFF \
    -D BUILD_opencv_surface_matching=OFF \
    -D BUILD_opencv_videostab=OFF \
    -D BUILD_opencv_wechat_qrcode=ON \
    -D WITH_GSTREAMER=OFF \
    -D WITH_ADE=OFF \
    -D OPENCV_ENABLE_NONFREE=ON \
    .. && make -j$(nproc) && make install && ldconfig

# Download OpenCvSharp
RUN git clone https://github.com/shimat/opencvsharp.git && cd opencvsharp

# Install the Extern lib.
RUN mkdir /opencvsharp/make && cd /opencvsharp/make && \
    cmake -D CMAKE_INSTALL_PREFIX=/opencvsharp/make /opencvsharp/src && \
    make -j$(nproc) && make install && \
    rm -rf /opencv && \
    rm -rf /opencv_contrib && \
    cp /opencvsharp/make/OpenCvSharpExtern/libOpenCvSharpExtern.so /usr/lib/

# set noninteractive installation
RUN export DEBIAN_FRONTEND=noninteractive

# install tzdata package
RUN apt-get install -y tzdata

# set your timezone
RUN ln -fs /usr/share/zoneinfo/Europe/Moscow /etc/localtime
RUN dpkg-reconfigure --frontend noninteractive tzdata
RUN echo Europe/Moscow > /etc/timezone

RUN apk del tzdata

RUN wget -q https://paddle-inference-lib.bj.bcebos.com/2.3.2/cxx_c/Linux/CPU/gcc8.2_avx_mkl/paddle_inference_c.tgz && \
    tar -xzf /paddle_inference_c.tgz && \
    find /paddle_inference_c -mindepth 2 -name *.so -print0 | xargs -0 -I {} mv {} /usr/lib && \
    ls /usr/lib/*.so && \
    rm -rf /paddle_inference_c && \
    rm paddle_inference_c.tgz

FROM mcr.microsoft.com/dotnet/sdk:6.0-focal AS build

WORKDIR /src
COPY ["Services/Image/Services.Image/Services.Image.csproj", "Services/Image/Services.Image/"]
COPY ["Core/Common/Common.csproj", "Core/Common/"]
RUN dotnet restore "Services/Image/Services.Image/Services.Image.csproj"
COPY . .
WORKDIR "/src/Services/Image/Services.Image"
RUN dotnet build "Services.Image.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "Services.Image.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .

WORKDIR /app/x64
RUN ln -s /usr/lib/x86_64-linux-gnu/libdl-2.31.so libdl.so
RUN ln -s /usr/lib/x86_64-linux-gnu/liblept.so.5 liblept.so.5
RUN ln -s /usr/lib/x86_64-linux-gnu/liblept.so.5 libleptonica-1.80.0.so
RUN ln -s /usr/lib/x86_64-linux-gnu/libtesseract.so.4.0.1 libtesseract41.so

WORKDIR /app
ENTRYPOINT ["dotnet", "Services.Image.dll"]`

With this image, the container shuts down completely when we send a large file for text recognition. If we send a small file for processing, everything works and the container does not "fall". The container logs contain these errors:

Error in boxClipToRectangle: box outside rectangle
Error in pixScanForForeground: invalid box

error

  1. How can we reduce memory consumption while it is running?
  2. How can we prevent the "fall" of the container?

sdcb commented 2 years ago

Try lowering the Detector MaxSize: all.Detector.MaxSize = 800; // default: 2048

sdcb commented 2 years ago

Are you in Moscow? It shouldn't be using ~40 GB of memory. Can I check your Services.Image.dll?

TreF555 commented 2 years ago

I am from Moscow. I can't submit the whole code for verification, but I can tell you how it works in general:

  1. The input is a PDF file with many pages.
  2. Each page is written to a temporary file, in parallel.
  3. Each temporary file is passed for recognition; we then wait for all of them to finish and output the combined result for all files.

The machine has 24 cores and 50 gigabytes of memory. If you specify particular pages, for example 3, 4, 5, 6, the container does not crash and works. The problem arises if you process all the pages of the source file, and there may be 100, 200 or more. For the parallel work, maxDegreeOfParallelism = Environment.ProcessorCount / 2 is currently set, i.e., in principle, the server is half loaded.

sdcb commented 2 years ago

It seems you're not using the MKLDNN library (it may have fallen back to openblas, which uses only 1 thread when running OCR). The server is unlikely to be only half loaded if you are using MKLDNN:

  1. Uninstall the openblas NuGet package (not required if it is not installed)
  2. Install the NuGet package: https://www.nuget.org/packages/Sdcb.PaddleInference.runtime.win64.mkl
  3. Set PaddleConfig.Defaults.UseMkldnn = true; at the very beginning of your code

After that, you can set maxDegreeOfParallelism to a lower value.

The main reason for the OutOfMemory is that a single OCR job consumes a very large amount of memory.
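
Since peak memory is roughly the per-job memory multiplied by the number of jobs running at once, capping the parallelism caps the total. A minimal sketch of that idea, using only standard .NET types: `BoundedOcr`, `RecognizeAll` and the `recognizePage` delegate are hypothetical names standing in for whatever actually runs OCR on one page file, and nothing here is taken from the PaddleSharp API.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BoundedOcr
{
    // recognizePage is a hypothetical stand-in for the code that runs OCR
    // on a single temporary page file and returns the recognized text.
    public static IReadOnlyDictionary<string, string> RecognizeAll(
        IEnumerable<string> pageFiles,
        Func<string, string> recognizePage,
        int maxDegreeOfParallelism)
    {
        var results = new ConcurrentDictionary<string, string>();

        // At most maxDegreeOfParallelism pages are processed at the same time,
        // so at most that many OCR jobs hold memory simultaneously.
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism
        };

        Parallel.ForEach(pageFiles, options, pageFile =>
        {
            results[pageFile] = recognizePage(pageFile);
        });

        return results;
    }
}
```

It could be called as `BoundedOcr.RecognizeAll(pageFiles, RunOcr, 5)`, where `RunOcr` is your own method. Whether one recognizer instance may be shared between threads is not covered in this thread, so the delegate should take care of that itself.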

TreF555 commented 2 years ago

Thanks for the answer. For now we have decided to reduce the number of pages processed in parallel. With no more than 5 pages at a time, the container still holds up and does not "fall". Regarding "1 OCR job consumes very large of memory": will anything be fixed?

sdcb commented 2 years ago

No, because it's consuming the expected amount of memory, but you can lower the amount by specifying all.Detector.MaxSize = 800 (default is 2048).

Or use openblas instead of mkldnn.
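
To get a feel for how much a lower MaxSize can matter, here is a small arithmetic sketch. It assumes that MaxSize caps the longer side of the image before detection and never enlarges it, and that memory grows roughly with the pixel count. Both are assumptions for illustration, not statements from this thread. `DetectorScale` and `Fit` are hypothetical helpers, not part of PaddleSharp.

```csharp
using System;

public static class DetectorScale
{
    // Assumption: an image is shrunk so that its longer side is at most
    // maxSize, keeping the aspect ratio; smaller images are left unchanged.
    public static (int Width, int Height) Fit(int width, int height, int maxSize)
    {
        int longSide = Math.Max(width, height);
        if (longSide <= maxSize)
        {
            return (width, height);
        }

        double scale = (double)maxSize / longSide;
        return ((int)Math.Round(width * scale), (int)Math.Round(height * scale));
    }

    // Ratio of pixel counts after fitting the same image to two limits.
    public static double PixelRatio(int width, int height, int smallMax, int largeMax)
    {
        var small = Fit(width, height, smallMax);
        var large = Fit(width, height, largeMax);
        return (double)small.Width * small.Height / ((double)large.Width * large.Height);
    }
}
```

For a page whose longer side exceeds 2048 pixels, `DetectorScale.PixelRatio(width, height, 800, 2048)` comes out close to (800 / 2048)², about 0.15, so under these assumptions the detector would handle roughly six to seven times fewer pixels per page.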