mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

Reduce the build size for inference #1593

Closed decadance-dance closed 1 month ago

decadance-dance commented 1 month ago

🚀 The feature

At the moment, installing all the dependencies for `doctr/.[torch]` takes up a lot of disk space. My final Docker image is about 12 GB, even though I run the service with only one model; I doubt I need ~7 GB of dependencies to run inference with a single model. My suggestion is to add separate extras like `[torch-infer]` or `[tf-infer]` that install only the packages needed for inference, skipping anything required only for training and evaluation.
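A split like the one proposed could be sketched in the packaging metadata roughly as below. The extra names and the training-only package list are illustrative assumptions, not docTR's actual dependency layout:

```toml
# Hypothetical pyproject.toml fragment: split the torch backend into an
# inference-only extra and a full extra (training + evaluation on top).
[project.optional-dependencies]
torch-infer = [
    "torch>=2.0.0",
    "torchvision>=0.15.0",
]
torch = [
    # everything inference needs (self-referencing extra, supported by modern pip)...
    "doctr[torch-infer]",
    # ...plus tooling used only for training/evaluation (illustrative examples)
    "tensorboard",
    "matplotlib",
]
```

With that split, an inference image would run `pip install "doctr[torch-infer]"` and never pull in the training-only packages.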

Motivation, pitch

I'd like to build lighter images for inference purposes, installing fewer dependencies. It would save disk space and reduce build time.

Alternatives

There may also be dependencies that are installed but never actually used. If so, removing them from the package list would help reduce the size of the builds.

Additional context

My typical docker image:

FROM nvcr.io/nvidia/cuda:12.1.1-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    wget \
    git \
    # - Packages to run cv2
    ffmpeg \
    # - Packages to build Python
    tar make gcc zlib1g-dev libffi-dev libssl-dev liblzma-dev libbz2-dev libsqlite3-dev 

# Install Python
ARG PYTHON_VERSION=3.10.13
RUN wget https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tgz && \
    tar -zxf Python-$PYTHON_VERSION.tgz && \
    cd Python-$PYTHON_VERSION && \
    mkdir /opt/python/ && \
    ./configure --prefix=/opt/python && \
    make && \
    make install && \
    cd .. && \
    rm Python-$PYTHON_VERSION.tgz && \
    rm -r Python-$PYTHON_VERSION
ENV PATH=/opt/python/bin:$PATH

RUN git clone https://github.com/mindee/doctr.git
RUN pip3 install -e "doctr[torch]"

# ...

# ENTRYPOINT ["tail"]
# CMD ["-f","/dev/null"]
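As a side note on image size independent of docTR's extras: the build above compiles Python from source inside the final image, which leaves compilers, headers, and source trees behind in the layers. A sketch of a smaller variant using an official slim Python base (this assumes the CUDA runtime libraries bundled with the PyTorch pip wheels are sufficient for your setup; tags and versions are illustrative):

```dockerfile
# Sketch: slim Python base instead of building Python from source.
FROM python:3.10-slim

RUN apt-get update && apt-get install -y --no-install-recommends \
        git ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# --no-cache-dir keeps pip's wheel cache out of the image layer.
RUN git clone https://github.com/mindee/doctr.git \
    && pip install --no-cache-dir -e "doctr[torch]"
```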

felixdittrich92 commented 1 month ago

Hi @decadance-dance 👋, we have already split some dependencies into extras (this will be available with the next release). You could also take a look at https://github.com/felixdittrich92/OnnxTR, which is more optimized for plain inference :)
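For reference, a minimal OnnxTR usage sketch following the pattern its README describes; treat the install extras, import paths, and defaults as assumptions to verify against the project's own docs:

```python
# pip install "onnxtr[cpu]"   # extras name is an assumption; check the README
from onnxtr.io import DocumentFile
from onnxtr.models import ocr_predictor

# Mirrors docTR's high-level API: one predictor combining detection + recognition.
model = ocr_predictor()
doc = DocumentFile.from_images("sample.jpg")  # hypothetical input image
result = model(doc)
print(result.render())
```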

felixdittrich92 commented 1 month ago

#1551

decadance-dance commented 1 month ago

Hi @felixdittrich92, thanks, I had never seen the OnnxTR project before. I'm going to try it for sure.

felixdittrich92 commented 1 month ago

@decadance-dance yeah, I worked on it a bit last week and released it publicly on Friday ^^ There were some requests for an ONNX pipeline, and it's easier to keep it dedicated instead of blowing up docTR with a third "backend".