LuchiLucs opened 1 year ago
Please post the Dockerfiles :)
Here are the two Dockerfiles.

Using micromamba:

```dockerfile
# To inspect micromamba through Docker:
# 1) docker run -it mambaorg/micromamba:1.3.1-bullseye-slim /bin/sh
# 2) $ micromamba --help
ARG BASE_IMAGE=mambaorg/micromamba:1.3.1-bullseye-slim
FROM ${BASE_IMAGE}
# Copy dependencies list
COPY --chown=$MAMBA_USER:$MAMBA_USER environment.yaml /tmp/environment.yaml
USER root
# tools needed to install the Microsoft ODBC driver for Microsoft SQL Server (requirements):
# https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server
RUN set -eux \
&& buildDeps=' \
gnupg \
curl \
gcc \
' \
&& apt-get update \
&& apt-get install -y --no-install-recommends $buildDeps \
&& curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& curl https://packages.microsoft.com/config/debian/11/prod.list > /etc/apt/sources.list.d/mssql-release.list \
&& apt-get update \
&& ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql17 \
&& apt-get install -y --no-install-recommends unixodbc-dev \
# clean up
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get purge -y --auto-remove $buildDeps
# install app deps
USER $MAMBA_USER
RUN micromamba install --yes --name base --no-pyc --file /tmp/environment.yaml && \
micromamba clean --all --force-pkgs-dirs --yes
# Copy all app files
WORKDIR /app
COPY . .
CMD ["python", "app.py"]
```

Using pip:

```dockerfile
ARG PYTHON_VERSION=3.9.16
ARG DEBIAN_VERSION=bullseye
ARG BASE_IMAGE=python:${PYTHON_VERSION}-slim-${DEBIAN_VERSION}
# Set virtual environment path
ARG VIRTUAL_ENV=/opt/venv
FROM ${BASE_IMAGE} as build
# Make the ARG variable defined outside a FROM statement available inside
# https://docs.docker.com/engine/reference/builder/#understand-how-arg-and-from-interact
ARG VIRTUAL_ENV
# tools needed to build psycopg from source:
# https://www.psycopg.org/docs/install.html#build-prerequisites
RUN apt-get update \
&& apt-get install -y --no-install-recommends gcc python3-dev libpq-dev apt-utils
# Copy dependencies list
COPY requirements.txt dependencies.txt
# Create virtual environment without bootstrapped pip
#!!! TODO: check if setuptools is bootstrapped, if yes then should be deleted to optimize image size
# https://docs.python.org/3/library/venv.html
RUN python -m venv --without-pip ${VIRTUAL_ENV}
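# NOTE: --without-pip skips ensurepip entirely, so neither pip nor setuptools
# is bootstrapped into the venv and site-packages starts out empty; a quick,
# temporary check (hypothetical, for debugging only) would be:
# RUN ls ${VIRTUAL_ENV}/lib/python3.9/site-packages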
# tools needed to build requirements from source:
# https://docs.scipy.org/doc//scipy-1.4.1/reference/building/linux.html
# https://numpy.org/doc/stable/user/building.html
RUN set -eux \
&& buildScientificPackagesDeps=' \
build-essential \
cmake \
ninja-build \
gfortran \
pkg-config \
python3-dev \
libopenblas-dev \
liblapack-dev \
autoconf \
automake \
libatlas-base-dev \
# WIP: python3-ply, libffi-dev, and unixodbc-dev are needed to experiment with building packages other than numpy and scipy from source
python3-ply \
libffi-dev \
unixodbc-dev \
' \
&& apt-get update \
&& apt-get install -y --no-install-recommends $buildScientificPackagesDeps \
&& pip install --upgrade --no-cache-dir pip wheel setuptools Cython meson-python pythran pybind11
# Use virtual environment (persistent in the final image):
ENV PATH=$VIRTUAL_ENV/bin:$PATH
ENV PYTHONHOME=
# Install dependencies list
# --prefix
# used to install inside the virtual environment path
# --no-cache-dir
# used to avoid using cache for packages (decrease image size)
# --no-compile
# used to avoid compiling python .py files to bytecode .pyc (decrease image size - bytecode is generated at first run when importing modules)
# # TODO: check if runtime performance is affected
# # TODO: check if import just the functions that are needed in src is a good practice
# --use-pep517 --check-build-dependencies --no-build-isolation
# used to solve https://github.com/pypa/pip/issues/8559
# "# DEPRECATION: psycopg2 is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed"
# --compile --global-option=build_ext --global-option=-g0 --global-option=-Wl
# used to pass flags to C compiler and compile to bytecode from source, see:
# https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
# https://blog.mapbox.com/aws-lambda-python-magic-e0f6a407ffc6
#
# https://pip.pypa.io/en/stable/cli/pip_install/#options
RUN CFLAGS="-g0 -Wl,--strip-all" \
pip install --prefix=${VIRTUAL_ENV} --no-cache-dir --ignore-installed \
--requirement dependencies.txt \
--use-pep517 --no-build-isolation --config-settings="build_ext=-j4" \
--no-binary numpy,scipy \
&& pip cache purge
FROM ${BASE_IMAGE} as runtime
# runtime requirements of psycopg:
# https://www.psycopg.org/docs/install.html#runtime-requirements
RUN set -eux \
&& apt-get update \
&& apt-get install -y --no-install-recommends libpq5 libopenblas0 liblapack3 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
# tools needed to install the Microsoft ODBC driver for Microsoft SQL Server (requirements):
# https://learn.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server
RUN set -eux \
&& buildDeps=' \
gnupg \
curl \
gcc \
' \
&& apt-get update \
&& apt-get install -y --no-install-recommends $buildDeps \
&& curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - \
&& curl https://packages.microsoft.com/config/debian/11/prod.list > /etc/apt/sources.list.d/mssql-release.list \
&& apt-get update \
&& ACCEPT_EULA=Y apt-get install -y --no-install-recommends msodbcsql17 \
&& apt-get install -y --no-install-recommends unixodbc-dev \
# clean up
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get purge -y --auto-remove $buildDeps
ARG VIRTUAL_ENV
WORKDIR /app
COPY . .
COPY --from=build ${VIRTUAL_ENV} ${VIRTUAL_ENV}
# Use virtual environment (persistent in the final image):
ENV PATH=$VIRTUAL_ENV/bin:$PATH
ENV PYTHONHOME=
# Executable to be run
CMD ["python", "app.py"]
```
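For reference, this is roughly how I build the two variants and compare the resulting sizes (the Dockerfile and tag names below are placeholders, not my actual ones):

```shell
# Build each variant from its own Dockerfile...
docker build -f Dockerfile.micromamba -t app:micromamba .
docker build -f Dockerfile.pip -t app:pip .

# ...then compare: the SIZE column reports the uncompressed size of each image.
docker images app
```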
It looks like you're missing some packages in the pip build, e.g. psycopg2?
But in any case, it might very well be that the Conda environment is bigger because Conda has no concept of optional dependencies, so most packages bundle their optional dependencies by default.
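One quick way to test that theory would be to look at what actually takes up space inside the Conda environment of the built image, e.g. (a sketch; the image tag is a placeholder, and /opt/conda is assumed as the default root prefix of the mambaorg/micromamba base image):

```shell
# List the 20 largest site-packages entries inside the micromamba-based image.
docker run --rm app:micromamba \
  bash -c 'du -sh /opt/conda/lib/python*/site-packages/* | sort -h | tail -n 20'
```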
No, psycopg2 is inside the requirements.txt file; its build and runtime deps are managed separately through apt-get.
If there is no way to install only the required packages with conda/mamba/micromamba, my only solution is to stick with pip, is that right?
Yes. I wonder if it's worth it, though: your Dockerfiles are already really, really complicated.
I value the simplicity of conda/micromamba a lot, which is why I wanted to give it a try, but the image size difference is huge. Thanks anyway!
It would be interesting to run something like dive (https://github.com/wagoodman/dive) to see what the large files are...
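If you don't want to install it locally, the dive README also documents running it from its own image, roughly like this (the analyzed tag is a placeholder):

```shell
# Run dive from its container by mounting the Docker socket,
# then browse the target image layer by layer.
docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  wagoodman/dive:latest app:micromamba
```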
@wolfv Thanks for the suggestion, this would be a good exercise. For my needs, however, the ideal would be a tool that builds the image from scratch, inserting only what is needed, rather than the other way around, removing the extras afterwards. I think that in the long run the removal approach leads to more inconsistencies and errors.
Troubleshooting docs
Search tried in issue tracker: docker image size compared to pip
Latest version of Mamba
Tried in Conda? Reproducible with Conda
Describe your issue
I have successfully built two Docker images of the same Python application, with a requirements.txt/environment.yml file containing the runtime dependencies. The first solution uses apt-get to install system requirements and then pip to install the application's requirements. The second solution uses just micromamba to resolve the requirements and their dependencies. The first solution results in an image of around 790MB; if I also compile numpy and scipy from source, avoiding bundling two copies of the OpenBLAS library, for instance, I save another 40MB, resulting in a 750MB image. The second solution, using micromamba, results in a 2.2GB image. Both solutions clear the caches of installed packages (e.g. the apt-get, pip, and micromamba caches). What could cause this difference? With the first solution I use a multi-stage build to install the required packages and then copy just those into the second stage of the image. Maybe micromamba bundles build and runtime deps together?
Would you mind helping me understand how to proceed further and how to resolve this problem? I would like to use micromamba and still have similar sizes for the two Docker images.
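For what it's worth, this is roughly how I look at where the size goes, layer by layer (the tags are placeholders for my actual image names):

```shell
# Show the size contributed by each layer/instruction of the two images.
docker history app:pip
docker history app:micromamba
```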