Adding micromamba to an existing docker image documentation tip

wwood commented 9 months ago

Hi,

I wanted to create a small(ish) image from micromamba, which I managed to accomplish using a multi-stage build - in particular using

FROM scratch
COPY --from=0 / /

at the end of my Dockerfile, and got an image ~1.2GB in size.

Unfortunately this means the environment gets ruined somewhat, since image settings such as ENV, USER, SHELL and ENTRYPOINT do not carry over into the new stage. So I followed the instructions at https://micromamba-docker.readthedocs.io/en/latest/advanced_usage.html#adding-micromamba-to-an-existing-docker-image and that made it work again.

However, then the image was 2.2GB. This was due to these directives:

COPY --from=micromamba "$MAMBA_EXE" "$MAMBA_EXE"
COPY --from=micromamba /usr/local/bin/_activate_current_env.sh /usr/local/bin/_activate_current_env.sh
COPY --from=micromamba /usr/local/bin/_dockerfile_shell.sh /usr/local/bin/_dockerfile_shell.sh
COPY --from=micromamba /usr/local/bin/_entrypoint.sh /usr/local/bin/_entrypoint.sh
COPY --from=micromamba /usr/local/bin/_dockerfile_initialize_user_accounts.sh /usr/local/bin/_dockerfile_initialize_user_accounts.sh
COPY --from=micromamba /usr/local/bin/_dockerfile_setup_root_prefix.sh /usr/local/bin/_dockerfile_setup_root_prefix.sh

which for me were superfluous, since I'd already done COPY --from=0 / /, but they added layers, causing the image size increase.

Maybe a comment can be made above these COPYs? Or even better, if we are interested in smaller images, then maybe an example Dockerfile could be given in the docs e.g.

FROM mambaorg/micromamba:1.5.6

# ... insert user-specific installs here

RUN micromamba clean -afy
FROM scratch
COPY --from=0 / /

# Actually unsure if this is needed?
USER root

# if your image defaults to a non-root user, then you may want to make the
# next 3 ARG commands match the values in your image. You can get the values
# by running: docker run --rm -it my/image id -a
ARG MAMBA_USER=mambauser
ARG MAMBA_USER_ID=57439
ARG MAMBA_USER_GID=57439
ENV MAMBA_USER=$MAMBA_USER
ENV MAMBA_ROOT_PREFIX="/opt/conda"
ENV MAMBA_EXE="/bin/micromamba"

# Actually unsure if these are needed?
# RUN /usr/local/bin/_dockerfile_initialize_user_accounts.sh && \
#     /usr/local/bin/_dockerfile_setup_root_prefix.sh

USER $MAMBA_USER

SHELL ["/usr/local/bin/_dockerfile_shell.sh"]

ENTRYPOINT ["/usr/local/bin/_entrypoint.sh"]
# Optional: if you want to customize the ENTRYPOINT and have a conda
# environment activated, then do this:
# ENTRYPOINT ["/usr/local/bin/_entrypoint.sh", "my_entrypoint_program"]

# You can modify the CMD statement as needed....
CMD ["/bin/bash"]

wholtz commented 9 months ago

Hello @wwood

It is unclear to me what you are trying to accomplish. It seems to me that at the end of your first build stage, you had the image you wanted.

What do you gain from doing these two operations?

FROM scratch
COPY --from=0 / /

wwood commented 9 months ago

Thanks for the quick response. Doing that reduces the size of the image pretty dramatically for me (and I imagine for many/most others too) because it removes the layer history.

wholtz commented 9 months ago

I'd love to see complete examples.

Here is what I just tried:

FROM mambaorg/micromamba:1.5.6
RUN micromamba install -y -n base -c conda-forge \
       pyopenssl=20.0.1 \
       python=3.9.1 \
       requests=2.25.1 && \
    micromamba clean --all --yes

resulting in

REPOSITORY                                    TAG                  IMAGE ID        CREATED               PLATFORM          SIZE         BLOB SIZE
issue405                                      single_stage         3668a4d9e5f0    5 seconds ago         linux/arm64       278.3 MiB    80.0 MiB

and then I tried

FROM mambaorg/micromamba:1.5.6
RUN micromamba install -y -n base -c conda-forge \
       pyopenssl=20.0.1 \
       python=3.9.1 \
       requests=2.25.1 && \
    micromamba clean --all --yes

FROM scratch
COPY --from=0 / /

resulting in

REPOSITORY                                    TAG                  IMAGE ID        CREATED               PLATFORM          SIZE         BLOB SIZE
issue405                                      scratch              e70df0fe75dc    About a minute ago    linux/arm64       278.1 MiB    79.9 MiB

As expected, copying to scratch results in a very slightly smaller image. I imagine this is due to a reduction in the metadata for the layers. I don't find this size difference to be compelling.

wwood commented 9 months ago

Sure. The sizes don't quite match what I said above because I simplified the Dockerfile, but here it is. Sorry for the remaining complexity.

FROM mambaorg/micromamba:1.5.6

# Don't need all of the dependencies of singlem, because only pipe is going to be run.
COPY --chown=$MAMBA_USER:$MAMBA_USER env.yaml /tmp/env.yaml
RUN micromamba install -y -n base -f /tmp/env.yaml && \
    micromamba clean --all --yes

# (otherwise python will not be found)
ARG MAMBA_DOCKERFILE_ACTIVATE=1

# NOTE: The following 2 hashes should be changed in sync.
ENV SINGLEM_COMMIT=b27c15b0
ENV SINGLEM_VERSION=0.16.0-dev4
RUN rm -rf singlem && git init singlem && cd singlem && git remote add origin https://github.com/wwood/singlem && git fetch origin && git checkout $SINGLEM_COMMIT
RUN echo '__version__ = "'$SINGLEM_VERSION.${SINGLEM_COMMIT}'"' >singlem/singlem/version.py

# Remove bundled singlem packages
RUN rm -rfv singlem/singlem/data singlem/.git singlem/test singlem/appraise_plot.png

RUN pip install --no-dependencies kingfisher graftm

# Diamond - go via direct because conda-forge version is likely slower on
# account of not being compiled appropriately. Also, the conda version installs
# BLAST, which takes up space and we don't need.
RUN cd /tmp && curl -L 'https://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz' -O 
RUN cd /tmp && \
    tar xf diamond-linux64.tar.gz && \
    cp diamond /opt/conda/bin/ && \
    rm diamond-linux64.tar.gz diamond

# Effectively add singlem to the PATH
RUN ln -s /tmp/singlem/bin/singlem /opt/conda/bin/singlem

RUN micromamba remove git -y
RUN micromamba clean -afy

# Test it out
# COPY --chown=$MAMBA_USER:$MAMBA_USER SRR8653040.sra /tmp/
# RUN singlem pipe --sra-files /tmp/SRR8653040.sra --no-assign-taxonomy --metapackage /mpkg --archive-otu-table /tmp/a.json --threads 4
# RUN rm /tmp/SRR8653040.sra /tmp/a.json

# Remove all the build dependencies / image layers for a smaller image overall
# FROM scratch
# COPY --from=0 / /

To build you need env.yaml

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python>=3.7
  - biopython
  - hmmer
  - orfm
  - extern
  - sra-tools
  - ncbi-ngs-sdk
  - pip
  - pandas # hopefully not needed for pipe --no-assign-taxonomy
  - bird_tool_utils_python>=0.4.1
  - zenodo_backpack
  - sracat # usually installed via kingfisher, but we don't want all the kingfisher deps
  - sqlalchemy
  - git
  - aria2 >=1.36.0 # For kingfisher aws-http

Uncommenting the last 2 lines of the dockerfile changes the size from 1.8 to 1.0GB

REPOSITORY                                                                         TAG                             IMAGE ID       CREATED          SIZE
<none>                                                                             <none>                          fe55dfea5340   33 seconds ago   1.01GB
<none>                                                                             <none>                          1f7fb38237a7   50 seconds ago   1.84GB

wholtz commented 9 months ago

I can't get your image to build:

180.4 error    libmamba response code: -1 error message: Invalid argument
180.4 critical libmamba failed to execute pre/post link script for sra-tools

This may be because I'm on a mac with an ARM processor, so I am using emulation, as the bioconda packages are amd64 only.

But I'm pretty sure much of what you are seeing is due to how you set up the layers. Adding files in one RUN ... command and then deleting them in another RUN ... command is going to bloat your image, because the earlier layer still stores the files. Combine them into a single RUN ... and the files that you added and then removed will not contribute to your layers. Your use of

FROM scratch
COPY --from=0 / /

is effectively cleaning up the inefficiencies you generated by adding and then deleting files in separate layers.
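
As a minimal sketch of what combining steps could look like, the two diamond RUN instructions from the Dockerfile above might be merged as follows. This is a fragment rather than a complete Dockerfile, and it assumes curl is on the PATH at that point, just as the original steps do:

```dockerfile
# Fragment: stands in for the two diamond RUN steps in the Dockerfile above.
# Assumes curl is on the PATH at this point, as the original steps do.
RUN cd /tmp && \
    curl -L 'https://github.com/bbuchfink/diamond/releases/download/v2.1.8/diamond-linux64.tar.gz' -O && \
    tar xf diamond-linux64.tar.gz && \
    cp diamond /opt/conda/bin/ && \
    rm diamond-linux64.tar.gz diamond
```

Because the tarball is downloaded and deleted within the same RUN, it is gone before the layer is committed and should not add to the image size.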

wwood commented 9 months ago

Yes, I imagine you are correct; that would probably work too, but only in limited circumstances.

For instance, it isn't possible to COPY a file in, test that the program works, and then delete that file. It is also just annoying to develop a Dockerfile with tens of && entries, because iterating takes longer since layers can't be reused.

The last two lines feel like a more general solution to me (after adding the extras mentioned in my initial comment).

However, this is just my 2c - I only raised this issue as a suggestion, so feel free to ignore. Thanks for the great work with micromamba-docker.

maresb commented 9 months ago

@wwood I understand your frustration about the clunkiness of batching commands with && and being unable to delete files after they've been added.

For instance it isn't possible to COPY a file in, test that the program works, and then delete that file.

It's possible to add a test stage to your Dockerfile so that the test stuff isn't in your main stage. Alternatively, you could run the tests in a separate step after building.

But ultimately the scripts in question that you're pointing out are only a few kilobytes, so there's no way they're causing your images to become large.
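
A minimal sketch of the test-stage idea, reusing the small example from earlier in this thread. The stage names main and test and the file smoke_test.py are hypothetical, chosen only for illustration:

```dockerfile
# Stage names "main" and "test" and the file smoke_test.py are hypothetical.
FROM mambaorg/micromamba:1.5.6 AS main
RUN micromamba install -y -n base -c conda-forge \
       python=3.9.1 \
       requests=2.25.1 && \
    micromamba clean --all --yes

# The test stage starts from main, so anything copied or created here
# lives only in the test image and never becomes a layer of main.
FROM main AS test
# Activate the base environment for RUN commands so python is found.
ARG MAMBA_DOCKERFILE_ACTIVATE=1
COPY --chown=$MAMBA_USER:$MAMBA_USER smoke_test.py /tmp/smoke_test.py
RUN python /tmp/smoke_test.py
```

Running docker build --target test . would execute the test, while docker build --target main -t my/image . would produce the image to ship, without the test file in any of its layers.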

wwood commented 9 months ago

It's possible to add a test stage to your Dockerfile so that the test stuff isn't in your main stage. Alternatively, you could run the tests in a separate step after building.

Didn't know this - thanks for the tip.

mamba-org / micromamba-docker

Adding micromamba to an existing docker image documentation tip #405