waleedka / modern-deep-learning-docker

Modern Deep Learning Docker Image

FYI, clean and autoremove #4

Open sberryman opened 6 years ago

sberryman commented 6 years ago

https://github.com/waleedka/modern-deep-learning-docker/blob/36ae632f5b90af34458e196aea52406799139b93/Dockerfile#L133-L134

While it would be great if this reduced the image size, it actually has zero effect, because it runs in a separate layer. The only way this would help is if it were combined into the same RUN command that actually installs or updates packages.

In reality you would need to combine all of the apt-get update and install commands into a single RUN layer and then clean up at the end of that same command. Considering this is for research and testing it isn't a big deal, just figured I would point it out.
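A minimal sketch of what I mean (untested, and the package names are just placeholders, not taken from your Dockerfile):

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
        curl && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

This way the package lists and cache never make it into the layer in the first place, so there is nothing left over for a later layer to "remove".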

waleedka commented 6 years ago

@sberryman Thank you for pointing this out. You're right, using apt-get clean in a separate layer is useless. I looked into this further, and it turns out the standard Ubuntu image automatically calls clean and autoremove after every install, as mentioned in the official documentation. So I removed my explicit call.

Your second suggestion is correct as well. But instead of calling clean (which is being called automatically anyway), the best practice is to delete the apt-get cache with rm -rf /var/lib/apt/lists/*. Adding that line to every RUN is ugly, and according to docker history the apt-get cache adds 40MB, which is about 1% of the total image size, so it's not worth reducing the readability and flexibility of the code.
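For reference, docker history prints the size of every layer in the built image, which is where that 40MB figure comes from (the image tag here is just whatever you named it locally):

docker history --format "{{.Size}}\t{{.CreatedBy}}" modern-deep-learning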

I did merge related RUN blocks together, keeping a balance between optimization and readability. I'll keep this issue open for any future suggestions or to correct me if any of my conclusions are wrong. Thanks again for the tips.

sberryman commented 6 years ago

Looking a lot better, here are a few more things I noticed:

  1. You indicate you are installing TensorFlow 1.6.0 on line 46, but you are not specifying a version. As of right now you would be installing v1.6, but that is up to Google's release schedule. You may be in for a bit of a surprise if you try to build on the day they push 1.7, for example. Something more specific like https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp27-none-linux_x86_64.whl would probably be a better idea. This will help ensure a consistent build (a short pip sketch follows this list).
  2. Lines 60 and 61 for the OpenCV install are pulling from a git branch and doing it in two layers; why not pull down the release tarball instead? See Example 1 below.
  3. Same thing on line 81 for Caffe: pulling master is always risky unless you are deliberately after a bleeding-edge image that will eventually break.
  4. On line 111 you are pulling PyTorch via http, not https, and without validating a hash. I get that this is for research, but running this in production (which some people may be doing) leaves you open to an easy man-in-the-middle attack. An example of checking the shasum is shown below. The other benefit I've found of using environment variables for build versions is that they make it easy to see dependency versions in the built image: a simple docker inspect will show the variables. I've seen both a mix of env and checksum declared in the layer being built, and everything placed near the top of the Dockerfile.
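For item 1, a sketch of pinning the TensorFlow version (untested; adjust for whichever Python the image actually uses):

RUN pip install --no-cache-dir tensorflow==1.6.0
# or, pinning the exact wheel:
RUN pip install --no-cache-dir \
    https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.6.0-cp27-none-linux_x86_64.whl

Either way, rebuilding the image later gives you the same TensorFlow instead of whatever happens to be the latest release.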

Example 1 (haven't actually tested this, treat it as pseudo-code):

# Build OpenCV from a pinned, checksum-verified release tarball
RUN export OPENCV_VERSION=3.4.1 && \
    export OPENCV_CHECKSUM=f1b87684d75496a1054405ae3ee0b6573acaf3dad39eaf4f1d66fdd7e03dc852 && \
    curl --retry 7 --fail -vo /tmp/opencv.tar.gz "https://codeload.github.com/opencv/opencv/tar.gz/${OPENCV_VERSION}" && \
    echo "${OPENCV_CHECKSUM}  /tmp/opencv.tar.gz" | sha256sum -c && \
    tar -zxf /tmp/opencv.tar.gz -C /usr/local/src && \
    rm /tmp/opencv.tar.gz && \
    cd /usr/local/src/opencv-${OPENCV_VERSION} && \
    mkdir build && \
    cd build && \
    cmake -D CMAKE_INSTALL_PREFIX=/usr/local \
          -D BUILD_TESTS=OFF \
          -D BUILD_PERF_TESTS=OFF \
          -D PYTHON_DEFAULT_EXECUTABLE=$(which python3) \
          .. && \
    make -j"$(nproc)" && \
    make install
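A variant of the same idea (also untested): declare the version and checksum with ENV above the RUN instead of export inside it, and they will show up when you inspect the built image:

ENV OPENCV_VERSION=3.4.1 \
    OPENCV_CHECKSUM=f1b87684d75496a1054405ae3ee0b6573acaf3dad39eaf4f1d66fdd7e03dc852

# after building:
# docker inspect --format '{{.Config.Env}}' <image>

The trade-off is that ENV values persist in the final image, which is exactly what makes the dependency versions easy to audit.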

One last note: not doing an apt-get update (and cleaning up) in each layer where you install something leaves you open to inconsistent builds. Right now you are updating the sources in (basically) the first layer, so once you build the image locally, that layer is cached. If you then make a change at line 13, any successive layers will be rebuilt, but they will not be using updated sources, because the apt-get update came from the cached layer. I haven't come across this being a problem in any of my production deployments, but it is something to be cognizant of.
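To make that concrete, a contrived sketch of the failure mode (the package name is just an example):

# Cached after the first build, so the package lists are frozen at that point in time:
RUN apt-get update

# ...other instructions; editing any of them invalidates everything below...

# This layer gets rebuilt, but it installs against the stale lists from the
# cached layer above. The fix is to repeat apt-get update (and the cleanup)
# inside this same RUN.
RUN apt-get install -y --no-install-recommends libjpeg-dev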