Use multistage Docker build for worker base docker

Panaetius commented 6 years ago

Currently the worker base image is around 3GB in size, meaning it takes quite long to install everything (longer than the 5 min default timeout of helm).

All the temporary files and build tools etc. are inside the docker, even though they are not needed later on.

We should do multistage builds, only copying what is needed from one stage to the next.

That should help reduce the size by a lot.

liehe commented 6 years ago

I run du -h -d 1 / in a container of image mlbench/mlbench_worker:mlbench-worker-base. The results are

32K     ./run
4.0K    ./opt
4.0K    ./tmp
50M     ./var
0       ./dev
0       ./sys
4.0K    ./boot
25M     ./lib
4.0K    ./home
du: cannot access './proc/23/task/23/fd/3': No such file or directory
du: cannot access './proc/23/task/23/fdinfo/3': No such file or directory
du: cannot access './proc/23/fd/4': No such file or directory
du: cannot access './proc/23/fdinfo/4': No such file or directory
12K     ./proc
4.0K    ./mnt
17M     ./root
3.5M    ./sbin
4.0K    ./media
2.3M    ./etc
4.0K    ./lib64
4.0K    ./srv
2.9G    ./usr
7.3M    ./bin
44K     ./.sshd
4.0K    ./app
14M     ./.openmpi
3.6G    ./conda
4.7M    ./vision
8.0K    ./ssh-key
6.5G    .

Most of the spaces are consumed by /usr and /conda directory. /usr is large because nvidia/cuda:9.0-cudnn7-devel-ubuntu16.04 is used as base image of mlbench/mlbench_worker:mlbench-worker-base. See

28K     /run
4.0K    /opt
4.0K    /tmp
8.6M    /var
0       /dev
0       /sys
4.0K    /boot
24M     /lib
4.0K    /home
du: cannot access '/proc/15/task/15/fd/3': No such file or directory
du: cannot access '/proc/15/task/15/fdinfo/3': No such file or directory
du: cannot access '/proc/15/fd/4': No such file or directory
du: cannot access '/proc/15/fdinfo/4': No such file or directory
12K     /proc
4.0K    /mnt
12K     /root
3.5M    /sbin
4.0K    /media
1.8M    /etc
4.0K    /lib64
4.0K    /srv
2.5G    /usr
7.3M    /bin
2.6G    /

This part is necessary in order to use nvidia's driver. As for the /conda, we have already used conda clean --all. So we cannot minimize it a lot if we want to keep large packages in the image, like PyTorch, torchvision, etc.

So we may not drastically reduce the size using multistage in this sense. Instead, it is even slower to build because we need to copy large directories.

Panaetius commented 6 years ago

I'm not sure we need to use conda, we could use pip instead and only copy the python site-packages folder to a new docker stage. That might already cut down on the size, since the conda lib folder is 1.1G, with many libraries we don't need, like Qt. This might also be due to us installing opencv, which has dependencies on ffmpeg (video encoding) and Qt (GUI Library), both of which we don't use. I'm not sure what we need opencv for, but removing it would already reduce size by 700Mb.

We can also remove all the dev packages installed in the beginning, since we don't need gcc and so on later on.

I don't think it's much slower. It's the base image, so we don't need to build it every 5 minutes, and at most, on a non-SSD, it'll add 2 minutes for copying 6Gb, 15 seconds on a modern SSD. But it makes cleaning up a lot easier than having to track everything ourselves.

I think realistically, we can reduce the size by 1-2 Gb

tlin-taolin commented 6 years ago

For opencv, we only use it for image = cv2.imdecode(x, cv2.IMREAD_COLOR).astype('uint8') when using the dataset in the form of lmdb. It is only for the case of ImageNet.

On Fri, Aug 24, 2018 at 9:45 AM Ralf Grubenmann notifications@github.com wrote:

I'm not sure we need to use conda, we could use pip instead and only copy the python site-packages folder to a new docker stage. That might already cut down on the size, since the conda lib folder is 1.1G, with many libraries we don't need, like Qt. This might also be due to us installing opencv, which has dependencies on ffmpeg (video encoding) and Qt (GUI Library), both of which we don't use. I'm not sure what we need opencv for, but removing it would already reduce size by 700Mb.

We can also remove all the dev packages installed in the beginning, since we don't need gcc and so on later on.

I don't think it's much slower. It's the base image, so we don't need to build it every 5 minutes, and at most, on a non-SSD, it'll add 2 minutes for copying 6Gb, 15 seconds on a modern SSD. But it makes cleaning up a lot easier than having to track everything ourselves.

I think realistically, we can reduce the size by 1-2 Gb

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/mlbench/mlbench/issues/43#issuecomment-415680192, or mute the thread https://github.com/notifications/unsubscribe-auth/AHcbfwWKhC9aKkG1QGwZp3GrAzvjLQkKks5uT68XgaJpZM4WHH4A .

mlbench / mlbench-old

Use multistage Docker build for worker base docker #43