radanalyticsio / openshift-spark


Thinner docker image #32

Closed jkremser closed 7 years ago

jkremser commented 7 years ago

This PR reduces the size of the image from 927 MB to ~~626~~ 683 MB. Because our other images are based on this one, it'll also reduce their size.

update: I've checked the modified image with the https://radanalytics.io/resources.yaml template and it works well.

update2: ..and the var notebook also worked well

@tmckayus could you please take a look?

erikerlandson commented 7 years ago

cool - where is the space savings coming from? the spark install looks equivalent. is it the consolidated yum installs?

jkremser commented 7 years ago

For some strange reason, this line increased the size by ~220 MB. All the script does is this: https://github.com/radanalyticsio/openshift-spark/blob/master/scripts/spark/install (some mv and chmod). I also checked it with docker save, exploring the tar file, and there were 2 layers that contained basically the same files.
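The duplication can be reproduced outside Docker. This is an illustrative sketch only (dummy file, dummy sizes): Docker's union filesystem does a copy-up on any change, even a metadata-only one, so the install script's mv/chmod layer re-stores the Spark files that the curl layer already contains.

```shell
# Simulate two image layers as two tars of the same directory.
mkdir -p layer1
dd if=/dev/zero of=layer1/spark.bin bs=1024 count=100 2>/dev/null
tar -cf l1.tar -C layer1 .   # "layer" created by curl + tar
chmod 755 layer1/spark.bin   # metadata-only change, like the install script...
tar -cf l2.tar -C layer1 .   # ...yet the next "layer" carries a full copy
ls -l l1.tar l2.tar          # both tars hold the entire file
```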

It's also visible on the docker history command:

λ docker history dc430995c1b2
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
dc430995c1b2        5 days ago          /bin/sh -c #(nop)  CMD ["/opt/spark/bin/la...   0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  ENTRYPOINT ["/entrypoint"]   0 B                
<missing>           5 days ago          /bin/sh -c #(nop) WORKDIR /tmp                  0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  USER [185]                   0 B                
<missing>           5 days ago          |2 DISTRO_LOC=https://archive.apache.org/d...   0 B                
<missing>           5 days ago          |2 DISTRO_LOC=https://archive.apache.org/d...   224 MB       <- HERE (2nd)       
<missing>           5 days ago          /bin/sh -c #(nop) COPY dir:1e62abd6f03d8cc...   445 kB              
<missing>           5 days ago          /bin/sh -c #(nop) COPY dir:926003162688cd8...   7.59 kB            
<missing>           5 days ago          /bin/sh -c #(nop)  ENV SPARK_HOME=/opt/spark    0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  ENV PATH=/usr/local/sbi...   0 B                
<missing>           5 days ago          |2 DISTRO_LOC=https://archive.apache.org/d...   124 MB              
<missing>           5 days ago          |2 DISTRO_LOC=https://archive.apache.org/d...   223 MB       <- HERE (1st)       
<missing>           5 days ago          |2 DISTRO_LOC=https://archive.apache.org/d...   162 MB              
<missing>           5 days ago          /bin/sh -c #(nop)  ARG DISTRO_NAME=spark-2...   0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  ARG DISTRO_LOC=https://...   0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  USER [root]                  0 B                
<missing>           5 days ago          /bin/sh -c #(nop)  MAINTAINER Matthew Farr...   0 B                
<missing>           12 days ago         /bin/sh -c #(nop)  CMD ["/bin/bash"]            0 B                
<missing>           12 days ago         /bin/sh -c #(nop)  LABEL name=CentOS Base ...   0 B                
<missing>           12 days ago         /bin/sh -c #(nop) ADD file:63492ba809361c5...   193 MB

The 1st '223 MB' is from when the curl + tar combo was used, so it's expected, but the second one appeared when running the install script.

Here are those 2 lines again, but with the full command (--no-trunc option):

<missing>      5 days ago          |2 DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz DISTRO_NAME=spark-2.1.0-bin-hadoop2.7 bash -x /tmp/scripts/spark/install                                                              224 MB
<missing>      5 days ago          |2 DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz DISTRO_NAME=spark-2.1.0-bin-hadoop2.7 /bin/sh -c cd /opt &&     curl $DISTRO_LOC  |         tar -zx &&     ln -s $DISTRO_NAME spark   223 MB

When doing everything inside one RUN command, it created just 1 layer :+1: Tomorrow, I'll double-check if the image still works as expected.
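The single-layer approach can be sketched as a Dockerfile fragment (generated here as text so it can be inspected without a build). The ARG values mirror the ones visible in the docker history output above; the install-script invocation is illustrative, not the repo's exact line.

```shell
# Write a hypothetical consolidated Dockerfile fragment: download, extract,
# symlink, and install all chained in one RUN, producing one layer.
cat > Dockerfile.sketch <<'EOF'
ARG DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
ARG DISTRO_NAME=spark-2.1.0-bin-hadoop2.7
RUN cd /opt \
 && curl $DISTRO_LOC | tar -zx \
 && ln -s $DISTRO_NAME spark \
 && bash -x /tmp/scripts/spark/install
EOF
grep -c '^RUN' Dockerfile.sketch   # one RUN directive -> one layer
```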

mattf commented 7 years ago

one thing we're trying to do is keep the structure of this dockerfile similar to one dogen would create.

does this change the structural alignment w/ dogen?

would it also reduce the size if https://github.com/radanalyticsio/openshift-spark/blob/master/Dockerfile#L6-L30 were simply moved to the install script?

jkremser commented 7 years ago

@mattf excuse my ignorance but what is dogen?

> would it also reduce the size if https://github.com/radanalyticsio/openshift-spark/blob/master/Dockerfile#L6-L30 were simply moved to the install script?

yes, that's another approach: having everything in a shell script and running that script from the Dockerfile with the RUN directive, although it would make the Dockerfile less transparent.
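That alternative can be sketched too. Everything here is hypothetical (the variable names mirror the build ARGs, the script body is illustrative): the Dockerfile would keep a single `RUN bash /tmp/scripts/spark/install`, and the script would carry the steps.

```shell
# Generate a hypothetical install script holding the curl/tar/symlink steps
# that would otherwise live in the Dockerfile.
cat > install.sketch <<'EOF'
#!/bin/bash
set -euo pipefail
cd /opt
curl "$DISTRO_LOC" | tar -zx
ln -s "$DISTRO_NAME" spark
# ...the mv and chmod steps from scripts/spark/install would follow here
EOF
chmod +x install.sketch
head -1 install.sketch
```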

jkremser commented 7 years ago

@mattf this is the Dockerfile that is generated by dogen running on this template. It's very different from the Dockerfile that is currently here in master: a different version of Spark and different packages. I think it shouldn't block this PR, because the template looks very outdated anyway.

However, I can put all those commands into a bash script so that it would be easier to keep the Dockerfile and the yaml template in sync. On the other hand, it won't be obvious what the Dockerfile does without looking into the script. The Dockerfiles (only the Dockerfiles, no external scripts) are, for instance, shown on Docker Hub: https://hub.docker.com/r/radanalyticsio/openshift-spark/~/dockerfile/

wdyt? And we should probably also wait for @tmckayus' opinion.

mattf commented 7 years ago

@Jiri-Kremser you're right that moving the RUNs into a script reduces transparency and is less idiomatic wrt Dockerfiles; it's more idiomatic for dogen. if the space savings are the same with both approaches i'm ok with keeping the RUNs in the Dockerfile. i want to avoid additional bloat if we move to dogen as the primary means of creating the image.

mattf commented 7 years ago

ok to merge and refactor