jkremser closed this 7 years ago
cool - where is the space savings coming from? the spark install looks equivalent. is it the consolidated yum installs?
For some strange reason this line increased the size by ~220 MB. All the script does is this: https://github.com/radanalyticsio/openshift-spark/blob/master/scripts/spark/install (some `mv` and `chmod`). I was also checking it with `docker save` and exploring the tar file, and there were 2 layers that contained basically the same files. It's also visible in the `docker history` output:
```
λ docker history dc430995c1b2
IMAGE         CREATED      CREATED BY                                     SIZE     COMMENT
dc430995c1b2  5 days ago   /bin/sh -c #(nop) CMD ["/opt/spark/bin/la...   0 B
<missing>     5 days ago   /bin/sh -c #(nop) ENTRYPOINT ["/entrypoint"]   0 B
<missing>     5 days ago   /bin/sh -c #(nop) WORKDIR /tmp                 0 B
<missing>     5 days ago   /bin/sh -c #(nop) USER [185]                   0 B
<missing>     5 days ago   |2 DISTRO_LOC=https://archive.apache.org/d...  0 B
<missing>     5 days ago   |2 DISTRO_LOC=https://archive.apache.org/d...  224 MB   <- HERE (2nd)
<missing>     5 days ago   /bin/sh -c #(nop) COPY dir:1e62abd6f03d8cc...  445 kB
<missing>     5 days ago   /bin/sh -c #(nop) COPY dir:926003162688cd8...  7.59 kB
<missing>     5 days ago   /bin/sh -c #(nop) ENV SPARK_HOME=/opt/spark    0 B
<missing>     5 days ago   /bin/sh -c #(nop) ENV PATH=/usr/local/sbi...   0 B
<missing>     5 days ago   |2 DISTRO_LOC=https://archive.apache.org/d...  124 MB
<missing>     5 days ago   |2 DISTRO_LOC=https://archive.apache.org/d...  223 MB   <- HERE (1st)
<missing>     5 days ago   |2 DISTRO_LOC=https://archive.apache.org/d...  162 MB
<missing>     5 days ago   /bin/sh -c #(nop) ARG DISTRO_NAME=spark-2...   0 B
<missing>     5 days ago   /bin/sh -c #(nop) ARG DISTRO_LOC=https://...   0 B
<missing>     5 days ago   /bin/sh -c #(nop) USER [root]                  0 B
<missing>     5 days ago   /bin/sh -c #(nop) MAINTAINER Matthew Farr...   0 B
<missing>     12 days ago  /bin/sh -c #(nop) CMD ["/bin/bash"]            0 B
<missing>     12 days ago  /bin/sh -c #(nop) LABEL name=CentOS Base ...   0 B
<missing>     12 days ago  /bin/sh -c #(nop) ADD file:63492ba809361c5...  193 MB
```
The 1st '223 MB' is from when the curl + tar combo was used, so it's expected, but the second one appeared when running the install script.
Here are those 2 lines again, but with the full command (`--no-trunc` option):

```
<missing>  5 days ago  |2 DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz DISTRO_NAME=spark-2.1.0-bin-hadoop2.7 bash -x /tmp/scripts/spark/install                                        224 MB
<missing>  5 days ago  |2 DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz DISTRO_NAME=spark-2.1.0-bin-hadoop2.7 /bin/sh -c cd /opt && curl $DISTRO_LOC | tar -zx && ln -s $DISTRO_NAME spark  223 MB
```
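The duplication makes sense once you remember how image layers work: a `chmod` or `mv` in a later step triggers a copy-up, so the second layer stores the whole file again even though only metadata changed. A rough local analogy with plain `tar` (no docker needed; file names are made up for the demo):

```shell
#!/bin/sh
set -e
tmp=$(mktemp -d) && cd "$tmp"

# "layer 1": a 10 MB file, standing in for the extracted Spark distro
dd if=/dev/zero of=spark.bin bs=1M count=10 2>/dev/null
tar -cf layer1.tar spark.bin

# "layer 2": only the permissions change, but the full data is stored again,
# just like a copy-up when chmod runs in a separate RUN step
chmod 0755 spark.bin
tar -cf layer2.tar spark.bin

du -h layer1.tar layer2.tar   # both archives are ~10 MB
```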
When doing everything inside one `RUN` command, it created just 1 layer :+1: Tomorrow, I'll double-check that the image still works as expected.
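A minimal sketch of the consolidated step (the ARG values are taken from the `docker history` output above; the real Dockerfile may differ in details):

```dockerfile
ARG DISTRO_LOC=https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
ARG DISTRO_NAME=spark-2.1.0-bin-hadoop2.7

# download, extract, symlink, and run the install script in a single RUN,
# so the mv/chmod in the script can't duplicate the extracted files
# into a second ~220 MB layer
RUN cd /opt \
 && curl $DISTRO_LOC | tar -zx \
 && ln -s $DISTRO_NAME spark \
 && bash /tmp/scripts/spark/install
```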
one thing we're trying to do is keep the structure of this dockerfile similar to one dogen would create.
does this change the structural alignment w/ dogen?
would it also reduce the size if https://github.com/radanalyticsio/openshift-spark/blob/master/Dockerfile#L6-L30 were simply moved to the install script?
@mattf excuse my ignorance but what is dogen?
> would it also reduce the size if https://github.com/radanalyticsio/openshift-spark/blob/master/Dockerfile#L6-L30 were simply moved to the install script?
yes, that's another approach: having everything in a shell script and running that script from the dockerfile with the `RUN` directive, although it would make the dockerfile less transparent.
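That alternative would look roughly like this (the script name `install-all.sh` is hypothetical; it would contain the curl | tar, `ln -s`, and `mv`/`chmod` steps that are currently spread across the Dockerfile):

```dockerfile
COPY install-all.sh /tmp/
# a single RUN, so everything still lands in one layer
RUN bash /tmp/install-all.sh
```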
@mattf this is the Dockerfile that is generated by dogen running on this template. It's very different from the Dockerfile that is currently here in master: there is a different version of Spark and different packages. I think it shouldn't block this PR, because the template looks very outdated anyway.
However, I can put all those commands into a bash script so that it would be easier to keep the dockerfile and the yaml template in sync. On the other hand, it will not be obvious what the dockerfile does without looking into the script. The dockerfiles (only the dockerfiles, not the external scripts) are, for instance, shown on Docker Hub: https://hub.docker.com/r/radanalyticsio/openshift-spark/~/dockerfile/
wdyt? And probably we should also wait for @tmckayus' opinion.
@Jiri-Kremser you're right that moving the RUNs into a script reduces transparency and is less idiomatic wrt Dockerfiles. it's more idiomatic for dogen. if the space savings are the same with both approaches i'm ok with keeping the RUNs in the Dockerfile. i want to avoid additional bloat if we move to dogen as the primary means of creating the image.
ok to merge and refactor
This PR reduces the size of the image from 927 MB to 683 MB. Because our other images are based on this one, it'll also reduce their size.

update: I've checked the modified image with the https://radanalytics.io/resources.yaml template and it works well.

update2: ...and the var notebook also worked well
@tmckayus could you please take a look?