radanalyticsio / openshift-spark


For master and worker nodes to infer the JVM memory settings based on cgroups #71

Closed jkremser closed 4 years ago

jkremser commented 6 years ago

If USE_CGROUP_MEMORY is set to true, the launcher bypasses the spark-class script and starts the worker or master directly with those two custom JVM options. If it is unset or set to false, the previous approach is taken.
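The launch logic described above might look roughly like this. This is a hypothetical sketch: the function name `build_launch_cmd` is invented, and the two JVM flags are an assumption (the Java 8 experimental cgroup options), not necessarily the exact flags in the patch.

```shell
# Hypothetical sketch of the launcher logic; the flag pair is assumed to be
# the Java 8 experimental cgroup options, not confirmed from the patch itself.
build_launch_cmd() {
  if [ "$USE_CGROUP_MEMORY" = "true" ]; then
    # Bypass spark-class and let the JVM derive its heap from the cgroup limit
    echo "java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap $START_CLASS"
  else
    # Previous approach: delegate to the stock spark-class script
    echo "spark-class $START_CLASS"
  fi
}

START_CLASS=org.apache.spark.deploy.master.Master
USE_CGROUP_MEMORY=true
build_launch_cmd
```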

To test, start two versions of the master, one that is cgroup-aware and one that is not:

```
docker run -d --rm --name=no-cg \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  jkremser/openshift-spark:cgroups

docker run -d --rm --name=cg --memory 500m -e USE_CGROUP_MEMORY=true \
  --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  jkremser/openshift-spark:cgroups

# install jmap in both containers
docker exec --user 0 cg yum --enablerepo="*-debug*" install -y java-1.8.0-openjdk-devel java-1.8.0-openjdk-debuginfo
docker exec --user 0 no-cg yum --enablerepo="*-debug*" install -y java-1.8.0-openjdk-devel java-1.8.0-openjdk-debuginfo

# print the JVM memory settings
echo -e "\n\n Heap config for cg:"
docker exec cg jmap -heap $(docker exec cg jps -q | sort | head -1) | grep -C3 MaxHeapSize
echo -e "\n\n Heap config for no-cg:"
docker exec no-cg jmap -heap $(docker exec no-cg jps -q | sort | head -1) | grep -C3 MaxHeapSize
```

Note: the flags

```
--cap-add=SYS_PTRACE --security-opt seccomp=unconfined
```

are there only so that jmap works properly. I prefer jmap's output to printing the low-level values from /proc/pid/*, /sys/fs/cgroup/memory/memory.*, or /proc/meminfo, because it shows the actual JVM memory allocations.
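For comparison with jmap's view, the raw cgroup limit can also be read directly. Below is a small helper (hypothetical, not part of the image) that converts the byte count printed by the cgroup v1 file into MiB:

```shell
# to_mib: hypothetical helper converting a raw byte count, as printed by
# /sys/fs/cgroup/memory/memory.limit_in_bytes (cgroup v1), into MiB.
to_mib() { echo "$(( $1 / 1048576 ))MiB"; }

# Inside the "cg" container started above with --memory 500m, the raw limit
# could be read with:
#   docker exec cg cat /sys/fs/cgroup/memory/memory.limit_in_bytes
to_mib 524288000   # prints 500MiB
```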

tmckayus commented 4 years ago

@jkremser this should be a change for the py36 and incomplete images as well, correct? @elmiko ptal at rerunning/debugging the failed test

elmiko commented 4 years ago

yeah, this probably needs to be implemented for py36 as well

tmckayus commented 4 years ago

okay turns out there is a mistake in here, discovered in https://github.com/radanalyticsio/openshift-spark/pull/103

I'll fix it tomorrow :) Basically, SPARK_MASTER_ADDRESS needs to be on the worker START_CLASS, not the master START_CLASS, and we need the build dirs for all versions.
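The fix described might be sketched like this. Everything here is hypothetical (the real launch script's variable names and structure may differ); the point is only that the worker's START_CLASS carries the master address while the master's does not.

```shell
# Hypothetical sketch: only the worker takes the master's address as an
# argument; the master START_CLASS must not carry SPARK_MASTER_ADDRESS.
start_class() {
  case "$1" in
    master) echo "org.apache.spark.deploy.master.Master" ;;
    worker) echo "org.apache.spark.deploy.worker.Worker $SPARK_MASTER_ADDRESS" ;;
  esac
}

SPARK_MASTER_ADDRESS="spark://spark-master:7077"  # assumed value for illustration
start_class worker   # prints the worker class followed by the master address
```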

tmckayus commented 4 years ago

@jkremser I don't seem to be able to push a tweak to this PR. Can you give me write access to your fork so I can update? Alternatively I can submit another PR (or mail you a patch)

jkremser commented 4 years ago

@tmckayus please open a new PR w/ my commit and your tweak commit on top of it. May the ~force~ git cherry-pick be with you 🖖

tmckayus commented 4 years ago

@elmiko @jkremser ptal So actually it seems that if we wait long enough, problems resolve themselves :) I think we can close this without merging.

Based on content in this blog post, Java 8 received a backport of the container awareness work from Java 10, which makes this change unnecessary as of OpenJDK build 1.8.0_191 (note that 1.8.0_161 below still shows the old behavior): https://merikan.com/2019/04/jvm-in-a-container/

Looks like 2.4-latest ships build 1.8.0_232, and the heap size is small:

```
$ docker run -m 100MB radanalyticsio/openshift-spark:2.4-latest java -XshowSettings:vm -version
VM settings:
    Max. Heap Size (Estimated): 48.38M
    Ergonomics Machine Class: server
    Using VM: OpenJDK 64-Bit Server VM

openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)
```

If we go all the way back to 2.2-latest, we see something different:

```
$ docker run -m 100MB radanalyticsio/openshift-spark:2.2-latest java -XshowSettings:vm -version
VM settings:
    Max. Heap Size (Estimated): 3.45G
    Ergonomics Machine Class: server
    Using VM: OpenJDK 64-Bit Server VM

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
```

elmiko commented 4 years ago

sounds good to me, do we need to document that this takes some "settling" time?

tmckayus commented 4 years ago

> sounds good to me, do we need to document that this takes some "settling" time?

I don't think it needs settling time -- I meant the PR was resolved via inaction :)

jkremser commented 4 years ago

issues that resolve themselves when you're not looking into them are the best ones :) I am closing the PR then. If you decide you need this in older versions of openshift-spark as well, it would probably be better to use a newer Java version rather than this patch, which adds complexity to the bash launch script.

jkremser commented 4 years ago

I will keep the branch alive, though.. just to be sure.