System information
Describe the problem
When tests run on Jenkins, it looks like the build detects how many CPUs are available on the machine and then allocates some of them to the build. This can over-subscribe the CPUs, causing massive system load.
The builds are pushing system load to nearly 200; with only 48 hyper-threaded cores on these boxes, that means roughly 150 threads are waiting for available CPU cycles.
Since we currently allow more than one Ray build per node (PRB and master), this can lead to serious CPU bottlenecking and increased system latency.
I would strongly recommend capping the number of allocated CPUs at a much more reasonable number, maybe 6 or 8 at most.
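As a minimal sketch of the suggested fix (the helper name and constant are hypothetical, not Ray's actual CI code): instead of passing the raw detected CPU count to the build, clamp it at a fixed maximum before handing it to `ray.init(num_cpus=...)`:

```python
import multiprocessing

# Hypothetical cap suggested in this report; not an existing Ray setting.
MAX_BUILD_CPUS = 8

def capped_cpu_count(max_cpus=MAX_BUILD_CPUS):
    """Return the detected CPU count, clamped to max_cpus.

    Using this instead of the raw multiprocessing.cpu_count() keeps a
    single build from claiming all 48 cores on a shared Jenkins node.
    """
    return min(multiprocessing.cpu_count(), max_cpus)

# The build would then start Ray with the capped value, e.g.:
#   ray.init(num_cpus=capped_cpu_count())
```

With two builds per node (PRB and master), a cap of 8 bounds the combined demand at 16 of the 48 cores rather than letting each build claim the whole machine.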
Source code / logs
From https://amplab.cs.berkeley.edu/jenkins/job/Ray/4277/consoleFull :
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 17/48 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/default
RUNNING trials: