ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[jenkins builds] please limit the number of CPUs you reserve during your tests #2595

Closed: shaneknapp closed this issue 4 years ago

shaneknapp commented 6 years ago

System information

Describe the problem

when your tests run on jenkins, it looks like you detect how many CPUs the machine has and then reserve a share of them for your builds. this can over-subscribe the CPUs and cause massive system load.

[image: ray-load]

your builds are pushing the system load to nearly 200... and with only 48 HT cores on these boxes, there are ~150 threads waiting for available CPU cycles.
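for anyone who wants to check the arithmetic behind that ~150 figure, here is a rough sketch. it assumes the load average approximates the number of runnable tasks (on linux it also counts tasks in uninterruptible I/O wait, so treat the result as an upper bound). the function names are made up for illustration and are not part of ray or jenkins.

```python
import os


def estimated_backlog(load_avg: float, logical_cpus: int) -> float:
    """Rough number of runnable tasks that have no CPU to run on.

    Assumption: load average ~= runnable tasks. Anything above the
    logical CPU count is treated as waiting for CPU time.
    """
    return max(0.0, load_avg - logical_cpus)


def current_backlog() -> float:
    """Same estimate for the machine this runs on (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    return estimated_backlog(load_1min, os.cpu_count() or 1)


# Numbers from this report: load near 200 on a box with 48 HT cores.
print(estimated_backlog(200, 48))  # 152.0
```

a load of 200 on 48 logical cores leaves roughly 152 tasks queued, which matches the ~150 quoted above.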

since we currently allow more than one Ray build per node (PRB and master), this can really lead to CPU bottlenecking and increase system latency.

i would strongly recommend capping the number of allocated CPUs at a much more reasonable number... maybe 6 or 8 at the most.
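a minimal sketch of what such a cap could look like in a test setup, assuming the tests currently size themselves from the detected CPU count. `MAX_TEST_CPUS` and `capped_cpu_count` are hypothetical names used only for illustration; the one real API referenced is the `num_cpus` argument to `ray.init`, shown in a comment so the snippet runs without ray installed.

```python
import os

# Hypothetical cap, taken from the "6 or 8 at the most" suggestion above.
MAX_TEST_CPUS = 8


def capped_cpu_count(cap: int = MAX_TEST_CPUS) -> int:
    """Return the detected CPU count, limited to `cap` and never below 1."""
    detected = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, min(detected, cap))


num_cpus = capped_cpu_count()

# In a test harness this value would be passed to ray explicitly instead of
# letting it claim every core it detects, for example:
#   import ray
#   ray.init(num_cpus=num_cpus)
print(num_cpus)
```

on a 48-thread worker this yields 8, so two concurrent builds (PRB and master) would reserve at most 16 CPUs between them.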

Source code / logs

from https://amplab.cs.berkeley.edu/jenkins/job/Ray/4277/consoleFull :

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 17/48 CPUs, 0/0 GPUs
Result logdir: /root/ray_results/default
RUNNING trials:

shaneknapp commented 6 years ago

the PRB is pretty crazy, too. two running on one worker gives us the following:

[image: screen shot 2018-08-07 at 4 31 58 pm]