RNarayan73 opened this issue 2 years ago
Looking at the code, we try to maintain full cluster resource utilization by dividing the total number of CPUs in the cluster by n_jobs and using the result as the number of CPUs to allocate to each trial. Honestly, we are probably trying to be too clever here, and this should be handled by a different system altogether. Will add that to the backlog.
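Roughly, the allocation described above looks like the following sketch (a paraphrase for illustration, not the actual tune-sklearn code; the n_jobs value is just an example):

```python
# Paraphrase of the allocation described above: split the cluster's CPUs
# evenly across the n_jobs concurrent trials (not the actual library code).
import ray

ray.init()  # or connect to an existing cluster
n_jobs = 14
total_cpus = int(ray.cluster_resources().get("CPU", 1))
cpus_per_trial = max(1, total_cpus // n_jobs)
```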
As a workaround, try running `ray.init(num_cpus=14)` before initializing `TuneSearchCV(n_jobs=None)`.
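A minimal sketch of that workaround, with a placeholder estimator, data and search space (not from the original report):

```python
# Workaround sketch: cap Ray at the physical core count before
# tune-sklearn gets a chance to initialize it itself.
# The estimator, data and parameter grid are placeholders.
import ray
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from tune_sklearn import TuneSearchCV

ray.init(num_cpus=14)  # match the number of physical cores

X, y = make_classification(n_samples=500, random_state=42)

search = TuneSearchCV(
    SVC(),
    param_distributions={"C": [0.1, 1, 10]},
    n_jobs=None,  # let TuneSearchCV pick up the Ray instance started above
)
search.fit(X, y)
```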
Hi @Yard1
On further digging, I tried out the following scenarios and came across these observations:
1) I realised that the status quo is actually worse than what I had described above. Not only does it request up to the maximum number of virtual cpus (20), it also seems to utilise only up to half of the requested cpus at any one time (I'm guessing it arbitrarily assumes that the physical cores are half the total cpus reported by cpu_count. That may have been fine until the Intel 12th gen cpus came along, which have 14 cores and 20 virtual cpus). So, with the status quo, 4 of the cores will never be used, and it might be worth revisiting the logic for the physical core count to use psutil.cpu_count(logical=False) instead of os.cpu_count()!
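For reference, the two counts in question on the machine described above (a 14-core / 20-thread CPU; the values will differ on other hardware):

```python
# Logical vs. physical CPU counts; the comments show the reporter's
# machine (14 physical cores, 20 hardware threads).
import os
import psutil

print(os.cpu_count())                   # 20 - logical CPUs (threads)
print(psutil.cpu_count(logical=False))  # 14 - physical cores
```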
2) Your proposed workaround does show the maximum number of cpus requested correctly (14) and does seem to run at most 14 processes at any one time, when TuneSearchCV is run once. However, it causes issues when you try to parallelise the run with joblib.Parallel using the default loky backend, as in cross_validate().
2a) For example, when I initialise Ray separately with ray.init() and then call TuneSearchCV cv.n_splits times in a loop through cross_validate(), which uses the loky backend for parallel processing, the first Ray instance (gcs_server + raylet + n_jobs processes) is completely ignored by the subsequent calls to TuneSearchCV, which go on to create completely new instances (gcs_server + raylet + n_jobs processes) cv.n_splits times. It eventually reaches hundreds of additional processes for a 5-fold cross-validation, leading to massive oversubscription and often causing OOM errors or crashes (see the sketch after this list).
2b) If I don't call ray.init separately and follow the status quo, leaving TuneSearchCV to do the initialisation within cross_validate, there is no 'stranded' instance. Besides the issue of only 10 cores being used, described above, looping through cv.n_splits still generates a new Ray instance (gcs_server + raylet + n_jobs processes) for each split within cross_validate, with the same effects as above. However, there is one less instance hogging memory, which causes fewer crashes, albeit still with massive oversubscription, so I have fallen back on this approach!
2c) The only way to avoid this is to not parallelise the outer run at all, by passing n_jobs=1 to parallel_backend and executing all the cv.n_splits within cross_validate() in sequence, which is a shame as it underutilises the available cores.
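A minimal sketch of the pattern that triggers this oversubscription (the estimator, data and parameter grid are placeholders, not from the original report):

```python
# Problematic pattern (case 2a): an outer loky-parallelised cross_validate()
# whose estimator is a TuneSearchCV, so every fold's worker process starts
# its own Ray instance (gcs_server + raylet + workers).
import ray
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from tune_sklearn import TuneSearchCV

ray.init(num_cpus=14)  # this instance ends up stranded and unused

X, y = make_classification(n_samples=500, random_state=42)
search = TuneSearchCV(SVC(), param_distributions={"C": [0.1, 1, 10]}, n_jobs=None)

# Default joblib backend is loky: 5 folds -> 5 separate worker processes,
# each spinning up a fresh Ray instance.
scores = cross_validate(search, X, y, cv=5, n_jobs=5)
```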
Despite the issues cited above, in all of these cases I get consistent, repeatable results for my metrics.
3) Finally, I registered Ray as a backend with parallel_backend and used it to parallelise the runs instead of the default loky (see the sketch after this item). This works well without oversubscription, and if I declare num_cpus in ray.init() it uses all 14 cores, but it issues a warning message as below:
> 2022-11-14 22:15:02,545 WARNING pool.py:591 -- The 'context' argument is not supported using ray. Please refer to the documentation for how to control ray initialization.
> [Parallel(n_jobs=14)]: Using backend RayBackend with 14 concurrent workers.
However, the metrics results are not repeatable despite using the exact same setup and random_state seed as earlier.
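A sketch of that setup, using Ray's joblib backend registration (estimator, data and parameter grid are placeholders as before):

```python
# Case 3 sketch: register Ray as a joblib backend so the cross_validate()
# folds run as Ray tasks instead of loky processes.
import ray
from joblib import parallel_backend
from ray.util.joblib import register_ray
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from tune_sklearn import TuneSearchCV

ray.init(num_cpus=14)
register_ray()  # makes "ray" available as a joblib backend name

X, y = make_classification(n_samples=500, random_state=42)
search = TuneSearchCV(SVC(), param_distributions={"C": [0.1, 1, 10]},
                      random_state=42, n_jobs=None)

with parallel_backend("ray", n_jobs=14):
    scores = cross_validate(search, X, y, cv=5)
```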
Conclusions and questions: A) As it stands, it seems that loky is incompatible with Ray for parallelisation. Is it possible to get TuneSearchCV to play nicely with loky so that we can get repeatable results?
B) What could be the reason for the non-repeatable results when using ray in parallel_backend? Can it be addressed?
Thanks Narayan
@Yard1 Is this issue on the backlog for an imminent release? Regards Narayan
Hello,
I have passed a value equal to the number of physical cores on my computer (14) to n_jobs when calling TuneSearchCV, as opposed to the total cpu_count of 20, which includes virtual cpus, as per the parameters below (truncated for brevity):
However, it seems to ignore this and requests more than 14 cpus when executing, as shown below:
Has anyone come across this? Or am I doing something wrong?
Please let me know if you need more info.
Regards Narayan