Closed: jackmorgenson closed this issue 3 years ago
Hi @jackmorgenson! I wasn't aware of such a problem. Thank you.
This should be easy to add. I will add it in the next release, 0.9.0.
What do you mean that MLJAR AutoML fails? Are there any errors, or was it just slow to train?
Linear, LightGBM, XGBoost, anything that uses `n_jobs`... the training will eventually time out. I ran the multi-class classification example as-is. Here is a comparison: one run finished in 226.03 seconds; the other timed out (sorry, I forget what the timeout was, but it was at least 30 minutes). Anyway, CPU usage was consistently at 800%, but there were dozens and dozens of Python processes trying to run because the OS can see 88 processors. So, in our organization we always have folks set `n_jobs` to the number of logical processors they specified when spawning the notebook.
The `n_jobs` parameter is added in the `AutoML()` constructor. By default it is set to `-1`, which means that all CPUs will be used. Unfortunately, the MLP implementation from sklearn doesn't support the `n_jobs` parameter, so when the number of jobs is set to anything other than `-1`, the Neural Network algorithm is disabled (not trained at all).
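For reference, a minimal sketch of using it (the dataset here is just a stand-in; the `n_jobs` argument is the parameter described above):

```python
from sklearn.datasets import load_iris
from supervised.automl import AutoML  # mljar-supervised

X, y = load_iris(return_X_y=True)

# Limit training to 4 CPUs. Note: any value other than -1 also
# disables the sklearn MLP-based Neural Network algorithm, as noted above.
automl = AutoML(n_jobs=4)
automl.fit(X, y)
```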
The changes will go into the next release, 0.9.0. Right now they are in the dev branch. To install the package with the newest changes, please run:

```
pip install -U git+https://github.com/mljar/mljar-supervised.git@dev
```
@pplonski I'm using 0.10.4, and when I set `n_jobs=1` in `AutoML()`, it fires up every single core when I check it in htop.
It seems that `n_jobs` may not be behaving as desired. I only want it to use 10 cores at a time, and no more than that, since I work on a shared server.
Any help to this regard would be greatly appreciated!
@tijeco there was a bug where feature importance computation was using all cores - https://github.com/mljar/mljar-supervised/issues/398 - it is fixed in 0.10.6. Do you compute feature importance?
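In the meantime, one way to sidestep the importance computation entirely is to turn explanations off; a sketch assuming the `explain_level` constructor parameter (0 = no explanations at all):

```python
from supervised.automl import AutoML  # mljar-supervised

# explain_level=0 skips all explanations, including feature
# importance, so the code path from #398 is never reached.
automl = AutoML(n_jobs=1, explain_level=0)
```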
@pplonski thanks! I didn't notice the other issue. I'll get 0.10.6 and see how that goes.
I do intend to calculate feature importance!
@pplonski So I monitored htop closely as it ran with `n_jobs=1` on a smallish classification dataset.
When XGBoost starts, all the cores start lighting up. Maybe there is something in the XGBoost code that disregards `n_jobs`?
@pplonski I found a workaround for the meantime! I just learned about `taskset`, and honestly I'm embarrassed that I hadn't heard of it sooner. It restricts which CPUs a given process (and its threads) may run on.

```
taskset -c 0-10 python xxx.py
```

pins the process to CPUs 0 through 10 (note that's 11 cores; use `-c 0-9` for exactly 10).
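If wrapping the command is inconvenient, the same pinning can be done from inside the script on Linux; a sketch using the standard library's `os.sched_setaffinity`:

```python
import os

# Pin the current process (pid 0) to CPUs 0-9 before any worker
# threads or subprocesses start; they inherit the affinity mask.
# Linux-only; equivalent in effect to `taskset -c 0-9`.
os.sched_setaffinity(0, set(range(10)))
```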
@tijeco glad that you found a solution; it might be something with the external packages.
I found this Stack Overflow discussion: https://stackoverflow.com/questions/48269248/limiting-the-number-of-threads-used-by-xgboost
The solution there was to set, before importing any OpenMP-backed libraries such as XGBoost:

```python
import os
os.environ['OMP_NUM_THREADS'] = "1"
```

Maybe you can try running your code with `OMP_NUM_THREADS` set at the very beginning?
I've read (and negatively experienced) that `n_jobs = -1` is hard-coded for applicable models such as LightGBM, XGBoost, etc. My organization runs Jupyter notebooks in a Kubernetes containerized environment. Unfortunately, that means the container OS and Python see all CPU cores of the underlying physical host. For example, a notebook is spawned with 4 logical processors, yet `multiprocessing.cpu_count()` reveals 88 processors, as it can see through the container layer. With `n_jobs` hard-coded to -1, the MLJAR AutoML fails because it thinks it can run 88 parallel threads on 4 logical processors, so basically the notebook just freezes and panics because of all of the CPU thrashing.
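For illustration, the mismatch is easy to demonstrate from inside such a container; `os.sched_getaffinity` reports the CPUs actually granted to the process (Linux-only, and only when the limit is enforced via cpusets rather than CFS quota):

```python
import multiprocessing
import os

# Sees straight through the container to the physical host:
print(multiprocessing.cpu_count())    # e.g. 88

# CPUs this process may actually run on (Linux-only); reflects
# cpuset-style limits, while CFS-quota limits won't show up here.
print(len(os.sched_getaffinity(0)))   # e.g. 4 when a cpuset is applied
```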
Please consider making `n_jobs` configurable. Thanks!