Closed foster999 closed 11 months ago
I noted a later sklearn version than in the previous issue, so this could be another API change on their end? Edit: rolling back to 1.3.1 produces the same issue.
Version details:
System:
python: 3.11.4 (main, Jul 18 2023, 13:23:55) [Clang 14.0.3 (clang-1403.0.22.14.1)]
executable: /Users/davidfoster/repos/data_engineering_model/venv/bin/python
machine: macOS-14.1.1-arm64-arm-64bit
Python dependencies:
sklearn: 1.3.2
pip: 23.2.1
setuptools: 65.5.0
numpy: 1.26.0
scipy: 1.11.3
Cython: None
pandas: 2.1.1
matplotlib: 3.8.0
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: openmp
internal_api: openmp
num_threads: 10
prefix: libomp
filepath: /Users/davidfoster/repos/data_engineering_model/venv/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None
user_api: blas
internal_api: openblas
num_threads: 10
prefix: libopenblas
filepath: /Users/davidfoster/repos/data_engineering_model/venv/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: blas
internal_api: openblas
num_threads: 10
prefix: libopenblas
filepath: /Users/davidfoster/repos/data_engineering_model/venv/lib/python3.11/site-packages/scipy/.dylibs/libopenblas.0.dylib
version: 0.3.21.dev
threading_layer: pthreads
architecture: armv8
Hi @foster999,
I'm not familiar with using sklearn, but the Lithops backend implementation is quite straightforward, similar to other implementations such as Ray's. It essentially spawns the functions it receives from sklearn.
Is there a way to verify whether what you found is correct? Perhaps by trying the Python multiprocessing backend instead of Lithops?
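For comparison, a minimal sketch of running a toy workload under joblib's built-in "multiprocessing" backend (a stand-in workload, not the grid search from this thread):

```python
import joblib
from joblib import Parallel, delayed

# Run a toy workload on joblib's built-in "multiprocessing" backend,
# as a point of comparison against the registered lithops backend.
with joblib.parallel_backend("multiprocessing"):
    out = Parallel()(delayed(pow)(x, 2) for x in range(4))

print(out)  # [0, 1, 4, 9]
```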
If I switch the joblib backend to "loky", both the minimal and the more complex example run fine. The complex example doesn't error, which suggests the additional broken function calls aren't being made.
I'll try some debugging in the lithops backend to see if I can spot where the additional calls are coming from.
Interesting. I'll try to spot it as well. There might be an error in the Lithops backend, even if it is simple.
In brief, the sklearn grid search trains multiple models with combinations of parameters (2 * 3 combinations in this example, so 6 activations). When `refit` is on, it should repeat the training of the model with the best outcome, so this should be a single additional activation.
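As a quick sanity check of those numbers:

```python
# Expected activation count for the grid search in this thread:
# number of parameter combinations * cv folds, plus one refit
# when refit=True.
n_candidates = 3   # n_estimators: [100, 50, 25]
cv = 2
cv_fits = n_candidates * cv
print(cv_fits)      # 6 cross-validation fits
print(cv_fits + 1)  # 7 total activations expected with refit=True
```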
I've found that switching off the `refit` argument stops the 100 activations, so it's likely an issue in the way the model training is being called when `refit` is True:
import joblib
from lithops.util.joblib import register_lithops
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

digits = load_digits()

param_grid = {
    "n_estimators": [100, 50, 25],
}
model = RandomForestClassifier()
search = GridSearchCV(model, param_grid, cv=2, refit=False)

register_lithops()
with joblib.parallel_backend("lithops"):
    search.fit(
        digits.data,
        digits.target,
    )

print("Best score: %0.3f" % search.best_score_)
# best_estimator_ is unavailable when refit=False:
# print("Best parameters set:")
# print(search.best_estimator_.get_params())
This call in sklearn is the one that triggers the 100 activations: https://github.com/scikit-learn/scikit-learn/blob/f0fee2736426553a8a32df1ff153ee9ccdfb434b/sklearn/model_selection/_search.py#L1008C11-L1008C11
It's very confusing, as some or all of the additional activations seem to have the positional arguments to this call reordered 🤔
I'm not sure if this is related, or a secondary issue. The error that I see from the erroneous activations isn't handled correctly by lithops: `TypeError: LocalhostHandler.clear() got an unexpected keyword argument 'exception'`
It looks like it's using the v1 handler, but passing parameters for the v2 handler.
Edit: This must be a separate issue with backwards compatibility of the default/v1 handler. Setting the localhost version to 2 removes this handling error.
Hi, I found and fixed this issue yesterday afternoon, so it is already in master branch if you update.
Thanks @JosepSampe. Do the additional activations reproduce for you with the example above?
I've found that sklearn generates the additional function calls here from a second initialisation of the parallel backend, and that these calls are intentional. However, the `prefer="threads"` param is supposed to indicate that they should be run as threads on a single process rather than by spinning up multiple processes.
sklearn seems to interpret this `prefer` param, so it might be a bug in sklearn that means threading is not used for these calls.
I've traced the args and kwargs being passed to these calls, and they appear to be correct when passed to the parallel backend, so it seems like lithops could be passing the args incorrectly for the later activations that error.
Edit: `prefer` is actually interpreted by joblib, so it could be that lithops isn't respecting the preference for these calls?
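For context, `prefer` is a soft hint passed to joblib's `Parallel`; a backend is free to honour or ignore it. A minimal illustration with stock joblib:

```python
from joblib import Parallel, delayed

# prefer="threads" hints that a thread-based backend should be used,
# avoiding process spawning; the active backend may still override it.
out = Parallel(n_jobs=2, prefer="threads")(delayed(abs)(x) for x in [-1, -2, 3])
print(out)  # [1, 2, 3]
```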
@foster999 I can confirm I have the same issue with the example above, and with `refit` set to `False`, those 100 extra functions are not initiated.
EDIT: As you found, this `prefer` parameter is not interpreted inside the lithops backend. So I think the way to proceed is to check this parameter and, when it is set to `threads`, spawn only one function that will use a threading pool to execute all the received tasks in a single function invocation.
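A minimal sketch of that idea (a hypothetical `run_batch` helper, not the actual lithops backend code): when `prefer` is `"threads"`, execute the whole batch via a thread pool within one invocation instead of one invocation per task.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batch(tasks, prefer=None):
    """Execute a batch of zero-argument callables.

    Hypothetical helper illustrating the proposed fix: with
    prefer="threads", all tasks run inside a single invocation
    using a thread pool, rather than spawning one invocation
    per task.
    """
    if prefer == "threads":
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(task) for task in tasks]
            return [f.result() for f in futures]
    # Fallback: one invocation per task (sequential in this sketch).
    return [task() for task in tasks]

print(run_batch([lambda i=i: i * i for i in range(5)], prefer="threads"))
# [0, 1, 4, 9, 16]
```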
@foster999 I included the fix here if you want to give it a try
Thanks very much for adding the feature @JosepSampe, that's working for the example above and the more complex use case 🥳
Using the same example from #1172:
With configuration:
The expected 6 activations are run, followed by another 100 activations. I don't think these 100 should be triggered?
In more complex examples the 100 activations seem to be running model training with the wrong X and Y datasets, causing other errors. This seems to be specific to this sklearn classifier, as others run without the additional activations.
The log shows: