JohannesWiesner opened this issue 11 months ago
Ok, I haven't managed to replicate this exactly on my (Windows) laptop or on Colab. Will try Linux later.
In the meantime, a small change to your code is to go with:
# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}
Instead of
# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}
I'm not sure whether the code previously supported numpy arrays and I lost that support in a refactor, or whether it has always been this way. I think I should be able to add the support back relatively easily.
It's possible that this alone would fix your example, because the multiprocessing can give confusing error codes whenever there is a bug.
True, now I remember that we had this issue before. Unfortunately I still get the error, even when using param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}.
P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would make sense if lists, numpy arrays, and other iterables were all valid inputs.
I then tried to use simulated data and it seems to work with:
import numpy as np
from cca_zoo.data.simulated import LinearSimulatedData

n = 100
p = 10
q = 100
latent_dims = 3

data = LinearSimulatedData(
    view_features=[p, q],
    latent_dims=latent_dims,
    correlation=[0.9, 0.8, 0.7],
    structure='identity'
)
(X, y) = data.sample(n)
groups = np.repeat(np.arange(0, 5, 1), 20)
Here I get LinAlgError: SVD did not converge, but I guess that could stem from a different source (at least GridSearch runs through).
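For completeness, a throwaway check one could run on the simulated views (assuming the X and y from the snippet above) to rule out NaNs, infs, or constant columns, which are the usual suspects behind "SVD did not converge":

for name, view in {"X": X, "y": y}.items():
    print(name,
          "all finite:", np.isfinite(view).all(),
          "constant columns:", int((view.std(axis=0) == 0).sum()))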
P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would make sense if lists, numpy arrays, and other iterables were all valid inputs.
Yes, agree
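For reference, adding that support back could look something like this (hypothetical helper, not the current cca-zoo code): coerce each candidate entry in param_grid into a plain list before handing it to the sklearn search machinery.

import numpy as np

def normalize_param_grid(param_grid):
    # Hypothetical helper: accept numpy arrays / tuples / ranges of candidates
    # per view and turn them into plain lists; leave scalars untouched.
    return {
        key: [list(c) if isinstance(c, (np.ndarray, tuple, range)) else c
              for c in candidates]
        for key, candidates in param_grid.items()
    }

# e.g. normalize_param_grid({'tau': [np.arange(0.1, 1.0, 0.1), 0]})
# gives the same grid as {'tau': [list(np.arange(0.1, 1.0, 0.1)), 0]}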
I guess I have to take a look at my dataset and check if the error stems from there. Maybe it has something to do with https://github.com/jameschapman19/cca_zoo/issues/175 because that should be the only difference here. Right now, my X and y are already normalized before passing them to GridSearchCV.
I will implement StandardScaler to mimic the old scale=True behavior and come back to you. The data should not have changed in the meantime, so right now I don't really see why the error should stem from there.
Okay, I tested the following code on a Windows machine and it works fine. Both on the Windows and the Ubuntu machine I have scikit-learn 1.3.0 and cca-zoo 2.1.0 installed.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import rCCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from mvlearn.utils import check_Xs
from sklearn.base import TransformerMixin
from sklearn.utils.validation import check_is_fitted
###############################################################################
## Prepare Analysis ###########################################################
###############################################################################
rng = np.random.RandomState(42)
brain_df = np.loadtxt('brain_df.txt')
behavior_df = np.loadtxt('behavior_df.txt')
groups = np.loadtxt('groups.txt')
###############################################################################
## Analysis settings ##########################################################
###############################################################################
# define latent dimensions
latent_dimensions = 1
# define cross validation strategy
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)
# define a search space (optimize left and right penalty parameters)
param_grid = {'cca__c':[list(np.arange(0.1,1.1,0.1)),list(np.arange(0.1,1.1,0.1))]}
"""
Class which allows for the different (or the same) processing of multiple views of data.
"""
class MultiViewPreprocessing(TransformerMixin):
def __init__(self, preprocessing_list):
self.preprocessing_list = preprocessing_list
def fit(self, views, y=None):
"""
Fits the associated preprocessing steps to each view.
Parameters
----------
views
y
Returns
-------
"""
if len(self.preprocessing_list) == 1:
self.preprocessing_list = self.preprocessing_list * len(views)
elif len(self.preprocessing_list) != len(views):
raise ValueError("Length of preprocessing_list must be 1 (apply the same preprocessing to each view) or equal to the number of views")
check_Xs(views, enforce_views=range(len(self.preprocessing_list)))
for view, preprocessing in zip(views, self.preprocessing_list):
preprocessing.fit(view, y)
return self
def transform(self, X, y=None):
"""
Transforms each view using the associated preprocessing steps.
Parameters
----------
X
y
Returns
-------
"""
[check_is_fitted(preprocessing) for preprocessing in self.preprocessing_list]
check_Xs(X, enforce_views=range(len(self.preprocessing_list)))
return [preprocessing.transform(view) for view, preprocessing in zip(X, self.preprocessing_list)]
# define an estimator
estimator = Pipeline([
('preprocessing', MultiViewPreprocessing((StandardScaler(),StandardScaler()))),
('cca',rCCA(latent_dimensions=latent_dimensions,random_state=rng))
])
###############################################################################
## Run GridSearch ############################################################
##############################################################################
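# custom scorer with the (estimator, views) signature expected by GridSearchCV;
# it averages the scores returned by the fitted estimator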
def scorer(estimator, views):
scores = estimator.score(views)
return np.mean(scores)
grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=5,cv=cv)
grid.fit([brain_df,behavior_df],groups=groups)
best_params = grid.best_params_
estimator_best = grid.best_estimator_
X_weights,y_weights = estimator_best.weights
print(f"Best parameters are: {best_params}\n")
Data:
behavior_df.txt brain_df.txt groups.txt
Could you perhaps also re-check on a Linux machine if this is an OS issue?
Works fine or doesn't work fine?
Ah sorry. Works fine on Windows but not on Ubuntu.
ok - have you got an error message I can see? I can try and get one myself otherwise XD
Here's the complete traceback:
Traceback (most recent call last):
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
exec(code, globals, locals)
File ~/work/projects/project_hcp/testing/test_cca.py:108
grid.fit([brain_df,behavior_df],groups=groups)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:208 in fit
self = BaseSearchCV.fit(self, np.hstack(X), y=y, groups=groups, **fit_params)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/base.py:1151 in wrapper
return fit_method(estimator, *args, **kwargs)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:898 in fit
self._run_search(evaluate_candidates)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:199 in _run_search
evaluate_candidates(param_grid)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:845 in evaluate_candidates
out = parallel(
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/utils/parallel.py:65 in __call__
return super().__call__(iterable_with_config)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1944 in __call__
return output if self.return_generator else list(output)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1587 in _get_outputs
yield from self._retrieve()
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1691 in _retrieve
self._raise_error_fast()
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1726 in _raise_error_fast
error_job.get_result(self.timeout)
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:735 in get_result
return self._return_or_raise()
File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:753 in _return_or_raise
raise self._result
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11)}
Okay, tested it on our Linux server. Same error here. Seems to be an OS issue!
OK, I think it's also possible that it's consuming more memory than expected. Will investigate - apologies and thanks for bringing this to my attention!
I'm thinking trying two things will help diagnose this: (1) run the grid search with n_jobs=1, and (2) drop the preprocessing step from the pipeline (your data is already normalized anyway).
1 will help work out if it is multiprocessing causing the problem, 2 will help work out if it is in the pipeline.
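Something along these lines (just a sketch reusing the variable names from your script above):

# Diagnostic 1: same search but single-process - if the crash disappears,
# the problem is in the multiprocessing, not the model
grid_serial = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=1, cv=cv)
grid_serial.fit([brain_df, behavior_df], groups=groups)

# Diagnostic 2: parallel again but with the bare rCCA instead of the Pipeline
# (your data is already normalized) - if the crash disappears, the
# preprocessing step is involved
grid_bare = GridSearchCV(
    rCCA(latent_dimensions=latent_dimensions, random_state=rng),
    {'c': param_grid['cca__c']},  # same candidates, without the 'cca__' pipeline prefix
    scoring=scorer, n_jobs=5, cv=cv,
)
grid_bare.fit([brain_df, behavior_df], groups=groups)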
Yup, with n_jobs=1 I don't have this issue, but of course that makes sense because apparently it's a parallelization issue. Removing the preprocessing does not change the error.
Thanks for this. Will have a dig around.
Hi @JohannesWiesner. From some reading I'm thinking this comes down to the versions of scipy/numpy,
because there's nothing substantive that has changed in rCCA which could have caused this (it essentially does a lossless PCA [keeping all components] for efficiency and then sets up an eigenvalue problem which it sends to scipy).
So I think if you give updating scipy/numpy a go that might work?
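For context, the whole numerical core boils down to something like the textbook formulation below (not the literal cca-zoo code path, just the shape of the computation that ends up in scipy/LAPACK): regularized covariance blocks plus one symmetric generalized eigendecomposition.

import numpy as np
from scipy.linalg import eigh

def rcca_sketch(X, Y, c1=0.1, c2=0.1, latent_dims=1):
    # Center the views
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, p = X.shape
    q = Y.shape[1]
    # Regularized (co)variance blocks: c=0 is plain CCA, c close to 1 shrinks towards identity
    Cxx = (1 - c1) * (X.T @ X) / n + c1 * np.eye(p)
    Cyy = (1 - c2) * (Y.T @ Y) / n + c2 * np.eye(q)
    Cxy = (X.T @ Y) / n
    # Generalized symmetric eigenproblem A w = rho B w with w = [wx; wy]
    A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((q, q))]])
    B = np.block([[Cxx, np.zeros((p, q))], [np.zeros((q, p)), Cyy]])
    evals, evecs = eigh(A, B)
    top = np.argsort(evals)[::-1][:latent_dims]
    w = evecs[:, top]
    return w[:p], w[p:]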
Hm, are you sure? I also tried it with SCCA_PMD and I'm getting the same error. I will update scipy/numpy and also do a sanity check with other sparse CCAs!
Ah true, all of those will use the same underlying numpy/scipy functions I guess. You'll get an update tomorrow!
Worked: On our Windows machine, these versions are installed:
numpy 1.23.3 py39h9061af7_0 conda-forge
scipy 1.9.1 py39h316f440_0 conda-forge
scikit-learn 1.3.0 pypi_0 pypi
cca-zoo 2.1.0
Worked: We also tested in a Docker Container (with Debian-Bookworm as the base image) running on Windows:
numpy 1.21.5 pypi_0 pypi
scipy 1.10.1 py39h7360e5f_0 conda-forge
scikit-learn 1.2.2 py39h86b2a18_0 conda-forge
cca-zoo 2.1.0
Worked: Then I set up a completely fresh conda environment with cca-zoo only:
numpy 1.25.2
scipy 1.11.1
scikit-learn 1.3.0
cca-zoo 2.1.0
Did not work: And in my default conda environment I got these versions:
numpy 1.21.5 pypi_0 pypi
scipy 1.10.1 py38h59b608b_3 conda-forge
scikit-learn 1.3.0 py38hc099248_0 conda-forge
cca-zoo 2.1.0
Hard to say what's causing the issue. I can't really see how numpy or scipy versions could be responsible. Maybe it's a complex interplay between different packages? I also checked whether it's the Python version causing the issue: I created a second environment forcing Python to 3.8.17 (to match the Python version of my non-working environment) but couldn't reproduce the error.
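In case it helps narrow down the interplay, this is the kind of snippet one could run in both environments to compare the numerical stack (the threadpoolctl part only if that package happens to be installed; BLAS/OpenMP builds are a common cause of worker segfaults):

import platform
import numpy as np
import scipy
import sklearn

print("python       ", platform.python_version())
print("numpy        ", np.__version__)
print("scipy        ", scipy.__version__)
print("scikit-learn ", sklearn.__version__)

try:
    from threadpoolctl import threadpool_info
    for lib in threadpool_info():
        print(lib["internal_api"], lib.get("version"), lib["filepath"])
except ImportError:
    np.show_config()  # fallback: shows which BLAS/LAPACK numpy was built against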
Ergh! And this worked in a previous version?
It does seem to be a problem elsewhere: https://stackoverflow.com/questions/53757856/segmentation-fault-when-creating-multiprocessing-array
Geez, that doesn't sound trivial. For now, I will just use the working conda environment for the analysis. Let me know if I should test something out for you. It's probably a good idea to implement a testing workflow with different OS runners in the long term.
Agree about testing with different OSes - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu.
Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness, relying on scikit-learn's dependencies to be about right).
So this passes all the tests on Ubuntu:
Installing numpy (1.24.4)
Installing scipy (1.9.3)
Installing scikit-learn (1.3.0)
If that works then I'll pin the dependencies explicitly to avoid your issue in the future - thanks and apologies! I'm always learning 🙏
Ah no, because I haven't been testing the n_jobs>1 behaviour. Will add that to the tests.
Agree about testing with different OSes - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu. Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness, relying on scikit-learn's dependencies to be about right).

Should be feasible to implement a CI workflow with different OS runners and then run the pytest suite on each of them.
Could send a PR if I have some time.
Hi James, with the latest version of cca_zoo I get this error:
It didn't happen in older versions (although I am using the exact same script). Can you reproduce this? Here's my full code, plus my X, y, and groups attached as txt files.
Data:
groups.txt X.txt y.txt
Note that X and y have been normalized prior to GridSearch, so each fold "sees" different batches of the normalized dataset. Not sure if this is related to https://github.com/jameschapman19/cca_zoo/issues/175