jameschapman19 / cca_zoo

Canonical Correlation Analysis Zoo: A collection of Regularized, Deep Learning based, Kernel, and Probabilistic methods in a scikit-learn style framework
https://cca-zoo.readthedocs.io/en/latest/
MIT License

TerminatedWorkerError when using GridSearchCV #177

Open JohannesWiesner opened 11 months ago

JohannesWiesner commented 11 months ago

Hi James, with the latest version of cca_zoo I get this error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11), SIGSEGV(-11)}

This didn't happen in older versions (although I am using the exact same script). Can you reproduce this? Here's my full code; my X, y and groups are attached as txt files.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import SCCA_PMD

###############################################################################
## Settings ###################################################################
###############################################################################

n_jobs = 8
pre_dispatch = 3
rng = np.random.RandomState(42)

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

X = np.loadtxt('X.txt')
y = np.loadtxt('y.txt')
groups = np.loadtxt('groups.txt')

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 3

# pretend that there are subject groups in the dataset
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}

# define an estimator
estimator = SCCA_PMD(latent_dimensions=latent_dimensions,random_state=rng)

##############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=n_jobs,cv=cv)
grid.fit([X,y],groups=groups)

Data:

groups.txt X.txt y.txt

Note that X and y have been normalized prior to GridSearch, so each fold "sees" different batches of the normalized dataset. Not sure if this is related to https://github.com/jameschapman19/cca_zoo/issues/175

jameschapman19 commented 11 months ago

OK, I haven't managed to replicate it exactly on my (Windows) laptop or Colab. Will try Linux later.

In the meantime, a small change to your code is to go with:

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}

Instead of

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}

I'm not sure if the code previously supported numpy arrays and I lost this support in a refactor or if it's always been this way. I think I should be able to add the support back relatively easily.

It's possible that this alone would fix your example, because multiprocessing can give confusing error codes whenever there is a bug.

JohannesWiesner commented 11 months ago

True, now I remember that we had this issue before. Unfortunately I still get the error, even when using param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}

P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would make sense if lists, numpy arrays, or other iterables were all valid inputs?
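In the meantime, a small hypothetical helper (listify_param_grid is just an illustration, not part of cca_zoo) that coerces any array-valued candidates to plain lists before the search:

import numpy as np

# Hypothetical workaround until numpy arrays are supported again in param_grid:
# convert any array-valued candidate lists to plain Python lists.
def listify_param_grid(param_grid):
    return {
        key: [list(value) if isinstance(value, np.ndarray) else value for value in values]
        for key, values in param_grid.items()
    }

param_grid = listify_param_grid({'tau': [np.arange(0.1, 1.0, 0.1), 0]})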

JohannesWiesner commented 11 months ago

I then tried to use simulated data and it seems to work with:

import numpy as np
from cca_zoo.data.simulated import LinearSimulatedData

n = 100
p = 10
q = 100
latent_dims = 3

data = LinearSimulatedData(
    view_features=[p, q],
    latent_dims=latent_dims,
    correlation=[0.9, 0.8, 0.7],
    structure='identity'
)
(X, y) = data.sample(n)
groups = np.repeat(np.arange(0, 5, 1), 20)

Here I get LinAlgError: SVD did not converge, but I guess that could stem from a different source (at least GridSearch runs through - sketch below).
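For reference, this is roughly how I plug the simulated data into the search - a sketch that assumes estimator, param_grid, scorer and cv are defined exactly as in the first snippet above (SCCA_PMD + GroupShuffleSplit), not a standalone script:

# Sketch: reuses estimator, param_grid, scorer and cv from the first snippet.
grid = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=8, cv=cv)
grid.fit([X, y], groups=groups)  # the search completes; the LinAlgError shows up during fitting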

jameschapman19 commented 11 months ago

P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would sense if both list, numpy arrays or other iterables would be valid inputs?

Yes, agree

JohannesWiesner commented 11 months ago

I guess I have to take a look at my dataset and check if the error stems from there. Maybe it has something to do with https://github.com/jameschapman19/cca_zoo/issues/175, because that should be the only difference here. Right now, my X, y are already normalized before passing them to GridSearchCV.

JohannesWiesner commented 11 months ago

I will implement StandardScaler to mimic the old scale=True behavior and come back to you. The data should not have changed in the meantime, so right now I don't really see why the error should stem from there.

JohannesWiesner commented 11 months ago

Okay, I tested the following code on a Windows machine and it works fine. On both the Windows and Ubuntu machines I have scikit-learn 1.3.0 and cca-zoo 2.1.0 installed.

import numpy as np
import pandas as pd

from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import rCCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from mvlearn.utils import check_Xs
from sklearn.base import TransformerMixin
from sklearn.utils.validation import check_is_fitted

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

rng = np.random.RandomState(42)

brain_df = np.loadtxt('brain_df.txt')
behavior_df = np.loadtxt('behavior_df.txt')
groups = np.loadtxt('groups.txt')

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 1

# define cross validation strategy
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)

# define a search space (optimize left and right penalty parameters)
param_grid = {'cca__c':[list(np.arange(0.1,1.1,0.1)),list(np.arange(0.1,1.1,0.1))]}

"""
Class which allows for the different (or the same) processing of multiple views of data.
"""

class MultiViewPreprocessing(TransformerMixin):
    def __init__(self, preprocessing_list):
        self.preprocessing_list = preprocessing_list

    def fit(self, views, y=None):
        """Fit the associated preprocessing step to each view.

        Parameters
        ----------
        views : list of array-likes, one entry per view
        y : ignored, kept for scikit-learn API compatibility
        """
        if len(self.preprocessing_list) == 1:
            self.preprocessing_list = self.preprocessing_list * len(views)
        elif len(self.preprocessing_list) != len(views):
            raise ValueError("Length of preprocessing_list must be 1 (apply the same preprocessing to each view) or equal to the number of views")
        check_Xs(views, enforce_views=range(len(self.preprocessing_list)))
        for view, preprocessing in zip(views, self.preprocessing_list):
            preprocessing.fit(view, y)
        return self

    def transform(self, X, y=None):
        """Transform each view using its associated (fitted) preprocessing step.

        Parameters
        ----------
        X : list of array-likes, one entry per view
        y : ignored, kept for scikit-learn API compatibility
        """
        [check_is_fitted(preprocessing) for preprocessing in self.preprocessing_list]
        check_Xs(X, enforce_views=range(len(self.preprocessing_list)))
        return [preprocessing.transform(view) for view, preprocessing in zip(X, self.preprocessing_list)]

# define an estimator
estimator = Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(),StandardScaler()))),
    ('cca',rCCA(latent_dimensions=latent_dimensions,random_state=rng))
    ])

###############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=5,cv=cv)
grid.fit([brain_df,behavior_df],groups=groups)
best_params = grid.best_params_
estimator_best = grid.best_estimator_
X_weights, y_weights = estimator_best['cca'].weights  # weights of the fitted rCCA step

print(f"Best parameters are: {best_params}\n")

Data:

behavior_df.txt brain_df.txt groups.txt

Could you perhaps also re-check on a Linux machine if this is an OS issue?

jameschapman19 commented 11 months ago

Works fine or doesn't work fine?

JohannesWiesner commented 11 months ago

Ah sorry. Works fine on Windows but not on Ubuntu.

jameschapman19 commented 11 months ago

ok - have you got an error message I can see? Otherwise I can try and get one myself XD

JohannesWiesner commented 11 months ago

Here's the complete traceback:

Traceback (most recent call last):

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/work/projects/project_hcp/testing/test_cca.py:108
    grid.fit([brain_df,behavior_df],groups=groups)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:208 in fit
    self = BaseSearchCV.fit(self, np.hstack(X), y=y, groups=groups, **fit_params)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/base.py:1151 in wrapper
    return fit_method(estimator, *args, **kwargs)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:898 in fit
    self._run_search(evaluate_candidates)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:199 in _run_search
    evaluate_candidates(param_grid)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:845 in evaluate_candidates
    out = parallel(

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/utils/parallel.py:65 in __call__
    return super().__call__(iterable_with_config)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1944 in __call__
    return output if self.return_generator else list(output)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1587 in _get_outputs
    yield from self._retrieve()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1691 in _retrieve
    self._raise_error_fast()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1726 in _raise_error_fast
    error_job.get_result(self.timeout)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:735 in get_result
    return self._return_or_raise()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:753 in _return_or_raise
    raise self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11)}

JohannesWiesner commented 11 months ago

Okay, tested it on our Linux server. Same error here. Seems to be an OS-issue!

jameschapman19 commented 11 months ago

OK, I think it's also possible that it's consuming more memory than expected. Will investigate - apologies and thanks for bringing this to my attention!
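In case it helps while I dig: one cheap thing to try is throttling how much work joblib queues up at once - a sketch assuming cca_zoo's GridSearchCV forwards sklearn's n_jobs/pre_dispatch arguments (variable names taken from your pipeline example):

# Fewer workers plus a tight pre_dispatch keeps fewer copies of the data alive
# at the same time; purely a diagnostic setting, not a fix.
grid = GridSearchCV(estimator, param_grid, scoring=scorer,
                    n_jobs=2, pre_dispatch='1*n_jobs', cv=cv)
grid.fit([brain_df, behavior_df], groups=groups)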

jameschapman19 commented 11 months ago

I'm thinking trying two things will help diagnose this:

  1. n_jobs=1
  2. removing preprocessing

1 will help work out if it is multiprocessing causing the problem, 2 will help work out if it is in the pipeline (see the sketch below).
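Something like this, sticking with the names from your pipeline example (so treat it as a sketch rather than a drop-in script; 'c' is just the unprefixed version of your 'cca__c' grid):

# 1) Rule out multiprocessing: same search, single worker.
grid_serial = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=1, cv=cv)
grid_serial.fit([brain_df, behavior_df], groups=groups)

# 2) Rule out the preprocessing step: search over the bare rCCA without the Pipeline.
bare = rCCA(latent_dimensions=latent_dimensions, random_state=rng)
bare_grid = GridSearchCV(bare, {'c': param_grid['cca__c']}, scoring=scorer,
                         n_jobs=5, cv=cv)
bare_grid.fit([brain_df, behavior_df], groups=groups)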

JohannesWiesner commented 11 months ago

Yup, with n_jobs=1 I don't have this issue, but of course that makes sense because apparently it's a parallelization issue. Removing preprocessing does not change the error.

jameschapman19 commented 11 months ago

Thanks for this. Will have a dig around.

jameschapman19 commented 11 months ago

Hi @JohannesWiesner. From some reading I'm thinking this is down to the versions of scipy/numpy, because there's nothing substantive that has changed in rCCA which could have caused this (it essentially does a lossless PCA [keeping all components] for efficiency and then sets up an eigenvalue problem which it hands off to scipy).

So I think if you give updating scipy/numpy a go that might work?
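For context, this is roughly the kind of computation I mean - a toy sketch of standard two-view regularized CCA as a generalized eigenvalue problem handed to scipy (not cca_zoo's actual code, and without the PCA step):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
Y = rng.standard_normal((100, 8))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
c = 0.1  # ridge regularization on each view

# Regularized covariance blocks
Cxx = X.T @ X / len(X) + c * np.eye(X.shape[1])
Cyy = Y.T @ Y / len(Y) + c * np.eye(Y.shape[1])
Cxy = X.T @ Y / len(X)

# Symmetric generalized eigenvalue problem A w = rho B w;
# eigh dispatches to LAPACK, which is where a segfault in a worker would originate.
A = np.block([[np.zeros((10, 10)), Cxy], [Cxy.T, np.zeros((8, 8))]])
B = np.block([[Cxx, np.zeros((10, 8))], [np.zeros((8, 10)), Cyy]])
rho, W = eigh(A, B)
print(rho[-3:])  # the largest eigenvalues are the (regularized) canonical correlations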

JohannesWiesner commented 11 months ago

Hm, are you sure? I also tried it with SCCA_PMD and got the same error. Will update scipy/numpy and also do a sanity check with other sparse CCAs!

JohannesWiesner commented 11 months ago

Ah true, all of those will use the same underlying numpy/scipy functions I guess. You'll get an update tomorrow!

JohannesWiesner commented 11 months ago

Worked: On our Windows machine, these versions are installed:

numpy         1.23.3    py39h9061af7_0    conda-forge
scipy         1.9.1     py39h316f440_0    conda-forge
scikit-learn  1.3.0     pypi_0            pypi
cca-zoo       2.1.0

Worked: We also tested in a Docker container (with Debian Bookworm as the base image) running on Windows:

numpy         1.21.5    pypi_0            pypi
scipy         1.10.1    py39h7360e5f_0    conda-forge
scikit-learn  1.2.2     py39h86b2a18_0    conda-forge
cca-zoo       2.1.0

Worked: Then I set up a completely fresh conda environment with cca-zoo only:

numpy         1.25.2
scipy         1.11.1
scikit-learn  1.3.0
cca-zoo       2.1.0

Did not work: And in my default conda environment I got these versions:

numpy         1.21.5    pypi_0            pypi
scipy         1.10.1    py38h59b608b_3    conda-forge
scikit-learn  1.3.0     py38hc099248_0    conda-forge
cca-zoo       2.1.0

JohannesWiesner commented 11 months ago

Hard to say what's causing the issue. I can't really see how numpy or scipy versions could be responsible. Maybe it's a complex interplay between different packages? I also checked whether the Python version is causing the issue: I created a second environment forcing Python to 3.8.17 (to match the Python version of my non-working environment) but couldn't reproduce the error.
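For anyone comparing environments, a quick way to dump the relevant versions from the running interpreter (plain Python, nothing cca_zoo-specific):

import sys
from importlib.metadata import version

# Print versions straight from the active interpreter, so conda/pip mix-ups show up too.
print(f"python        {sys.version.split()[0]}")
for pkg in ("numpy", "scipy", "scikit-learn", "cca-zoo", "joblib"):
    print(f"{pkg:<13} {version(pkg)}")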

jameschapman19 commented 11 months ago

Ergh! And this worked in a previous version?

jameschapman19 commented 11 months ago

It does seem to be a problem elsewhere: https://stackoverflow.com/questions/53757856/segmentation-fault-when-creating-multiprocessing-array

JohannesWiesner commented 11 months ago

Geez, that doesn't sound trivial. For now, I will just use the working conda environment for the analysis. Let me know if I should test something out for you. Probably a good idea to implement a testing workflow with different OS runners in the long term.

jameschapman19 commented 11 months ago

Agree about testing with different OS - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu.

Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness / relying on scikit-learn dependencies to be about right).

jameschapman19 commented 11 months ago

So this passes all the tests on Ubuntu:

Installing numpy (1.24.4)

Installing scipy (1.9.3)

Installing scikit-learn (1.3.0)

If that works then I’ll hard-pin the dependencies to avoid your issue in the future - thanks and apologies! I’m always learning 🙏

jameschapman19 commented 11 months ago

Ah no, because I haven’t been testing the n_jobs>1 behaviour. Will add it to the tests.
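Something along these lines should cover it - a hypothetical test (the fixture data is made up; only the imports are real cca_zoo names):

import numpy as np
from cca_zoo.linear import SCCA_PMD
from cca_zoo.model_selection import GridSearchCV


def test_gridsearchcv_runs_with_multiple_jobs():
    # Regression test for the n_jobs > 1 code path: the parallel search should
    # finish without the joblib workers segfaulting.
    rng = np.random.RandomState(0)
    X, Y = rng.rand(50, 10), rng.rand(50, 8)
    param_grid = {"tau": [[0.5, 0.9], [0.5, 0.9]]}  # candidate penalties per view
    grid = GridSearchCV(
        SCCA_PMD(latent_dimensions=1, random_state=rng),
        param_grid, n_jobs=2, cv=3,
    )
    grid.fit([X, Y])
    assert hasattr(grid, "best_estimator_")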

JohannesWiesner commented 11 months ago

Agree about testing with different OS - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu.

Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness / relying on scikit-learn dependencies to be about right).

Should be feasible to implement a CI workflow with different OS runners and then run the pytest suite for each of them.

JohannesWiesner commented 11 months ago

Could send a PR if I have some time.