kearnz / autoimpute

Python package for Imputation Methods
MIT License

LinAlgError: When `allow_singular is False`, the input matrix must be symmetric positive definite. #82

Open tybuz2021 opened 1 year ago

tybuz2021 commented 1 year ago

Thank you for your library. I ran into an error when using it on this dataset. I tried several approaches but still cannot resolve it:

temp = pd.read_csv('sample for github.csv', low_memory = False)

print_header = lambda msg: print(f"{msg}\n{'-'*len(msg)}")

from autoimpute.imputations import SingleImputer
print_header("Imputing missing data in one line of code with the default SingleImputer")
data_imputed_once = SingleImputer().fit_transform(temp)
print("Imputation Successful!")

I received this error:

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 28 seconds.

LinAlgError                               Traceback (most recent call last)
Input In [27], in <cell line: 5>()
      3 from autoimpute.imputations import SingleImputer
      4 print_header("Imputing missing data in one line of code with the default SingleImputer")
----> 5 data_imputed_once = SingleImputer().fit_transform(temp)
      6 print("Imputation Successful!")

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/imputations/dataframe/single_imputer.py:313, in SingleImputer.fit_transform(self, X, y, **trans_kwargs)
    301 def fit_transform(self, X, y=None, **trans_kwargs):
    302     """Convenience method to fit then transform the same dataset.
    303
    304     Args:
   (...)
    311         X (pd.DataFrame): imputed in place or copy of original.
    312     """
--> 313     return self.fit(X, y).transform(X, **trans_kwargs)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/utils/checks.py:61, in check_data_structure.<locals>.wrapper(d, *args, **kwargs)
     59         err = f"Neither {d_err} nor {a_err} are of type pd.DataFrame"
     60         raise TypeError(err)
---> 61 return func(d, *args, **kwargs)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/utils/checks.py:126, in check_missingness.<locals>.wrapper(d, *args, **kwargs)
    123         raise ValueError("Time series columns must be fully complete.")
    125 # return func if no missingness violations detected, then return wrap
--> 126 return func(d, *args, **kwargs)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/utils/checks.py:173, in check_nan_columns.<locals>.wrapper(d, *args, **kwargs)
    171     err = f"All values missing in column(s) {nc}. Should be removed."
    172     raise ValueError(err)
--> 173 return func(d, *args, **kwargs)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/imputations/dataframe/single_imputer.py:298, in SingleImputer.transform(self, X, imp_ixs, **trans_kwargs)
    296         X.loc[impix, column] = imputer.impute(x, k=k)
    297     else:
--> 298         X.loc[impix, column] = imputer.impute(x)
    299 return X

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/imputations/series/default.py:400, in DefaultPredictiveImputer.impute(self, X)
    398 def impute(self, X):
    399     """Defer transform to the DefaultBaseImputer."""
--> 400     X_ = super().impute(X)
    401     return X_

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/imputations/series/default.py:214, in DefaultBaseImputer.impute(self, X)
    212 # ensure that param is not none, which indicates time series column
    213 if imp:
--> 214     X_ = imp.impute(X)
    215     return X_

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/autoimpute/imputations/series/pmm.py:187, in PMMImputer.impute(self, X)
    183 # get the mean and covariance of the multivariate betas
    184 # betas assumed multivariate normal by linear reg rules
    185 # sample beta w/ cov structure to create realistic variability
    186 beta_means, beta_cov = beta.mean(0), np.cov(beta_.T)
--> 187 beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs())
    189 # predictions for missing y, using bayes alpha + coeff samples
    190 # use these preds for nearest neighbor search from reg results
    191 # neighbors are nearest from prediction model fit on observed
    192 # imputed values are actual y vals corresponding to nearest neighbors
    193 # therefore, this is a form of "hot-deck" imputation
    194 y_pred_bayes = alpha_bayes + beta_bayes.dot(X.T)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/scipy/stats/_multivariate.py:364, in multivariate_normal_gen.__call__(self, mean, cov, allow_singular, seed)
    359 def __call__(self, mean=None, cov=1, allow_singular=False, seed=None):
    360     """Create a frozen multivariate normal distribution.
    361
    362     See multivariate_normal_frozen for more information.
    363     """
--> 364     return multivariate_normal_frozen(mean, cov,
    365                                       allow_singular=allow_singular,
    366                                       seed=seed)

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/scipy/stats/_multivariate.py:734, in multivariate_normal_frozen.__init__(self, mean, cov, allow_singular, seed, maxpts, abseps, releps)
    731 self._dist = multivariate_normal_gen(seed)
    732 self.dim, self.mean, self.cov = self._dist._process_parameters(
    733                                                     None, mean, cov)
--> 734 self.cov_info = _PSD(self.cov, allow_singular=allow_singular)
    735 if not maxpts:
    736     maxpts = 1000000 * self.dim

File /opt/anaconda3/envs/Python3_10_4V1/lib/python3.10/site-packages/scipy/stats/_multivariate.py:167, in _PSD.__init__(self, M, cond, rcond, lower, check_finite, allow_singular)
    164 if len(d) < len(s) and not allow_singular:
    165     msg = ("When `allow_singular is False`, the input matrix must be "
    166            "symmetric positive definite.")
--> 167     raise np.linalg.LinAlgError(msg)
    168 s_pinv = _pinv_1d(s, eps)
    169 U = np.multiply(u, np.sqrt(s_pinv))

LinAlgError: When `allow_singular is False`, the input matrix must be symmetric positive definite.

sample for github.csv
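
For readers following the traceback: the failing call builds a multivariate normal from the covariance of the sampled coefficients, and SciPy rejects that covariance when it is not full rank. The attached CSV has not been examined here, so the following is only a minimal sketch of one way such a matrix can arise, assuming two coefficients whose samples are perfectly linearly related. The array `beta` below is made-up stand-in data, not output from autoimpute.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical stand-in for sampled regression coefficients (assumption):
# the second column is an exact multiple of the first, as could happen
# when two predictors carry the same information.
b = rng.normal(size=1000)
beta = np.column_stack([b, 2.0 * b])

# Same two statistics the traceback shows being passed to SciPy.
beta_means, beta_cov = beta.mean(0), np.cov(beta.T)

# An explicit tolerance keeps the rank check robust to rounding noise.
cov_rank = np.linalg.matrix_rank(beta_cov, tol=1e-10)
print("rank of beta_cov:", cov_rank, "of", beta_cov.shape[0])

try:
    multivariate_normal(beta_means, beta_cov).rvs()
    raised = False
except np.linalg.LinAlgError as exc:
    raised = True
    print("LinAlgError:", exc)
```

If the rank printed is lower than the number of coefficients, checking the input for duplicated, constant, or linearly dependent columns may be a reasonable next step.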

svaningelgem commented 1 year ago

I face the same issue with my dataset.

I changed imputations/series/pmm.py around line 189 to:

        try:
            beta_bayes = np.array(multivariate_normal(beta_means, beta_cov).rvs())
        except np.linalg.LinAlgError:
            beta_bayes = np.array(multivariate_normal(beta_means, beta_cov, allow_singular=True).rvs())

This lets the imputation continue, but I don't know enough statistics to say what it actually does, so it might be doing the wrong thing underneath.
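
As a rough illustration of what the fallback does: with `allow_singular=True`, SciPy accepts a rank-deficient covariance and draws from a degenerate normal, so the samples stay on the lower-dimensional subspace the covariance spans. The sketch below wraps the try/except above in a hypothetical helper, `draw_beta` (not part of autoimpute), and uses a made-up 2x2 covariance. It only shows the sampling behaviour and says nothing about whether the resulting imputations are statistically sound.

```python
import numpy as np
from scipy.stats import multivariate_normal


def draw_beta(beta_means, beta_cov, seed=None):
    """Hypothetical helper mirroring the try/except workaround above."""
    try:
        dist = multivariate_normal(beta_means, beta_cov)
    except np.linalg.LinAlgError:
        # Fall back to a degenerate (singular) multivariate normal.
        dist = multivariate_normal(beta_means, beta_cov, allow_singular=True)
    return np.atleast_1d(dist.rvs(random_state=seed))


# Made-up example: the second coefficient is exactly twice the first,
# so the covariance has rank 1 and the strict constructor fails.
beta_means = np.array([1.0, 2.0])
beta_cov = np.array([[1.0, 2.0], [2.0, 4.0]])

draws = np.array([draw_beta(beta_means, beta_cov, seed=s) for s in range(200)])

# Every draw should keep the linear relation beta_2 = 2 * beta_1,
# up to floating-point rounding.
deviation = np.abs(draws[:, 1] - 2.0 * draws[:, 0]).max()
print("largest deviation from beta_2 = 2 * beta_1:", deviation)
```

In this toy case the draws still vary, but only along one direction, which is what a singular covariance implies. Whether that is acceptable for predictive mean matching depends on why the covariance is singular in the first place.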