uber / causalml

Uplift modeling and causal inference with machine learning algorithms

An error "LinAlgError: Singular matrix" is raised when running FilterSelect with the LR method #781

Open xhxt2008 opened 3 weeks ago

xhxt2008 commented 3 weeks ago

Describe the bug: When I run the code below with my data:

# LR Filter with order 2
method = 'LR'
f_imp = filter_method.get_importance(
    data_train,
    data_train.columns.tolist(),
    y_name='label',
    method=method,
    experiment_group_column='ff_rate',
    control_group=0,
    treatment_group=1,
    order=2,
)
f_imp.head()

An error occurred:

LinAlgError                               Traceback (most recent call last)
Cell In[40], line 3
      1 # LR Filter with order 2
      2 method = 'LR'
----> 3 f_imp = filter_method.get_importance(data_train, data_train.columns.tolist(), y_name='label', method=method, experiment_group_column='ff_rate', control_group=0, treatment_group=1, order=2)
      4 f_imp.head()

File /usr/local/lib64/python3.9/site-packages/causalml/feature_selection/filters.py:642, in FilterSelect.get_importance(self, data, features, y_name, method, experiment_group_column, control_group, treatment_group, n_bins, null_impute, order, disp)
    638     data["treatment_indicator"] = 0
    639     data.loc[
    640         data[experiment_group_column] == treatment_group, "treatment_indicator"
    641     ] = 1
--> 642     all_result = self.filter_LR(
    643         data=data,
    644         disp=disp,
    645         treatment_indicator="treatment_indicator",
    646         features=features,
    647         y_name=y_name,
    648         order=order,
    649     )
    650 else:
    651     all_result = self.filter_D(
    652         data=data,
    653         method=method,
   (...)
    659         null_impute=null_impute,
    660     )

File /usr/local/lib64/python3.9/site-packages/causalml/feature_selection/filters.py:247, in FilterSelect.filter_LR(self, data, treatment_indicator, features, y_name, order, disp)
    245 all_result = pd.DataFrame()
    246 for x_name_i in features:
--> 247     one_result = self._filter_LR_one_feature(
    248         data=data,
    249         treatment_indicator=treatment_indicator,
    250         feature_name=x_name_i,
    251         y_name=y_name,
    252         order=order,
    253         disp=disp,
    254     )
    255     all_result = pd.concat([all_result, one_result])
    257 all_result = all_result.sort_values(by="score", ascending=False)

File /usr/local/lib64/python3.9/site-packages/causalml/feature_selection/filters.py:200, in FilterSelect._filter_LR_one_feature(data, treatment_indicator, feature_name, y_name, order, disp)
    198 # Full model (with interaction)
    199 model_r = sm.Logit(Y, X[x_name_r])
--> 200 result_r = model_r.fit(disp=disp)
    202 model_f = sm.Logit(Y, X[x_name_f])
    203 result_f = model_f.fit(disp=disp)

File ~/.local/lib/python3.9/site-packages/statsmodels/discrete/discrete_model.py:2599, in Logit.fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
   2596 @Appender(DiscreteModel.fit.__doc__)
   2597 def fit(self, start_params=None, method='newton', maxiter=35,
   2598         full_output=1, disp=1, callback=None, **kwargs):
-> 2599     bnryfit = super().fit(start_params=start_params,
   2600                           method=method,
   2601                           maxiter=maxiter,
   2602                           full_output=full_output,
   2603                           disp=disp,
   2604                           callback=callback,
   2605                           **kwargs)
   2607     discretefit = LogitResults(self, bnryfit)
   2608     return BinaryResultsWrapper(discretefit)

File ~/.local/lib/python3.9/site-packages/statsmodels/discrete/discrete_model.py:243, in DiscreteModel.fit(self, start_params, method, maxiter, full_output, disp, callback, **kwargs)
    240 else:
    241     pass  # TODO: make a function factory to have multiple call-backs
--> 243 mlefit = super().fit(start_params=start_params,
    244                      method=method,
    245                      maxiter=maxiter,
    246                      full_output=full_output,
    247                      disp=disp,
    248                      callback=callback,
    249                      **kwargs)
    251 return mlefit

File ~/.local/lib/python3.9/site-packages/statsmodels/base/model.py:582, in LikelihoodModel.fit(self, start_params, method, maxiter, full_output, disp, fargs, callback, retall, skip_hessian, **kwargs)
    580     Hinv = cov_params_func(self, xopt, retvals)
    581 elif method == 'newton' and full_output:
--> 582     Hinv = np.linalg.inv(-retvals['Hessian']) / nobs
    583 elif not skip_hessian:
    584     H = -1 * self.hessian(xopt)

File <__array_function__ internals>:200, in inv(*args, **kwargs)

File /usr/local/lib64/python3.9/site-packages/numpy/linalg/linalg.py:538, in inv(a)
    536 signature = 'D->D' if isComplexType(t) else 'd->d'
    537 extobj = get_linalg_error_extobj(_raise_linalgerror_singular)
--> 538 ainv = _umath_linalg.inv(a, signature=signature, extobj=extobj)
    539 return wrap(ainv.astype(result_t, copy=False))

File /usr/local/lib64/python3.9/site-packages/numpy/linalg/linalg.py:89, in _raise_linalgerror_singular(err, flag)
     88 def _raise_linalgerror_singular(err, flag):
---> 89     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix

umath_linalg.inv got a singular matrix.

ras44 commented 2 weeks ago

hi @xhxt2008 , just looking at this quickly, but here are two potential reasons behind this:

1. one or more of your features could be constant (or nearly constant), making the design matrix singular or ill-conditioned;
2. the sample size could be too small relative to the number of terms being fit.

From my understanding of the code, it looks like it's attempting to fit an LR model to each of the features. Either of the two cases above could cause the matrix to be ill-conditioned or singular.
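As a minimal sketch of that failure mode (plain statsmodels on synthetic data, nothing assumed about your dataset): a constant feature is perfectly collinear with the intercept column, so the Hessian is singular and the final inversion in statsmodels' model.py fails just like in your traceback:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)  # synthetic binary outcome

# Intercept plus a constant feature: the two columns are identical,
# so the Hessian X'WX is singular and the default Newton fit ends
# with LinAlgError: Singular matrix.
X = np.column_stack([np.ones(n), np.ones(n)])
sm.Logit(y, X).fit()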

xhxt2008 commented 2 weeks ago

Thank you for the reply!

The sample size is over 200,000; is that too small? The F filter worked fine, though.

ras44 commented 2 weeks ago

The sample size doesn't look inherently "small", but it does depend on the distributions of your features. If any feature is constant, it will result in a singular matrix which can't be inverted. I think this could also happen if your features are highly skewed (so maybe not constant, but "almost constant").
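A quick way to screen for that before running the filter (a rough sketch reusing the column names from your snippet; the 0.99 cutoff is an arbitrary threshold to tune for your data):

import pandas as pd

# Hypothetical screen: flag constant or near-constant columns.
# 'label' and 'ff_rate' are the outcome and treatment columns above.
feature_cols = [c for c in data_train.columns if c not in ('label', 'ff_rate')]
screen = pd.DataFrame({
    'nunique': data_train[feature_cols].nunique(),
    # share of rows taken by each column's most frequent value
    'top_freq': data_train[feature_cols].apply(
        lambda s: s.value_counts(normalize=True).iloc[0]
    ),
})
suspects = screen[(screen['nunique'] <= 1) | (screen['top_freq'] > 0.99)]
print(suspects)

Dropping (or re-binning) any flagged columns before calling get_importance should avoid the singular per-feature logit fits.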