raphaelvallat / pingouin

Statistical package in Python based on Pandas
https://pingouin-stats.org/
GNU General Public License v3.0
1.61k stars, 139 forks

Variablename "C" leads to a crash in pg.plot_rm_corr #349

Closed: caiusno1 closed this issue 1 year ago

caiusno1 commented 1 year ago

The following code crashes with an incomprehensible error message:

import pandas as pd
import pingouin as pg

df_test = pd.DataFrame({"C":[1,2,3,4,5,6], "X":[1,2,3,4,5,6], "VP":[1,1,2,2,3,3]})
pg.plot_rm_corr(df_test, x="C", y="X", subject="VP")

Trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/patsy/compat.py:36, in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
     35 try:
---> 36     return f(*args, **kwargs)
     37 except Exception as e:

File /opt/conda/lib/python3.10/site-packages/patsy/eval.py:169, in EvalEnvironment.eval(self, expr, source_name, inner_namespace)
    168 code = compile(expr, source_name, "eval", self.flags, False)
--> 169 return eval(code, {}, VarLookupDict([inner_namespace]
    170                                     + self._namespaces))

File <string>:1, in <module>

TypeError: 'Series' object is not callable

The above exception was the direct cause of the following exception:

PatsyError                                Traceback (most recent call last)
Input In [124], in <cell line: 2>()
      1 df_test = pd.DataFrame({"C":[1,2,3,4,5,6], "X":[1,2,3,4,5,6], "VP":[1,1,2,2,3,3]})
----> 2 pg.plot_rm_corr(df_test, x="C", y="X", subject="VP")

File /opt/conda/lib/python3.10/site-packages/pingouin/plotting.py:1005, in plot_rm_corr(data, x, y, subject, legend, kwargs_facetgrid, kwargs_line, kwargs_scatter)
    996 # Calculate rm_corr
    997 # rmc = pg.rm_corr(data=data, x=x, y=y, subject=subject)
    998 
   (...)
   1002 # Q allows to quote variable that do not meet Python variable name rule
   1003 # e.g. if variable is "weight.in.kg" or "2A"
   1004 formula = "Q('%s') ~ C(Q('%s')) + Q('%s')" % (y, subject, x)
-> 1005 model = ols(formula, data=data).fit()
   1007 # Fitted values
   1008 data["pred"] = model.fittedvalues

File /opt/conda/lib/python3.10/site-packages/statsmodels/base/model.py:200, in Model.from_formula(cls, formula, data, subset, drop_cols, *args, **kwargs)
    197 if missing == 'none':  # with patsy it's drop or raise. let's raise.
    198     missing = 'raise'
--> 200 tmp = handle_formula_data(data, None, formula, depth=eval_env,
    201                           missing=missing)
    202 ((endog, exog), missing_idx, design_info) = tmp
    203 max_endog = cls._formula_max_endog

File /opt/conda/lib/python3.10/site-packages/statsmodels/formula/formulatools.py:63, in handle_formula_data(Y, X, formula, depth, missing)
     61 else:
     62     if data_util._is_using_pandas(Y, None):
---> 63         result = dmatrices(formula, Y, depth, return_type='dataframe',
     64                            NA_action=na_action)
     65     else:
     66         result = dmatrices(formula, Y, depth, return_type='dataframe',
     67                            NA_action=na_action)

File /opt/conda/lib/python3.10/site-packages/patsy/highlevel.py:309, in dmatrices(formula_like, data, eval_env, NA_action, return_type)
    299 """Construct two design matrices given a formula_like and data.
    300 
    301 This function is identical to :func:`dmatrix`, except that it requires
   (...)
    306 See :func:`dmatrix` for details.
    307 """
    308 eval_env = EvalEnvironment.capture(eval_env, reference=1)
--> 309 (lhs, rhs) = _do_highlevel_design(formula_like, data, eval_env,
    310                                   NA_action, return_type)
    311 if lhs.shape[1] == 0:
    312     raise PatsyError("model is missing required outcome variables")

File /opt/conda/lib/python3.10/site-packages/patsy/highlevel.py:164, in _do_highlevel_design(formula_like, data, eval_env, NA_action, return_type)
    162 def data_iter_maker():
    163     return iter([data])
--> 164 design_infos = _try_incr_builders(formula_like, data_iter_maker, eval_env,
    165                                   NA_action)
    166 if design_infos is not None:
    167     return build_design_matrices(design_infos, data,
    168                                  NA_action=NA_action,
    169                                  return_type=return_type)

File /opt/conda/lib/python3.10/site-packages/patsy/highlevel.py:66, in _try_incr_builders(formula_like, data_iter_maker, eval_env, NA_action)
     64 if isinstance(formula_like, ModelDesc):
     65     assert isinstance(eval_env, EvalEnvironment)
---> 66     return design_matrix_builders([formula_like.lhs_termlist,
     67                                    formula_like.rhs_termlist],
     68                                   data_iter_maker,
     69                                   eval_env,
     70                                   NA_action)
     71 else:
     72     return None

File /opt/conda/lib/python3.10/site-packages/patsy/build.py:693, in design_matrix_builders(termlists, data_iter_maker, eval_env, NA_action)
    689 factor_states = _factors_memorize(all_factors, data_iter_maker, eval_env)
    690 # Now all the factors have working eval methods, so we can evaluate them
    691 # on some data to find out what type of data they return.
    692 (num_column_counts,
--> 693  cat_levels_contrasts) = _examine_factor_types(all_factors,
    694                                                factor_states,
    695                                                data_iter_maker,
    696                                                NA_action)
    697 # Now we need the factor infos, which encapsulate the knowledge of
    698 # how to turn any given factor into a chunk of data:
    699 factor_infos = {}

File /opt/conda/lib/python3.10/site-packages/patsy/build.py:443, in _examine_factor_types(factors, factor_states, data_iter_maker, NA_action)
    441 for data in data_iter_maker():
    442     for factor in list(examine_needed):
--> 443         value = factor.eval(factor_states[factor], data)
    444         if factor in cat_sniffers or guess_categorical(value):
    445             if factor not in cat_sniffers:

File /opt/conda/lib/python3.10/site-packages/patsy/eval.py:568, in EvalFactor.eval(self, memorize_state, data)
    567 def eval(self, memorize_state, data):
--> 568     return self._eval(memorize_state["eval_code"],
    569                       memorize_state,
    570                       data)

File /opt/conda/lib/python3.10/site-packages/patsy/eval.py:551, in EvalFactor._eval(self, code, memorize_state, data)
    549 def _eval(self, code, memorize_state, data):
    550     inner_namespace = VarLookupDict([data, memorize_state["transforms"]])
--> 551     return call_and_wrap_exc("Error evaluating factor",
    552                              self,
    553                              memorize_state["eval_env"].eval,
    554                              code,
    555                              inner_namespace=inner_namespace)

File /opt/conda/lib/python3.10/site-packages/patsy/compat.py:43, in call_and_wrap_exc(msg, origin, f, *args, **kwargs)
     39     new_exc = PatsyError("%s: %s: %s"
     40                          % (msg, e.__class__.__name__, e),
     41                          origin)
     42     # Use 'exec' to hide this syntax from the Python 2 parser:
---> 43     exec("raise new_exc from e")
     44 else:
     45     # In python 2, we just let the original exception escape -- better
     46     # than destroying the traceback. But if it's a PatsyError, we can
     47     # at least set the origin properly.
     48     if isinstance(e, PatsyError):

File <string>:1, in <module>

PatsyError: Error evaluating factor: TypeError: 'Series' object is not callable
    Q('X') ~ C(Q('VP')) + Q('C')
             ^^^^^^^^^^

The error seems to be caused by using "C" as a variable name in the DataFrame, which can be traced back to patsy (https://github.com/pydata/patsy/issues/174). Even though the error originates in patsy, it would be nice if the user were warned not to use "C" as a variable name.
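To illustrate the mechanism, here is a minimal, dependency-free sketch of the name lookup. It is not patsy's actual code: `evaluate_factor` and the two lambda helpers are hypothetical stand-ins. The only thing taken from the traceback above is that the data namespace is searched before the other namespaces when the factor is evaluated.

```python
from collections import ChainMap


def evaluate_factor(expr, data):
    """Evaluate a formula factor with data columns taking priority over helpers.

    Hypothetical simplification of the lookup order seen in the traceback;
    the helpers below only mimic the names C and Q, not their real behaviour.
    """
    helpers = {
        "Q": lambda name: data[name],                 # stand-in: fetch a column by name
        "C": lambda values: ("categorical", values),  # stand-in: mark as categorical
    }
    # Data first, helpers second: a column called "C" hides the helper C().
    namespace = ChainMap(data, helpers)
    return eval(expr, {}, namespace)


ok_data = {"A": [1, 2], "X": [1, 2], "VP": [1, 1]}
bad_data = {"C": [1, 2], "X": [1, 2], "VP": [1, 1]}

# Works: the name C resolves to the helper function.
print(evaluate_factor("C(Q('VP'))", ok_data))

# Fails: the name C now resolves to the column, which is not callable.
try:
    evaluate_factor("C(Q('VP'))", bad_data)
except TypeError as exc:
    print("TypeError:", exc)
```

In this sketch the column is a plain list rather than a pandas Series, so the message names a different type, but the failure has the same shape as the `'Series' object is not callable` error in the trace.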

remrama commented 1 year ago

Interesting. It seems the same problem arises when using 'Q' as the column/variable name. It's only an issue when the names are capitalized (basically 'Q' and 'C' are off-limits because they are used in that patsy formula).

Doesn't seem like a big deal since the workaround is so easy (i.e., just rename the column), but I agree it seems reasonable to provide a custom error message that tells the user what's going on, since the traceback is kind of overwhelming. Maybe an assertion check on the input.
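A rough sketch of what such an input check could look like. The function name `check_formula_safe`, the `RESERVED` tuple, and the error wording are all hypothetical, not pingouin's actual code. It checks every column name it is given, on the assumption that any column in the data passed to the formula can shadow the helpers.

```python
RESERVED = ("C", "Q")  # helper names used in the formula shown in the traceback


def check_formula_safe(columns):
    """Raise a readable error if a column name would shadow a formula helper.

    Hypothetical helper: `columns` is any iterable of column names,
    e.g. list(data.columns) for a DataFrame.
    """
    clashes = [name for name in columns if name in RESERVED]
    if clashes:
        raise ValueError(
            "Column name(s) %s clash with the formula helpers C() and Q(); "
            "please rename them before calling this function." % clashes
        )


check_formula_safe(["A", "X", "VP"])  # passes silently

try:
    check_formula_safe(["C", "X", "VP"])
except ValueError as exc:
    print(exc)
```

The comparison is case-sensitive on purpose, since lowercase 'c' and 'q' do not trigger the problem. Raising a `ValueError` instead of using a bare `assert` keeps the check active even when Python runs with assertions disabled.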

raphaelvallat commented 1 year ago

Agreed. We should add the error message in two places in the code:

https://github.com/raphaelvallat/pingouin/blob/8c92b6d4cb9bddcbc329716c9f471eeb15afa6eb/pingouin/parametric.py#L1162-L1169

https://github.com/raphaelvallat/pingouin/blob/8c92b6d4cb9bddcbc329716c9f471eeb15afa6eb/pingouin/plotting.py#L1004

Does either one of you want to submit a PR for this?

Thanks, Raphael