juaml / julearn

Forschungszentrum Jülich Machine Learning Library
https://juaml.github.io/julearn
GNU Affero General Public License v3.0
30 stars 19 forks source link

[BUG]: Parenthesis '(' or ')' in column names will fail #226

Open fraimondo opened 1 year ago

fraimondo commented 1 year ago

Is there an existing issue for this?

Current Behavior

I just tried run_cross_validation with X_types={"continous": [".*"]. It failed due to some columns having "(" or ")" in the names.

Expected Behavior

No failure

Steps To Reproduce

Just run any example with "(" in the name.

Environment

Julearn dev in julearn_sk_pandas branch

Relevant log output

No response

Anything else?

No response

samihamdan commented 1 year ago

Wait is that valid? Like I know that we can set apply_to = '.*' to select all types, but we also allow to set one type to all column names by '.*'?

If so maybe add a failing test to the branch so we can fix it before merge.

fraimondo commented 1 year ago

The '.*' was just an example. I'm also using it to define types like "ALFF" : "alff_.*".

The issue is the regular expression interpreting "(" wrongly.

fraimondo commented 8 months ago

Ok, so @LeSasse also found this issue a bit annoying.

Mainly, the problem is that "(" and ")" are special characters in regular expressions. So there's no straightforward way of solving this issue "automatically": basically, if we replace "(" for "(", then we are somehow allowing a subset of regular expressions.

My take is that it is unlikely that someone will use a complex regexp with groups, but we should still allow that.

Proposal:

1) Escape "(" and ")" in the regexp by default 2) Add a config flag that disables this, just in case users want to actually use complex regexp.

fraimondo commented 8 months ago

Based on the conversation with Leo, another approach would be to do a check and give the right error messages.

Indeed, thinking it with a clear head, one could simply escape the characters in the column names. However, if the user specifies these column names in X or X_types, then the regular expression issue appears again.

So one option is that we check the column names in the dataframe and warn the users if we detect any special character. In this warning, we can give the users the right julearn.config.set_config parameter, like enable_auto_escape_parenthesis, etc. etc.

So basically a one-liner solution without going into renaming columns.