juaml / julearn

Forschungszentrum Jülich Machine Learning Library
https://juaml.github.io/julearn
GNU Affero General Public License v3.0

[BUG]: Target Transformation with reversible transformers leads to faulty scoring #236

samihamdan closed this issue 8 months ago

samihamdan commented 1 year ago

Is there an existing issue for this?

Current Behavior

Using z-scoring on the target leads to wrong scores, probably because we evaluate the correctly inverse-transformed predictions against the z-scored ground truth. You can see this in the image below: r2_corr seems fine, but r2 shows a high error because it is scale-sensitive.

Expected Behavior

Scoring with invertible target transformers should be done against the original ground truth.

Steps To Reproduce

[Screenshot: cross-validation with a z-scored target; r2_corr looks fine while r2 shows a large error]
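
The reproduction only exists as a screenshot; below is a minimal sketch of the kind of pipeline that triggers it. The PipelineCreator / TargetPipelineCreator usage is assumed from the julearn docs, and the dataset and learner ("linreg") are arbitrary stand-ins:

```python
from sklearn.datasets import load_diabetes
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator

# Arbitrary regression data, purely for illustration.
df = load_diabetes(as_frame=True).frame  # features plus a "target" column

# z-score the target through a target pipeline (a reversible transformer).
target_creator = TargetPipelineCreator()
target_creator.add("zscore")

creator = PipelineCreator(problem_type="regression")
creator.add(target_creator, apply_to="target")
creator.add("linreg")

scores = run_cross_validation(
    X=[col for col in df.columns if col != "target"],
    y="target",
    data=df,
    model=creator,
    scoring=["r2", "r2_corr"],
)

# r2 comes out wildly off while r2_corr looks plausible.
print(scores)
```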

Environment

anyio==4.0.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.4.0
async-lru==2.0.4
attrs==23.1.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
comm==0.1.4
contourpy==1.1.0
cycler==0.11.0
debugpy==1.6.7.post1
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.18.0
fonttools==4.42.1
fqdn==1.5.1
idna==3.4
ipykernel==6.25.2
ipython==8.15.0
ipython-genutils==0.2.0
ipywidgets==8.1.0
isoduration==20.11.0
jedi==0.19.0
Jinja2==3.1.2
joblib==1.3.2
json5==0.9.14
jsonpointer==2.4
jsonschema==4.19.0
jsonschema-specifications==2023.7.1
julearn==0.3.0
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.7.0
jupyter-lsp==2.2.0
jupyter_client==8.3.1
jupyter_core==5.3.1
jupyter_server==2.7.3
jupyter_server_terminals==0.4.4
jupyterlab==4.0.5
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.8
jupyterlab_server==2.24.0
kiwisolver==1.4.5
MarkupSafe==2.1.3
matplotlib==3.7.2
matplotlib-inline==0.1.6
mistune==3.0.1
nbclient==0.8.0
nbconvert==7.8.0
nbformat==5.9.2
nest-asyncio==1.5.7
notebook==7.0.3
notebook_shim==0.2.3
numpy==1.25.2
overrides==7.4.0
packaging==23.1
pandas==2.0.3
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.3
pexpect==4.8.0
pickleshare==0.7.5
Pillow==10.0.0
platformdirs==3.10.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
psutil==5.9.5
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
Pygments==2.16.1
pyparsing==3.0.9
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
qtconsole==5.4.4
QtPy==2.4.0
referencing==0.30.2
requests==2.31.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.10.2
scikit-learn==1.3.0
scipy==1.11.2
seaborn==0.12.2
Send2Trash==1.8.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
stack-data==0.6.2
statsmodels==0.14.0
terminado==0.17.1
threadpoolctl==3.2.0
tinycss2==1.2.1
tornado==6.3.3
traitlets==5.9.0
tzdata==2023.3
uri-template==1.3.0
urllib3==2.0.4
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.2
widgetsnbextension==4.0.8

Relevant log output

No response

Anything else?

No response

fraimondo commented 1 year ago

Here's where the extended scorer transforms the y (always):

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/scoring/available_scorers.py#L178-L182

This is where the scorers are "wrapped" only if the extend parameter is true:

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/scoring/available_scorers.py#L161-L164

This is where check_scoring passes the wrap_score parameter as the extend parameter to _extend_scorer:

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/scoring/available_scorers.py#L127-L160

This is where check_scoring is called in run_cross_validation:

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/api.py#L348-L350

Here are the two lines that set wrap_score to True, based on the presence of a target transformer:

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/api.py#L251

https://github.com/juaml/julearn/blob/dba30719ec47527400682bd3bbb833207b119042/julearn/api.py#L321

So we always use the extended scorer, even if the y transformer is reversible. And in this specific case, scikit-learn transforms the y_pred back to the original space and julearn transforms the y_true to the transformed space, comparing bananas with potatoes.
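
A quick numeric sketch of that mismatch with plain numpy/scikit-learn (made-up numbers, not julearn code):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Ground truth on its original scale, and near-perfect predictions that
# scikit-learn has already inverse-transformed back to that scale.
y_true = rng.normal(loc=100.0, scale=20.0, size=200)
y_pred = y_true + rng.normal(scale=1.0, size=200)

# Correct scoring: both quantities live in the original space.
print(r2_score(y_true, y_pred))  # close to 1.0

# The buggy path: y_true is z-scored but y_pred stays untouched.
y_true_z = (y_true - y_true.mean()) / y_true.std()
print(r2_score(y_true_z, y_pred))  # hugely negative, the "high error"
```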

harveybi commented 12 months ago

I also want to report something I observed earlier: although I get wrongly scaled metrics when z-scoring the target, the Pearson correlation values are the same whether the target is z-scored or not. Is that expected? When I z-score the target myself, the other metrics always change. Example: https://chat.openai.com/share/f625997a-eb50-40af-9cbb-89d450cdb364
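
For reference, that part is expected: Pearson correlation is invariant under any positive affine transform, and z-scoring is one, so correlation-based scores cannot see this bug while scale-sensitive metrics like r2 can. A quick check:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
y_true = rng.normal(size=100)
y_pred = y_true + rng.normal(scale=0.5, size=100)

# z-scoring y_true is an affine map (y - mean) / std with positive slope,
# so its Pearson correlation with y_pred is unchanged.
r_plain = pearsonr(y_true, y_pred)[0]
r_zscored = pearsonr((y_true - y_true.mean()) / y_true.std(), y_pred)[0]
print(np.isclose(r_plain, r_zscored))  # True
```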