paris-saclay-cds / ramp-workflow

Toolkit for building predictive workflows on top of pydata (pandas, scikit-learn, pytorch, keras, etc.).
https://paris-saclay-cds.github.io/ramp-docs/
BSD 3-Clause "New" or "Revised" License

Misreporting of combined scores in ramp_blend_submissions #275

Closed ntraut closed 2 years ago

ntraut commented 3 years ago

When we run ramp_blend_submissions we get two different values for the combined auc score, reported as "Combined bagged scores" and "Foldwise best bagged scores". For example:

----------------------------
Combined bagged scores
----------------------------
        score    auc
        valid  0.514
        test   0.806
----------------------------
Foldwise best bagged scores
----------------------------
        score    auc
        valid  0.515
        test   0.803

But when we compute the auc from the bagged predictions saved in the folder submissions/training_output, we see that the value reported under "Combined bagged scores" actually corresponds to the auc from y_pred_foldwise_best_bagged_valid.csv, and the one under "Foldwise best bagged scores" corresponds to the auc from y_pred_combined_bagged_test.csv.

So which one is right? And incidentally, is one way of combining preferable to the other?
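
For reference, the cross-check described above can be reproduced along these lines (a minimal sketch only; the label source and the column layout of the prediction CSVs are assumptions for the example, not the documented RAMP output format):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical label source: adapt to how the kit exposes its test labels.
y_test = pd.read_csv("data/test.csv")["target"]

for name in ("y_pred_combined_bagged_test.csv",
             "y_pred_foldwise_best_bagged_test.csv"):
    y_pred = pd.read_csv("submissions/training_output/" + name)
    # Assume the last column holds the predicted probability of the
    # positive class.
    print(name, roc_auc_score(y_test, y_pred.iloc[:, -1]))
```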

agramfort commented 3 years ago

@kegl @albertcthomas @rth any thought on this?

albertcthomas commented 3 years ago

Unfortunately, I am not familiar with the blending feature and the corresponding part of the code.

agramfort commented 3 years ago

@kegl we need you here :)

ntraut commented 3 years ago

As nobody seems to know, I tried to understand it by looking at the code.

In the function blend_submissions: https://github.com/paris-saclay-cds/ramp-workflow/blob/cbc3e831d6598d69810bd1bfe5866e6cbf670374/rampwf/utils/testing.py#L175-L178 the variable that does not go with the others is the name of the csv, which is passed to the function bag_submissions through the argument score_f_name_prefix. So the scores reported by the script are the real ones, and it is the names of the csv files that are wrong.

Regarding the difference between foldwise and combined: from what I understand, in foldwise we simply take the best-performing submission in each fold, whereas in combined the real blending occurs, i.e. in each fold we try to combine a set of submissions to obtain the best possible score. But in doing so the combination is also done fold by fold, which may be the source of the confusion. A sketch of the two strategies is given below.
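
To make the distinction concrete, here is an illustrative sketch (not the actual rampwf implementation; the data layout predictions[submission][fold], y_valid_folds and the greedy loop are assumptions made for the example) of the two per-fold strategies described above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def foldwise_best(predictions, y_valid_folds):
    """Per fold, keep the predictions of the single best-scoring submission."""
    blended = []
    for k, y_valid in enumerate(y_valid_folds):
        scores = {name: roc_auc_score(y_valid, preds[k])
                  for name, preds in predictions.items()}
        best = max(scores, key=scores.get)
        blended.append(predictions[best][k])
    return blended

def combined(predictions, y_valid_folds, max_iter=20):
    """Per fold, greedily add submissions while the blended score improves."""
    blended = []
    for k, y_valid in enumerate(y_valid_folds):
        chosen, best_score = [], -np.inf
        for _ in range(max_iter):
            best_name = None
            for name, preds in predictions.items():
                # Average the already chosen predictions with the candidate.
                candidate = np.mean(
                    [predictions[n][k] for n in chosen] + [preds[k]], axis=0)
                score = roc_auc_score(y_valid, candidate)
                if score > best_score:
                    best_score, best_name = score, name
            if best_name is None:  # no candidate improves the blend any more
                break
            chosen.append(best_name)
        blended.append(np.mean([predictions[n][k] for n in chosen], axis=0))
    return blended
```

In both cases the per-fold results are then bagged across folds, which is why even the combined blend is built foldwise.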

agramfort commented 3 years ago

can you send a doc PR to document this? that will be useful in the future :) thx

ntraut commented 2 years ago

I made a PR to fix the issue, but I don't know where we should document the different outputs of ramp_blend_submissions. Do you have an idea for this?

kegl commented 2 years ago

We should document it here: https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/command_line.html

Simply add a line to https://github.com/paris-saclay-cds/ramp-workflow/blob/master/doc/command_line.rst

@ntraut if you do it in the current PR #285, we can merge it immediately.

Thanks a lot for spotting this!