Hypothesis: a model is robust if it works overally good with differently transformed data.
[ ] What is the model that best performs in average? -> compute the model AUC on each dataset and get the average/median.
[ ] Is the model performing better or worse on other datasets, and how much? -> for one model, compute the AUC on each dataset and get the delta. -> show a plot of the deltas & compute the average delta.
Problem: some datasets might be very similar, so one model might be averagely good because of the fact that it works well on those very similar datasets.
[ ] A model that is robust should not work super bad on any other dataset -> compute the min AUC for each model. -> also show the worse delta
Hypothesis: a model is robust if it works overally good with differently transformed data.
Problem: some datasets might be very similar, so one model might be averagely good because of the fact that it works well on those very similar datasets.