[feat] report robustness metric for the models

Hypothesis: a model is robust if it performs well overall on differently transformed versions of the data.

[ ] Which model performs best on average? -> compute each model's AUC on every dataset and take the mean/median.
[ ] Does the model perform better or worse on the other datasets, and by how much? -> for one model, compute the AUC on each dataset and get the delta. -> show a plot of the deltas & compute the average delta.
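
A rough sketch of the two items above, to make the discussion concrete. Everything here is an assumption for illustration: the `auc` dictionary layout, the model and dataset names, the AUC numbers, and the choice to measure each delta against a hypothetical `"original"` reference dataset (the issue does not yet say what the delta is relative to).

```python
from statistics import mean, median

# Hypothetical AUCs, laid out as auc[model][dataset]. Values are made up.
auc = {
    "model_a": {"original": 0.90, "shifted": 0.85, "noisy": 0.80},
    "model_b": {"original": 0.95, "shifted": 0.70, "noisy": 0.60},
}


def summarize(scores, reference="original"):
    """Average AUC and per-dataset deltas for one model.

    Assumption: a delta is the AUC on a dataset minus the AUC on `reference`.
    """
    values = list(scores.values())
    deltas = {
        name: value - scores[reference]
        for name, value in scores.items()
        if name != reference
    }
    return {
        "mean_auc": mean(values),
        "median_auc": median(values),
        "deltas": deltas,
        "mean_delta": mean(list(deltas.values())),
    }


summary = {model: summarize(scores) for model, scores in auc.items()}

# Model with the highest mean AUC across all datasets.
best_on_average = max(summary, key=lambda m: summary[m]["mean_auc"])
```

The `deltas` dictionary per model is what would feed the plot; swapping `mean_auc` for `median_auc` in the last line gives the median-based ranking.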

Problem: some datasets might be very similar, so a model might look good on average simply because it works well on those very similar datasets.

[ ] A robust model should not perform very badly on any dataset -> compute the min AUC for each model. -> also show the worst delta.
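
A possible sketch of the worst-case check, under the same kind of assumptions: the `auc` layout, the names, and the numbers are hypothetical, and the delta is taken against an assumed `"original"` reference dataset.

```python
# Hypothetical AUCs, laid out as auc[model][dataset]. Values are made up.
auc = {
    "model_a": {"original": 0.90, "shifted": 0.85, "noisy": 0.80},
    "model_b": {"original": 0.95, "shifted": 0.70, "noisy": 0.60},
}


def worst_case(scores, reference="original"):
    """Minimum AUC and worst (most negative) delta for one model.

    Assumption: a delta is the AUC on a dataset minus the AUC on `reference`.
    """
    worst_dataset = min(scores, key=scores.get)
    worst_delta = min(
        value - scores[reference]
        for name, value in scores.items()
        if name != reference
    )
    return {
        "min_auc": scores[worst_dataset],
        "worst_dataset": worst_dataset,
        "worst_delta": worst_delta,
    }


report = {model: worst_case(scores) for model, scores in auc.items()}

# Rank by the worst case rather than the average: highest min AUC wins.
most_robust = max(report, key=lambda m: report[m]["min_auc"])
```

Unlike the average, the min AUC is not inflated by several near-duplicate datasets, which is the concern raised in the problem statement above.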

nf-core / deepmodeloptim

[feat] report robustness metric for the models #101