mozilla / PRESC

Performance Robustness Evaluation for Statistical Classifiers
Mozilla Public License 2.0

[Outreachy applications] Visualization of an evaluation metric #6

Closed. dzeber closed this issue 4 years ago.

dzeber commented 4 years ago

A common theme in classifier tuning and evaluation is to plot a metric against the values of some parameter with repeated runs to assess variability. We would like to have a general utility for producing such plots.

It should take as input a table with columns (x, y1, y2, ..., yk), i.e. multiple y values for each x, and plot the average y value vs x with the spread of the y values represented in some way, e.g. as a band.
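A minimal sketch of what such a utility could look like, assuming the input is a pandas DataFrame whose first column is x and whose remaining columns are the repeated y measurements. The function name `plot_metric_with_spread` and the min/max band are illustrative choices, not part of the PRESC codebase:

```python
# Sketch only: plot the mean of the y-columns against x, with a band for the spread.
import matplotlib.pyplot as plt
import pandas as pd


def plot_metric_with_spread(df: pd.DataFrame, x_col: str = "x", ax=None):
    """Plot mean(y) vs x with a min-max band showing the spread of repeated runs."""
    if ax is None:
        _, ax = plt.subplots()
    y_cols = [c for c in df.columns if c != x_col]
    y_mean = df[y_cols].mean(axis=1)
    y_min = df[y_cols].min(axis=1)
    y_max = df[y_cols].max(axis=1)
    ax.plot(df[x_col], y_mean, marker="o", label="mean")
    ax.fill_between(df[x_col], y_min, y_max, alpha=0.3, label="min-max spread")
    ax.set_xlabel(x_col)
    ax.set_ylabel("metric")
    ax.legend()
    return ax
```

A standard-deviation band could be substituted for the min/max band if a less extreme view of the spread is preferred.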

Soniyanayak51 commented 4 years ago

@dzeber I would like to work on this.

Sidrah-Madiha commented 4 years ago

@dzeber I have a question about this issue: I am not sure what is meant by the spread of y-values. Does this mean there should be a plot of x against the average y, with color shading from the minimum to the maximum y value at each point?

simmranvermaa commented 4 years ago

I have tried #2 with SVM and KNN on the Wine Quality dataset. I would like to plot the outputs of those runs and also work on this. Should I go for something specific before attempting a general approach, @dzeber?

dzeber commented 4 years ago

> a plot of x against the average y, with color shading from the minimum to the maximum y value at each point?

@Sidrah-Madiha Yes, something like that would work. It's nice to have some representation of the spread showing how far the individual values tend to be from the mean.

dzeber commented 4 years ago

@simran0117 The ultimate goal would be a general approach. However, I recommend using your previous notebook as a place to develop your idea for this, then moving your code into a separate module and making sure it does not depend on the specific model or metric used in the notebook. Also, you should include the notebook in your PR to show an example of your function in action!
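As a rough illustration of the "does not depend on the specific model or metric" point, a module-level helper might only assume a scikit-learn-style estimator and a metric callable. The name `evaluate_over_splits` and its signature are hypothetical, not an existing PRESC function:

```python
def evaluate_over_splits(model, metric, splits):
    """Return the metric value for each (X_train, X_test, y_train, y_test) split.

    Works with any estimator exposing fit/predict and any metric(y_true, y_pred).
    """
    scores = []
    for X_train, X_test, y_train, y_test in splits:
        fitted = model.fit(X_train, y_train)
        scores.append(metric(y_test, fitted.predict(X_test)))
    return scores
```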

mlopatka commented 4 years ago

There have been several questions about this issue, so I'll try to add a bit of context here:

The intention with this issue is to get everyone thinking about the stability of reported performance in ML. Often a model will be decided upon, trained, and then documented. In the worst case only a single performance figure is reported, e.g. "this model classifies 92% of out-of-sample data correctly". This does not tell the whole story.

Issue #6 is an opportunity to propose strategies for examining how much variability there is in the whole system: the available training data can be split in a multitude of different ways, classifier parameters may be tuned to produce different performance for different data-splitting strategies, and parameters in both model training and data splitting may affect not only the raw performance metrics (accuracy, recall, precision, F1, etc.) but anything that leads to different y-label predictions over a dataset x. As mentioned in the issue, the goal is to explore how much those y values can vary depending on the holistic set of parameters that go into building an ML pipeline.
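To make this concrete, here is a hypothetical sketch of how the (x, y1, ..., yk) table from the issue description could be generated: the same classifier is evaluated over repeated train/test splits for each value of a single pipeline parameter (here the test-set fraction), with the dataset, model, and metric all used as stand-ins:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)  # placeholder dataset
rows = []
for test_size in (0.2, 0.3, 0.4, 0.5):  # the x values: one pipeline parameter
    scores = []
    for seed in range(10):  # repeated runs to capture variability
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = KNeighborsClassifier().fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, model.predict(X_te)))
    rows.append([test_size] + scores)

# One row per x value, with columns y1..y10 holding the repeated metric values.
table = pd.DataFrame(rows, columns=["x"] + [f"y{i + 1}" for i in range(10)])
```

The resulting table has exactly the shape the plotting utility above expects, so the spread band then shows how sensitive the reported accuracy is to the choice of split.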