Open amueller opened 5 years ago
IIRC we just use the WEKA Evaluation class in the evaluation engine, which by default computes the weighted average over all class-specific measures. Hence, if you look at the F-measure, you actually see the weighted average. I agree that this is confusing.
To check, let's take this run: https://www.openml.org/r/9199162 The 'large' number that you see with the F-measure is the one also used on the task page. And if you compute the weighted F-measure you can see that this is indeed the value you expect: 0.9917 * (3541/3772) + 0.8625 * (231/3772) = 0.9838
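In code, that weighted average is just the per-class F-measures weighted by class support (the numbers below are copied from the run above):

```python
# Per-class F-measures and support (instance counts) from run 9199162.
f_per_class = [0.9917, 0.8625]  # majority class vs. minority class
support = [3541, 231]           # instances per class, 3772 in total

n = sum(support)
weighted_f = sum(f * s / n for f, s in zip(f_per_class, support))
print(round(weighted_f, 4))  # → 0.9838
```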
If you check the API: https://www.openml.org/api/v1/run/9199162 You can see that it returns the weighted score but simply called 'f_measure'
The evaluation measure documentation is clearly wrong, and https://www.openml.org/a/evaluation-measures/f-measure doesn't say anything about weighting.
What to do... Changing the naming in the API/database would be a very big change. It's probably best to fix the documentation, explaining that the non-prefixed versions compute the weighted average, and to remove the confusing 'mean-weighted-f-measure'?
Thoughts?
On Wed, 5 Jun 2019 at 16:48, Andreas Mueller notifications@github.com wrote:
> Related: #20 https://github.com/openml/EvaluationEngine/issues/20
> Currently no measure is computed that's useful for highly imbalanced classes. Take for example sick: https://www.openml.org/t/3021
> I would like to see the "mean" measures be computed in particular (they also are helpful for comparison with D3M, cc @joaquinvanschoren https://github.com/joaquinvanschoren).
> On the other hand, the "weighted" measures are not computed but seem to be duplicates of the measure without prefix, which is also weighted by class size: https://www.openml.org/a/evaluation-measures/mean-weighted-f-measure https://www.openml.org/a/evaluation-measures/f-measure
> Though that's not entirely clear from the documentation. If the f-measure documentation is actually accurate (which I don't think it is), that would be worse because it's unclear for which class the f-measure is reported.
yes, I agree with your conclusion. Let's just remove the weighted one and fix the docs.
Do you have comments on computing the other one, the mean f-measure?
AFAIK we don't compute the mean f-measure in the backend, you'd need to grab the per-class scores and average yourself I'm afraid. @janvanrijn: do you feel like adding this to the evaluation engine?
@joaquinvanschoren why is it in the drop-down then? ;)
> do you feel like adding this to the evaluation engine?
Not sure if adding an additional 'unweighted' version would be a great idea, as these tables already put a massive load on our storage. I am open to updates in the API/evaluation engine that make this more convenient, though.
@janvanrijn: That would work!
I'm not sure I follow. What are the entries in the drop-down based on if not the things in the evaluation engine?
I would presume this list: https://www.openml.org/api/v1/evaluationmeasure/list
Well, OK, that's a response from the backend server, right? So that's generated from the database? Shouldn't there be some synchronization between the metrics in the database and the metrics computed by the evaluation engine?
The API returns a list of all measures known to OpenML: https://www.openml.org/api/v1/evaluationmeasure/list
But indeed not all of those are returned all the time (some are never, apparently).
I could add a check for every measure to see if any of the runs contains that measure. I think I didn't do this before since it's not exactly cheap...
I think it would be more helpful to:

- Only show the things in the drop-down menu that are available for that particular run
- Have a list of what the evaluation engine computes
I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
also @joaquinvanschoren what's the definition of 'known' in this? Is it "it's in this database"?
> Only show the things in the drop down menu that are available for that particular run
You mean for that particular task?
> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
It's a great time to suggest which one you'd like :).
> also @joaquinvanschoren what's the definition of known in this? Is it "it's in this database"?
Yes...
> Have a list of what the evaluation engine computes
Probably, a mapping between task types and what an evaluation engine computes. Also, officially, there can be multiple evaluation engines.
> You mean for that particular task?
yes, sorry
> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
Macro f1 would be good for D3M, otherwise I'd probably prefer macro average recall and/or macro average AUC.
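To illustrate why the macro ("mean") variants matter for imbalanced data, here is a small sketch with made-up per-class recall scores: the macro average treats both classes equally, while the support-weighted average is dominated by the majority class and hides the poor minority-class score:

```python
# Hypothetical per-class recall scores on a heavily imbalanced task.
recall_per_class = [0.99, 0.50]  # majority class vs. minority class
support = [9900, 100]

n = sum(support)
macro = sum(recall_per_class) / len(recall_per_class)
weighted = sum(r * s / n for r, s in zip(recall_per_class, support))
print(round(macro, 3))     # → 0.745, penalised by the poor minority class
print(round(weighted, 4))  # → 0.9851, looks fine despite the minority class
```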
> also @joaquinvanschoren what's the definition of known in this? Is it "it's in this database"?
yes
That seems... kinda circular? So that's just an arbitrary list? Alright...
As Jan suggested, the API could compute the macro-averaged precision, recall, f1, and auc on the fly based on the per-class scores and return them.
not sure what "on the fly" means here.
Note: for this to show up in the old frontend I'd need to finish the new indexer (which works on top of the API rather than on the database).
> not sure what "on the fly" means here.
As Jan explained, computing these in advance would add many millions of rows to the database. The API could instead get the per-class scores, compute the macro-averages, and then return them in its response.
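A minimal sketch of that on-the-fly step, assuming the per-class scores for a run have already been fetched from the database (the function name is hypothetical, not an existing API):

```python
def macro_average(per_class_scores):
    """Macro average: the unweighted mean of the stored per-class scores."""
    return sum(per_class_scores) / len(per_class_scores)

# e.g. the per-class F-measures of run 9199162, as stored in the database
print(round(macro_average([0.9917, 0.8625]), 4))  # → 0.9271
```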
@joaquinvanschoren ok, but then we couldn't show them on the website, right? There are hundreds of runs on a given dashboard, and that would never finish in time.
It would slow down the response from the API, yes. That in turn may slow down the website.
Hard to say what is faster. Computing them on the fly means that the SQL query is equally fast but the extra computations may slow down the final response. Adding them to the database may slow down the SQL query a bit but keeps the response writing equally fast.
I don't know how slow the database would get with adding them to the database but on the fly doesn't seem feasible to me. For a medium sized dataset this could easily take a second per run, and there might be 10000 runs to render. How many instances of the evaluation server do we run in parallel?
Oh, but we wouldn't compute these from the predictions. We already store the per-class scores for all runs in the database. It would just be a matter of fetching them and computing the average.
oh, right, my bad.