Open amueller opened 5 years ago
IIRC we just use the WEKA Evaluation class in the evaluation engine, which by default computes the weighted average over all class-specific measures. Hence, if you look at the F-measure, you actually see the weighted average. I agree that this is confusing.
To check, let's take this run: https://www.openml.org/r/9199162 The 'large' number that you see with the F-measure is the one also used on the task page. And if you compute the weighted F-measure you can see that this is indeed the value you expect: 0.9917 * (3541/3772) + 0.8625 * (231/3772) = 0.9838
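In code, that weighted average is just the per-class F-measures weighted by class support (the numbers below are copied from the run above):

```python
# Per-class F-measures and support (instance counts) from run 9199162.
f_per_class = [0.9917, 0.8625]  # majority class vs. minority class
support = [3541, 231]           # instances per class, 3772 in total

n = sum(support)
weighted_f = sum(f * s / n for f, s in zip(f_per_class, support))
print(round(weighted_f, 4))  # → 0.9838
```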
If you check the API: https://www.openml.org/api/v1/run/9199162 You can see that it returns the weighted score but simply called 'f_measure'
The evaluation measure documentation is clearly wrong, and https://www.openml.org/a/evaluation-measures/f-measure doesn't say anything about weighting.
What to do... Changing the naming in the API/database would be a very big change. It's probably best to fix the documentation, explaining that the non-prefixed versions compute the weighted average, and to remove the confusing 'mean-weighted-f-measure'?
Thoughts?
On Wed, 5 Jun 2019 at 16:48, Andreas Mueller notifications@github.com wrote:
> Related: #20 https://github.com/openml/EvaluationEngine/issues/20
> Currently no measure is computed that's useful for highly imbalanced classes. Take for example sick: https://www.openml.org/t/3021
> I would like to see the "mean" measures be computed in particular (they also are helpful for comparison with D3M, cc @joaquinvanschoren https://github.com/joaquinvanschoren).
> On the other hand, the "weighted" measures are not computed but seem to be duplicates of the measure without prefix, which is also weighted by class size: https://www.openml.org/a/evaluation-measures/mean-weighted-f-measure https://www.openml.org/a/evaluation-measures/f-measure
> Though that's not entirely clear from the documentation. If the f-measure documentation is actually accurate (which I don't think it is), that would be worse because it's unclear for which class the f-measure is reported.
yes, I agree with your conclusion. Let's just remove the weighted one and fix the docs.
Do you have comments on computing the other one, the mean f-measure?
AFAIK we don't compute the mean f-measure in the backend, you'd need to grab the per-class scores and average yourself I'm afraid. @janvanrijn: do you feel like adding this to the evaluation engine?
@joaquinvanschoren why is it in the drop-down then? ;)
> do you feel like adding this to the evaluation engine?
Not sure if adding an additional 'unweighted' version would be a great idea, as these tables already put a massive load on our storage. I am open to updates in the API/evaluation engine that make this more convenient, though.
@janvanrijn: That would work!
I'm not sure I follow. What are the entries in the drop-down based on if not the things in the evaluation engine?
I would presume this list: https://www.openml.org/api/v1/evaluationmeasure/list
Well, OK, that's a response from the backend server, right? So that's generated from the database? Shouldn't there be some synchronization between the metrics in the database and the metrics computed by the evaluation engine?
The API returns a list of all measures known to OpenML: https://www.openml.org/api/v1/evaluationmeasure/list
But indeed not all of those are returned all the time (some are never, apparently).
I could add a check for every measure to see if any of the runs contains that measure. I think I didn't do this before since it's not exactly cheap...
I think it would be more helpful to:

- Only show the things in the drop-down menu that are available for that particular run
- Have a list of what the evaluation engine computes
I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
also @joaquinvanschoren what's the definition of 'known' in this? Is it "it's in this database"?
> Only show the things in the drop down menu that are available for that particular run
You mean for that particular task?
> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
It's a great time to suggest which one you'd like :).
> also @joaquinvanschoren what's the definition of known in this? Is it "it's in this database"?
Yes...
> Have a list of what the evaluation engine computes
Probably, a mapping between task types and what an evaluation engine computes. Also, officially, there can be multiple evaluation engines.
> You mean for that particular task?
yes, sorry
> I think it would be good to have some meaningful measure of performance for imbalanced multi-class classification computed by default.
Macro f1 would be good for D3M, otherwise I'd probably prefer macro average recall and/or macro average AUC.
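To illustrate why the macro ("mean") variants matter for imbalanced data, here is a small sketch with made-up per-class recall scores: the macro average treats both classes equally, while the support-weighted average is dominated by the majority class and hides the poor minority-class score:

```python
# Hypothetical per-class recall scores on a heavily imbalanced task.
recall_per_class = [0.99, 0.50]  # majority class vs. minority class
support = [9900, 100]

n = sum(support)
macro = sum(recall_per_class) / len(recall_per_class)
weighted = sum(r * s / n for r, s in zip(recall_per_class, support))
print(round(macro, 3))     # → 0.745, penalised by the poor minority class
print(round(weighted, 4))  # → 0.9851, looks fine despite the minority class
```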
> also @joaquinvanschoren what's the definition of known in this? Is it "it's in this database"?
yes
That seems... kinda circular? So that's just an arbitrary list? Alright...
As Jan suggested, the API could compute the macro-averaged precision, recall, f1, and auc on the fly based on the per-class scores and return them.
not sure what "on the fly" means here.
Note: for this to show up in the old frontend I'd need to finish the new indexer (which works on top of the API rather than on the database).
> not sure what "on the fly" means here.
As Jan explained, computing these in advance would add many millions of rows to the database. The API could instead get the per-class scores, compute the macro-averages, and then return them in its response.
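A minimal sketch of that on-the-fly step, assuming the per-class scores for a run have already been fetched from the database (the function name is hypothetical, not an existing API):

```python
def macro_average(per_class_scores):
    """Macro average: the unweighted mean of the stored per-class scores."""
    return sum(per_class_scores) / len(per_class_scores)

# e.g. the per-class F-measures of run 9199162, as stored in the database
print(round(macro_average([0.9917, 0.8625]), 4))  # → 0.9271
```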
@joaquinvanschoren ok, but then we couldn't show them on the website, right? There are hundreds of runs on a given dashboard, and that would never finish in time.
It would slow down the response from the API, yes. That in turn may slow down the website.
Hard to say what is faster. Computing them on the fly means that the SQL query is equally fast but the extra computations may slow down the final response. Adding them to the database may slow down the SQL query a bit but keeps the response writing equally fast.
I don't know how slow the database would get with adding them to the database but on the fly doesn't seem feasible to me. For a medium sized dataset this could easily take a second per run, and there might be 10000 runs to render. How many instances of the evaluation server do we run in parallel?
Oh, but we wouldn't compute these from the predictions. We already store the per-class scores for all runs in the database. It would just be a matter of fetching them and computing the average.
oh, right, my bad.