openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics

Metric output standardisation #575

Closed scottgigante-immunai closed 1 year ago

scottgigante-immunai commented 2 years ago

Taking the average of ranks is advised against for biomedical image analysis competitions (https://doi.org/10.1038/s41467-018-07619-7). From the paper:

According to bootstrapping experiments (Figs. 3 and 4), single-metric rankings are statistically highly significantly more robust when (1) the mean rather than the median is used for aggregation and (2) the ranking is performed after the aggregation.

Originally posted by @LuckyMD in https://github.com/openproblems-bio/openproblems/issues/566#issuecomment-1246925162

scottgigante-immunai commented 2 years ago

Proposed solution: each task requires a sample dataset with perfect performance and a sample dataset with "bad" (random, ideally) performance. All metrics are scaled between 0 (bad) and 1 (perfect).
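For concreteness, a minimal sketch of what this baseline scaling could look like (function and variable names are illustrative, not the repo's actual implementation):

```python
import numpy as np

def scale_metric(raw, bad_score, perfect_score):
    """Rescale a raw metric value so the "bad" baseline maps to 0 and the
    "perfect" baseline maps to 1. Works for lower-is-better metrics too,
    since the perfect score is then smaller than the bad score."""
    span = perfect_score - bad_score
    if np.isclose(span, 0):
        raise ValueError("baselines give identical scores; metric has no usable range")
    return (raw - bad_score) / span

# e.g. raw accuracy 0.83 with a random baseline of 0.25 and a perfect baseline of 1.0
scale_metric(0.83, bad_score=0.25, perfect_score=1.0)  # ~0.77
```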

scottgigante-immunai commented 2 years ago

Proposed bad solutions:

Proposed perfect solutions:

Only issue I see is defining what is perfect performance on batch integration, since there's a trade-off between mixing and bio-conservation. cc @danielStrobl

scottgigante-immunai commented 2 years ago

We could allow multiple perfect datasets, in which case 1 would be defined as the best performance among the perfect datasets -- one would be perfect batch mixing, the other no batch mixing / perfect bio-conservation.

scottgigante-immunai commented 2 years ago

If we allow multiple bad solutions, then perfect for bio-conservation (do nothing) can serve as a baseline for batch mixing, and perfect for batch mixing (total mixing) can serve as a baseline for bio-conservation.
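With multiple controls per task, the anchors could be reduced as sketched below (illustrative only, not the repo's implementation; assumes higher raw scores are better):

```python
def control_anchors(negative_scores, positive_scores):
    """Collapse several controls into the two anchors used for rescaling:
    0 is the worst negative control, 1 is the best positive control.
    Assumes higher raw scores are better (flip min/max otherwise)."""
    return min(negative_scores), max(positive_scores)

# batch integration: one positive control mixes batches perfectly, the other
# preserves biology perfectly; whichever scores higher on a metric defines 1.0
lower, upper = control_anchors(negative_scores=[0.10, 0.18], positive_scores=[0.88, 0.95])
scaled = (0.72 - lower) / (upper - lower)  # ~0.73
```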

wes-lewis commented 2 years ago

The proposed bad and perfect methods for denoising sound good to me!

lazappi commented 2 years ago

I really like the idea of providing more relevant reference values to help interpret the metrics (I'm looking at doing something similar for a benchmarking project), but there may be some issues to consider around choosing the reference methods (mostly speculation, but worth considering).

LuckyMD commented 2 years ago

I'll answer from my perspective and based on discussions we had yesterday in the meeting (most of which were things Scott had immediate answers to).

  • What happens if a method outperforms the "perfect" reference (do you allow scores greater than 1)?

Yes, this is fine. It's a good comparative indicator of method performance IMO. We would use the rescaled metric scores to average and then rank in the end, so this shouldn't be an issue.
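As a toy illustration of the aggregate-then-rank scheme (made-up method names and numbers):

```python
import pandas as pd

# rescaled scores for one task: rows are methods, columns are metrics
scores = pd.DataFrame(
    {"metric_a": [0.9, 1.1, 0.4], "metric_b": [0.7, 0.6, 0.5]},
    index=["method_x", "method_y", "method_z"],
)

# average the rescaled metrics first, then rank, as recommended in the paper
# quoted above; a score above 1 simply contributes a higher mean
mean_scores = scores.mean(axis=1)            # x: 0.80, y: 0.85, z: 0.45
ranking = mean_scores.rank(ascending=False)  # y: 1, x: 2, z: 3
```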

  • What happens if a method performs worse than the "random" reference (do you allow negative scores)?

This would be cause for concern and lead to methods being re-evaluated/fixed, and potentially removed if the method itself is at fault. Negative values should not exist.

  • What if the "random" reference is significantly worse than any real method (and therefore all the real methods get similar, very high scores)?

I would view this as a useful assessment of:

  1. metric quality
  2. task difficulty (if the same is also true for other metrics)

Either way, a very low random baseline score still indicates the effective range of a metric IMO; it just means that all methods do very well. This may also be a way to assess saturation of a particular task and whether we are nearing a solution to this challenge (if this is true for all metrics for a task).

scottgigante-immunai commented 2 years ago

Basically agree with all of Malte's points here, with some additional points. Firstly, if we don't enforce the presence of at least two baselines, we have to normalize to the min and max over all methods rather than over the baseline methods alone (or, failing that, we could normalize over the baselines when at least two are present and over all methods otherwise) -- this decision caveats all further discussion of scores < 0 or > 1.
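A rough sketch of that fallback rule (names are illustrative; assumes higher scores are better):

```python
def scaling_anchors(method_scores, baseline_scores):
    """Choose the min/max anchors for rescaling one metric on one dataset:
    use the baselines when at least two are present, otherwise fall back
    to the full set of methods."""
    anchors = baseline_scores if len(baseline_scores) >= 2 else method_scores
    return min(anchors), max(anchors)

methods = [0.42, 0.55, 0.61]
lower, upper = scaling_anchors(methods, baseline_scores=[0.10])  # one baseline -> use methods
scaled = [(s - lower) / (upper - lower) for s in methods]        # [0.0, ~0.68, 1.0]
```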

What happens if a method outperforms the "perfect" reference (do you allow scores greater than 1)?

This indicates to us that we should think harder about what is a perfect reference for this task, but in the interim is still a helpful measurement of the method's performance.

lazappi commented 2 years ago

Sounds reasonable. I suspect there will be a range of edge cases to sort out as this is implemented for more scenarios, though. I wouldn't be too quick to rule out methods performing worse than random. I've only run a few tests for my benchmark, but I've already seen that happen in a few cases. In particular, it can happen when you have metrics that measure opposing things and a method optimises for one of them at the cost of the other, while a "random" method is more neutral.

scottgigante-immunai commented 2 years ago

^ This is a good point. If a method performs worse than random on all or most metrics, that would be concerning (and point to either a bug or a very bad method) but if it performs well on other metrics at the same time I'd be inclined to leave it alone.

LuckyMD commented 2 years ago

I guess negative scores don't matter for the ranking as we can still average them with other metrics before ranking. If overall metric average is negative, then we have something weird going on as you suggest.

dburkhardt commented 2 years ago

Just seeing this now. I think two layers I'd like to add:

  1. I think the risk of a random baseline being pathologically bad and all methods fighting for hundredths of a decimal at the top is a bigger risk than we're acknowledging, but I don't think we need to do anything about it yet.
  2. If there's no perfect baseline for 0 or 1, I don't think we rescale. We just post the baseline on the leaderboard.

LuckyMD commented 2 years ago

Some comments on your points:

  1. I think this is a very real situation for competitions. We may need to revisit this when we start ingesting Open Problems competition results, to see whether it's something we can solve. However, without random and perfect baselines you generally have an even larger range of possible values, so the methods end up fighting over even smaller differences.

  2. I think we need to be consistent: either we rescale all results or none of them. It will be more confusing for users to have to recalibrate what they see. What is the issue with rescaling if there is not a perfect baseline? We may get negative values or values larger than 1, but probably only marginally larger. Is this a problem for users?

scottgigante-immunai commented 2 years ago

I think the risk of a random baseline being pathologically bad and all methods fighting for hundredths of a decimal at the top is a bigger risk than we're acknowledging

Imho if all methods are scoring extremely similarly compared to the gap between random and perfect performance, that is important information. If we had e.g. classification accuracy on MNIST as a task, you would see very quickly that this task is basically solved / all methods perform more or less equally well.

Is this a problem for users?

rcannood commented 2 years ago

Regardless of whether and how any rescaling is done, I think it's definitely valuable to display (or have the option of displaying) the random, perfect and baseline results alongside the other methods. This makes it possible to see how methods compare to different baselines (e.g. majority vote vs. random labels in label projection).

For example, in the NeurIPS 2021 pre-competition paper the Positive Controls (PC) performed better than the Baseline methods (B), which in turn performed better than the Negative Controls (NC).

[screenshot: NeurIPS 2021 pre-competition results (2022-11-16 13-48-43)]

(This visualisation is not ideal because it was trying to display many things at the same time)


I like the idea of rescaling the metrics to be able to compare across datasets and across metrics in a meaningful way. However, I'm afraid that using the perfect solution as a max threshold and the worst random method as a min cutoff is problematic in a few ways:

[screenshot: 2022-11-16 13-54-20]


I would propose always displaying the raw values on the website, at least for the time being. I like being able to toggle between raw and rescaled, but only displaying the rescaled values is very confusing. At any rate, the rescaled values are only really necessary when comparing across datasets or averaging metrics to get a final ranking.

How about we get a set of raw results from our current benchmarks, create a separate repo, and try different rescaling / normalisation methods to be able to discuss and compare the different approaches in a meaningful way?


In the benchmark for trajectory inference methods, we performed normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation. Example:

[screenshot: 2022-11-16 14-02-09]
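A generic sketch of that kind of normalisation (not the trajectory benchmark's actual code): z-score each metric across methods per dataset, then squash through a sigmoid.

```python
import numpy as np
from scipy.special import expit  # logistic sigmoid

def sigmoid_normalise(values):
    """Normalise one metric on one dataset by z-scoring across methods
    and applying a sigmoid, so every output lands strictly in (0, 1)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return expit(z)

sigmoid_normalise([0.2, 0.5, 0.6, 0.9])  # ~[0.20, 0.45, 0.55, 0.80]
```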

Cons:

Pros:

scottgigante-immunai commented 2 years ago

in the dimensionality reduction task, there is no way of defining a positive control

The positive control here is just the unreduced data.

Trying to find a solution which performs well on all of the metrics is similar to solving the task itself.

Except we can cheat :) See above. Also, we don't need a single baseline to perform well on all metrics. Discussed above.

Nobody expects to see an F1 score of -0.03

This is an argument in favour of clearly communicating that the metrics are scaled, not necessarily against displaying the scaled metrics. Having everything range from ~0 to ~1 gives people an idea of how good a score is. Yes, F1 is well understood and most people know what an F1 score of 0.8 means, but how good is a "trustworthiness" score of 0.8? Putting everything on a common scale makes it more interpretable, not less.

have the option of displaying the random, perfect and baseline results alongside the other methods

I like this, and I think we talked about doing it already -- it just requires some kind of flag on the website, since otherwise the task summary will show the perfect baseline as the best-performing method on every dataset.

might not be meaningful, since the actual range of metric values in a given benchmark might be very narrow.

This has also been discussed above. If the range of metric values is very narrow, this is important information -- if all methods score between 0.5 and 0.501 (scaled), this tells us that all methods are extremely similar and we shouldn't pay too much attention to the differences between them on this metric.

normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation

This implies that the best performing method is "perfect", which I think is quite misleading. It also means the scaling will change as we add more methods, which I don't love.

LuckyMD commented 2 years ago

normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation

I would also argue that working with baselines is better than scaling by method results. We did the latter for our data integration benchmark, which meant that overall rankings could change if a new method was added (and this actually happened occasionally). I'm not sure if negative and positive baselines perfectly define the active range of a metric, but at least it's more stable.

have the option of displaying the random, perfect and baseline results alongside the other methods

I agree this should be the default.

Allowing the normalised values to be negative when it's worse than the worst random method and >1 when it's better than the perfect method (??) is very misleading.

I actually think that if we can clarify how the values were scaled that this would be a very interpretable solution. What I worry about more is that having very poorly performing methods on some metric would up-weight that metric compared to others in aggregation... but maybe that's actually okay.

How about we get a set of raw results from our current benchmarks, create a separate repo, and try different rescaling / normalisation methods to be able to discuss and compare the different approaches in a meaningful way?

We have to rank results anyway, so we already need some form of scaling. We can test different approaches, but I would keep one version of scaled data in the results as they are displayed. A toggle between raw and scaled has been suggested several times now, and we should do it. Until we have that... I guess the only question is which version of the results to display now. Maybe raw is a good idea while we fix the communication parts. However, that may also confuse people if they don't understand the weighting and thus the ranking.

scottgigante-immunai commented 2 years ago

TBH I think the toggle is priority #1.

LuckyMD commented 2 years ago

Yeah, you may be right. The website looks pretty good already... further prettification is not as important as being able to fully understand the results.

rcannood commented 2 years ago

I'm working on rendering raw values in openproblems-bio/website#58.

However, while going through the results, I see some strange effects of the scaling. Namely, there are a lot of values which lie outside the [0, 1] range:

| Scaled value range | Count |
|---|---|
| (-1e+03, -100] | 5 |
| (-100, -10] | 10 |
| (-10, -1] | 12 |
| (-1, 0] | 131 |
| (0, 1] | 739 |
| (1, 10] | 535 |
| (10, 100] | 20 |

A histogram per task and per metric gives a little more insight:

[histogram per task and metric ("dens")]

If we order the summary statistics by the maximum absolute value, we can see that the tasks most affected are the ones which don't have a good ground-truth method to fall back on.

| task_id | metric_id | min | q25 | q50 | q75 | max |
|---|---|---|---|---|---|---|
| spatial_decomposition | r2 | -915.892 | -4.229 | 0.093 | 0.527 | 0.890 |
| dimensionality_reduction | local property | -9.519 | 0.807 | 1.581 | 2.969 | 92.007 |
| dimensionality_reduction | co-KNN AUC | 0.852 | 2.074 | 3.080 | 10.301 | 46.493 |
| dimensionality_reduction | continuity | 0.618 | 1.551 | 2.995 | 6.736 | 19.824 |
| denoising | Poisson loss | -9.382 | -0.039 | 0.256 | 0.542 | 0.976 |
| batch_integration_embed | Isolated label Silhouette | -0.903 | 1.204 | 2.118 | 2.916 | 7.820 |
| dimensionality_reduction | co-KNN size | 0.752 | 1.044 | 1.277 | 1.873 | 5.442 |
| dimensionality_reduction | local continuity meta criterion | 0.752 | 1.044 | 1.277 | 1.873 | 5.442 |
| batch_integration_embed | Silhouette | 0.650 | 1.308 | 1.617 | 2.674 | 5.084 |
| batch_integration_graph | ARI | 1.485 | 2.320 | 2.636 | 4.231 | 4.422 |
| batch_integration_graph | Isolated label F1 | 0.072 | 0.804 | 0.904 | 1.039 | 4.260 |
| dimensionality_reduction | global property | -0.028 | 0.396 | 0.724 | 1.511 | 2.195 |
| batch_integration_graph | NMI | 0.568 | 1.168 | 1.282 | 1.468 | 1.568 |
| batch_integration_embed | Batch ASW | -0.435 | 0.517 | 0.770 | 0.998 | 1.474 |
| batch_integration_graph | Graph connectivity | 0.788 | 1.207 | 1.235 | 1.297 | 1.331 |
| batch_integration_embed | Cell Cycle Score | 0.148 | 0.836 | 0.981 | 1.156 | 1.277 |
| dimensionality_reduction | root mean squared error | -0.004 | -0.003 | 0.001 | 0.080 | 1.180 |
| batch_integration_embed | PC Regression | 0.000 | 0.000 | 0.302 | 0.892 | 1.001 |
| label_projection | Accuracy | -0.031 | 0.625 | 0.830 | 0.935 | 0.985 |
| label_projection | F1 score | -0.054 | 0.518 | 0.834 | 0.929 | 0.985 |
| label_projection | Macro F1 score | -0.039 | 0.250 | 0.570 | 0.775 | 0.951 |
| dimensionality_reduction | trustworthiness | 0.279 | 0.467 | 0.732 | 0.779 | 0.933 |
| dimensionality_reduction | density preservation | -0.344 | 0.021 | 0.308 | 0.659 | 0.887 |
| batch_integration_feature | HVG conservation | -0.875 | -0.476 | -0.062 | 0.184 | 0.429 |
| cell_cell_communication_source_target | Precision-recall AUC | -0.019 | 0.005 | 0.022 | 0.036 | 0.499 |
| multimodal_data_integration | Mean squared error | -0.024 | -0.010 | 0.020 | 0.084 | 0.417 |
| multimodal_data_integration | kNN Area Under the Curve | -0.019 | 0.003 | 0.013 | 0.042 | 0.312 |
| denoising | Mean-squared error | 0.030 | 0.165 | 0.184 | 0.276 | 0.304 |
| cell_cell_communication_ligand_target | Precision-recall AUC | 0.028 | 0.043 | 0.049 | 0.054 | 0.182 |
| cell_cell_communication_ligand_target | Odds Ratio | 0.000 | 0.000 | -0.000 | 0.000 | 0.000 |
| cell_cell_communication_source_target | Odds Ratio | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |

In general, I think it's a good idea to have benchmarking pipelines return only raw results (e.g. a TSV containing the task_id, method_id, dataset_id, metric_id and value). Adding the metadata (method labels, metric labels, dataset labels, etc.) and scaling the values should only happen afterwards, so that we can modify those downstream steps without having to rerun the whole pipeline.
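As a sketch of what that separation could look like (the file names and the `is_baseline` column are hypothetical):

```python
import pandas as pd

# raw benchmark output: one row per result, no labels or scaling applied
raw = pd.read_csv("results_raw.tsv", sep="\t")
# expected columns: task_id, method_id, dataset_id, metric_id, value

# example downstream step: baseline-based rescaling per task/dataset/metric,
# assuming a hypothetical `is_baseline` column flags the control methods
def rescale(group):
    anchors = group.loc[group["is_baseline"], "value"]
    lower, upper = anchors.min(), anchors.max()
    group["scaled"] = (group["value"] - lower) / (upper - lower)
    return group

scaled = raw.groupby(["task_id", "dataset_id", "metric_id"], group_keys=False).apply(rescale)
scaled.to_csv("results_scaled.tsv", sep="\t", index=False)
```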

@scottgigante Could you store the raw results somewhere so we can get a better grasp on how to solve this? Could you also add entries for the control methods?

scottgigante commented 2 years ago

I don't have time to reply in full, but:


rcannood commented 2 years ago

Thanks for the quick response, I'll take a look :)

rcannood commented 2 years ago

I think these two plots shed some light on the cases where the baselines are having issues.

dens_raw.pdf compare_dens.pdf

This is also what I ran into when running the NeurIPS 2021 pilot: I couldn't find a positive control method for the embedding task (figure). More detailed results are in the PDFs one directory higher.

I think it might be a good idea to annotate baseline methods to specify whether they are negative controls, positive controls, or only intended to be positive controls.
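Something as simple as the following could work (all names here are hypothetical, not the repo's actual API):

```python
from dataclasses import dataclass
from enum import Enum

class ControlType(Enum):
    NEGATIVE = "negative_control"        # e.g. random embedding, shuffled labels
    POSITIVE = "positive_control"        # known to reach the metric's ceiling
    ASPIRATIONAL = "intended_positive"   # intended as a positive control, not guaranteed

@dataclass
class BaselineAnnotation:
    """Metadata for a baseline method, so downstream scaling code can
    decide which baselines to trust as anchors."""
    method_id: str
    control_type: ControlType

baselines = [
    BaselineAnnotation("random_baseline", ControlType.NEGATIVE),
    BaselineAnnotation("perfect_baseline", ControlType.ASPIRATIONAL),
]
```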

I also think that it will be impossible to define good positive controls for tasks in which there is no ground truth information available (such as denoising and anything to do with embedding). As such, scaling using the [min, max] of baseline results might never really work properly.

Summary of possible scaling approaches

Others?

lazappi commented 1 year ago

In case it's useful: for a benchmarking project I am working on, we are going to use a default method as the positive control rather than trying to design a perfect synthetic method. I think this has a couple of advantages: 1) as we have seen, it's difficult to predict how metrics behave, so a real method should behave more like the other real methods; 2) it makes the scores a bit more interpretable, because you are comparing to something that most people are hopefully familiar with. It does have some downsides though: 1) it's a bit subjective how to choose the default method (especially for newer/less common tasks); 2) you are still relying on the default method performing relatively well on all the metrics, which might not always be the case.

scottgigante-immunai commented 1 year ago

Yeah, I lean away from trying to find a single method that performs well on all metrics, especially in tasks like batch integration where some metrics are in direct contradiction with others.

Some of these issues are already resolved. New results available at https://deploy-preview-60--openproblems-sca.netlify.app/benchmarks/ (raw values at https://github.com/openproblems-bio/openproblems/actions/runs/3550251361)

LuckyMD commented 1 year ago

Just catching up on discussions. Is this still an issue? How many adequate baselines are we missing?

In general, I quite like the baseline-based metric scaling. Method-distribution-based scaling relies on methods capturing the range of meaningful values, which they probably don't as they all try to solve the problem and don't have a poor performance baseline. However, if generating a perfect baseline is challenging, are we increasing the entry barrier for new tasks?

rcannood commented 1 year ago

The dimensionality reduction and batch integration embedding tasks are still a major issue.

Other tasks have minor issues, see here. You can hover over the rows in the table to see more info.

LuckyMD commented 1 year ago

The batch integration embedding issue is just PCR on ComBat-scaled values not working for some reason here. This is not terrible... we just get two NaNs. The isolated label silhouette positive baseline not being good enough is strange, though, as I think @scottgigante-immunai had a one-hot encoding of the labels that should give perfect results there... but it's not terrible either.

The dimensionality reduction task had a fix for better baselines in #712. Has the benchmark been run since then?

scottgigante commented 1 year ago

It has. The cause of the problem is https://github.com/openproblems-bio/openproblems/issues/756. The solution to the benchmarks problem is in https://github.com/openproblems-bio/openproblems/pull/760