Proposed solution: each task requires a sample dataset with perfect performance and a sample dataset with "bad" (random, ideally) performance. All metrics are scaled between 0 (bad) and 1 (perfect).
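A minimal sketch of the proposed rescaling (illustrative names, not the actual openproblems API):

```python
def scale_metric(raw, bad, perfect):
    """Rescale a raw metric value so the 'bad' baseline maps to 0 and the
    'perfect' baseline maps to 1. Works for both higher-is-better and
    lower-is-better metrics, since 'perfect' may be below 'bad'."""
    return (raw - bad) / (perfect - bad)

# e.g. a method scoring 0.8 where the random baseline scores 0.5
# and the perfect baseline scores 0.95:
scale_metric(0.8, bad=0.5, perfect=0.95)  # ~0.67
```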
Proposed bad solutions:
- batch_integration_graph: graph edges are created randomly within-batch, so every cell is connected only within its own batch but no biology is conserved
- batch_integration_embedding: embeddings are generated by PCA and then coordinates shuffled within each batch
- batch_integration_feature: counts are shuffled within each batch
- denoising: do nothing
- label_projection: random labels, weighted by class frequency
- multimodal_data_integration: aligned embeddings are generated from a random normal distribution
- regulatory_effect_prediction: gene scores are generated from a random uniform distribution
- spatial_decomposition: assign the same proportions to every voxel, proportionate to class frequency
- cell_cell_communication: choose ligands/receptors and cell types randomly and assign a random uniform score

Proposed perfect solutions:
- batch_integration_graph: ?
- batch_integration_embedding: ?
- batch_integration_feature: ?
- denoising: copy the true counts from the test set
- label_projection: copy the true labels from the test set
- multimodal_data_integration: aligned embeddings are generated from one modality and copied to the other
- regulatory_effect_prediction: gene scores are copied from expression data
- spatial_decomposition: proportions are copied from ground truth
- cell_cell_communication: copy interactions from the target data

Only issue I see is defining what is perfect performance on batch integration, since there's a trade-off between mixing and bio-conservation. cc @danielStrobl
We could allow multiple perfect datasets, in which case 1 will be defined as the best performance among the perfect datasets -- one would be perfect batch mixing, the other would be no batch mixing / perfect bio conservation
If we allow multiple bad solutions, then perfect for bio-conservation (do nothing) can serve as a baseline for batch mixing, and perfect for batch mixing (total mixing) can serve as a baseline for bio-conservation.
The proposed bad and perfect methods for denoising sound good to me!
I really like the idea of providing more relevant reference values to help interpret the metrics (I'm looking at doing something similar for a benchmarking project) but there may be some issues to consider around choosing what the reference methods are (mostly speculation but worth considering).
I'll answer from my perspective and based on discussions we had yesterday in the meeting (most of which were things Scott had immediate answers to).
- What happens if a method outperforms the "perfect" reference (do you allow scores greater than 1)?
Yes, this is fine. It's a good comparative indicator of method performance IMO. We would use the rescaled metric scores to average and then rank in the end, so this shouldn't be an issue.
- What happens if a method performs worse than the "random" reference (do you allow negative scores)?
This would be cause for concern and would lead to methods being re-evaluated/fixed, and potentially removed if the problem is the method itself. Negative values should not exist.
- What if the "random" reference is significantly worse than any real method (and therefore all the real methods get similar, very high scores)?
I would view this as a useful assessment of:
Either way, having a very low random baseline score still indicates the effective range of a metric IMO; it just means that all methods do very well. This may also be a way to assess saturation of a particular task, i.e. that we are nearing a solution to this challenge (if this is true for all metrics of a task).
Basically agree with all of Malte's points here, with some additional points. Firstly, if we don't enforce the presence of at least two baselines, we have to normalize to the min and max over all methods rather than over all baseline methods (or, failing that, we could normalize over all baselines if at least two are present, and use all methods otherwise). This decision caveats all further discussion of scores < 0 or > 1.
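A minimal sketch of that fallback rule (hypothetical function and argument names, not the actual codebase):

```python
def normalisation_bounds(scores, baselines):
    """Return the (min, max) used to rescale one metric on one dataset:
    use the baseline scores if at least two baselines were run,
    otherwise fall back to the scores of all methods.

    scores: dict mapping method_id -> raw metric value
    baselines: set of method_ids that are baseline methods
    """
    baseline_scores = [v for m, v in scores.items() if m in baselines]
    reference = baseline_scores if len(baseline_scores) >= 2 else list(scores.values())
    return min(reference), max(reference)
```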
What happens if a method outperforms the "perfect" reference (do you allow scores greater than 1)?
This indicates to us that we should think harder about what is a perfect reference for this task, but in the interim is still a helpful measurement of the method's performance.
Sounds reasonable. I suspect there will be a range of edge cases to sort out as this is implemented for more scenarios, though. I also wouldn't be too quick to rule out methods performing worse than random: I've only done a few tests for my benchmark, but I've already seen that in a few cases. In particular, it can happen when you have metrics that measure opposing things and a method optimises for one of them at the cost of the other, while a "random" method is more neutral.
^ This is a good point. If a method performs worse than random on all or most metrics, that would be concerning (and point to either a bug or a very bad method) but if it performs well on other metrics at the same time I'd be inclined to leave it alone.
I guess negative scores don't matter for the ranking, as we can still average them with other metrics before ranking. If the overall metric average is negative, then we have something weird going on, as you suggest.
Just seeing this now. There are two layers I'd like to add:
Some comments on your points:
I think this is a very real situation for competitions. We may need to revisit this when we start to ingest Open Problems competition results to see if it is something we can solve. However, the alternative of not using random and perfect baselines is that you generally have an even larger range of possible values, so the methods end up fighting over even smaller differences.
I think we need to be consistent. Either we rescale all results or none of them. It will be more confusing to users to have to recalibrate what they see. What is the issue with rescaling if there is not a perfect baseline? We may get negative values or values larger than 1, but probably only marginally larger. Is this a problem for users?
I think the risk of a random baseline being pathologically bad, with all methods fighting for hundredths of a decimal point at the top, is bigger than we're acknowledging.
Imho if all methods are scoring extremely similarly compared to the gap between random and perfect performance, that is important information. If we had e.g. classification accuracy on MNIST as a task, you would see very quickly that this task is basically solved / all methods perform more or less equally well.
Is this a problem for users?
Regardless of whether and how any rescaling is done, I think it's definitely valuable to display (or have the option of displaying) the random, perfect and other baseline results alongside the other methods. This lets users see how methods compare to different baselines (e.g. majority vote vs. random labels in label projection).
For example, in the NeurIPS 2021 pre-competition paper the Positive Controls (PC) would perform better than the Baseline methods (B), which would in turn perform better than the Negative Controls (NC).
(This visualisation is not ideal because it was trying to display many things at the same time)
I like the idea of rescaling the metrics to be able to compare across datasets and across metrics in a meaningful way. I'm afraid that the perfect solution as a max threshold and the worst random method as a min cutoff is problematic in a few ways:
As mentioned earlier, a perfect solution doesn't exist for some of the tasks. For example in subfigure (c) in the image above, or in the dimensionality reduction task, there is no way of defining a positive control. Trying to find a solution which performs well on all of the metrics is similar to solving the task itself.
Allowing the normalised values to be negative when a method is worse than the worst random method, and >1 when it's better than the perfect method (??), is very misleading. Nobody expects to see an F1 score of -0.03; it makes me think that there is a bug in one of the components.
I would propose always displaying the raw values on the website, at least for the time being. I like being able to toggle between raw and rescaled, but only displaying the rescaled values is very confusing. At any rate, the rescaled values are only really necessary when comparing across datasets or averaging metrics to get a final ranking.
How about we get a set of raw results from our current benchmarks, create a separate repo, and try different rescaling / normalisation methods to be able to discuss and compare the different approaches in a meaningful way?
In the benchmark for trajectory inference methods, we performed normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation. Example:
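(A minimal sketch of that kind of transformation, assuming per-metric, per-dataset z-scores passed through a standard logistic sigmoid; not necessarily the exact implementation used in that benchmark.)

```python
import numpy as np

def sigmoid_normalise(values):
    """Z-score the raw values of one metric on one dataset, then squash
    through a logistic sigmoid so every result lands in (0, 1)."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return 1.0 / (1.0 + np.exp(-z))

sigmoid_normalise([0.2, 0.5, 0.9])  # values strictly between 0 and 1
```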
Cons:
Pros:
in the dimensionality reduction task, there is no way of defining a positive control
The positive control here is just the unreduced data.
Trying to find a solution which performs well on all of the metrics is similar to solving the task itself.
Except we can cheat :) See above. Also, we don't need a single baseline to perform well on all metrics. Discussed above.
Nobody expects to see an F1 score of -0.03
This is an argument in favour of clear communication that the metrics are scaled, not necessarily to not display the scaled metrics. Having everything range from ~0 to ~1 gives people an idea of how good a score is. Yes, F1 is well understood and most people know what an F1 score of 0.8 means, but how good is a "trustworthiness" score of 0.8? Putting everything on a common scale makes it more interpretable, not less.
have the option of displaying the random, perfect and baseline results alongside the other methods
I like this, and I think we talked about doing it already -- just requires some kind of flag on the website, since otherwise the task summary will show the perfect baseline as the best performing method on every dataset.
might not be meaningful, since the actual range of metric values in a given benchmark might be very narrow.
This has also been discussed above. If the range of metric values is very narrow, this is important information -- if all methods score between 0.5 and 0.501 (scaled), this tells us that all methods are extremely similar and we shouldn't pay too much attention to the differences between them on this metric.
normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation
This implies that the best performing method is "perfect", which I think is quite misleading. It also means the scaling will change as we add more methods, which I don't love.
normalisation by computing the mean and SD for every metric per dataset and applying a sigmoid transformation
I would also argue that working with baselines is better than scaling by method results. We did the latter for our data integration benchmark, which meant that overall rankings could change if a new method was added (and this actually happened occasionally). I'm not sure if negative and positive baselines perfectly define the active range of a metric, but at least it's more stable.
have the option of displaying the random, perfect and baseline results alongside the other methods
I agree this should be the default.
Allowing the normalised values to be negative when it's worse than the worst random method and >1 when it's better than the perfect method (??) is very misleading.
I actually think that if we can clarify how the values were scaled that this would be a very interpretable solution. What I worry about more is that having very poorly performing methods on some metric would up-weight that metric compared to others in aggregation... but maybe that's actually okay.
How about we get a set of raw results from our current benchmarks, create a separate repo, and try different rescaling / normalisation methods to be able to discuss and compare the different approaches in a meaningful way?
We have to rank results anyway, so we already need some form of scaling now. We can test different approaches, but I would keep one version of scaled data in the results as they are displayed. A toggle between raw and scaled has been suggested several times now, which we should do. Until we have that... I guess the only question is which version of the results to display now. Maybe raw is a good idea while we fix the communication parts. However, that may also confuse people if they don't understand the weighting and thus the ranking.
TBH I think the toggle is priority #1.
Yeah, you may be right. The website looks pretty good already... further prettification is not as important as being able to fully understand the results.
I'm working on rendering raw values in openproblems-bio/website#58.
However, while going through the results, I see some strange effects of the scaling. Namely, there are a lot of values which lie outside the [0, 1] range:
| (-1e+03,-100] | (-100,-10] | (-10,-1] | (-1,0] | (0,1] | (1,10] | (10,100] |
|---|---|---|---|---|---|---|
| 5 | 10 | 12 | 131 | 739 | 535 | 20 |
A histogram per task and per metric gives a little more insight:
If we order the summary statistics by the max absolute value, we can see that the tasks most affected by this are the ones which don't have a good ground-truth method to fall back on.
task_id | metric_id | min | q25 | q50 | q75 | max |
---|---|---|---|---|---|---|
spatial_decomposition | r2 | -915.892 | -4.229 | 0.093 | 0.527 | 0.890 |
dimensionality_reduction | local property | -9.519 | 0.807 | 1.581 | 2.969 | 92.007 |
dimensionality_reduction | co-KNN AUC | 0.852 | 2.074 | 3.080 | 10.301 | 46.493 |
dimensionality_reduction | continuity | 0.618 | 1.551 | 2.995 | 6.736 | 19.824 |
denoising | Poisson loss | -9.382 | -0.039 | 0.256 | 0.542 | 0.976 |
batch_integration_embed | Isolated label Silhouette | -0.903 | 1.204 | 2.118 | 2.916 | 7.820 |
dimensionality_reduction | co-KNN size | 0.752 | 1.044 | 1.277 | 1.873 | 5.442 |
dimensionality_reduction | local continuity meta criterion | 0.752 | 1.044 | 1.277 | 1.873 | 5.442 |
batch_integration_embed | Silhouette | 0.650 | 1.308 | 1.617 | 2.674 | 5.084 |
batch_integration_graph | ARI | 1.485 | 2.320 | 2.636 | 4.231 | 4.422 |
batch_integration_graph | Isolated label F1 | 0.072 | 0.804 | 0.904 | 1.039 | 4.260 |
dimensionality_reduction | global property | -0.028 | 0.396 | 0.724 | 1.511 | 2.195 |
batch_integration_graph | NMI | 0.568 | 1.168 | 1.282 | 1.468 | 1.568 |
batch_integration_embed | Batch ASW | -0.435 | 0.517 | 0.770 | 0.998 | 1.474 |
batch_integration_graph | Graph connectivity | 0.788 | 1.207 | 1.235 | 1.297 | 1.331 |
batch_integration_embed | Cell Cycle Score | 0.148 | 0.836 | 0.981 | 1.156 | 1.277 |
dimensionality_reduction | root mean squared error | -0.004 | -0.003 | 0.001 | 0.080 | 1.180 |
batch_integration_embed | PC Regression | 0.000 | 0.000 | 0.302 | 0.892 | 1.001 |
label_projection | Accuracy | -0.031 | 0.625 | 0.830 | 0.935 | 0.985 |
label_projection | F1 score | -0.054 | 0.518 | 0.834 | 0.929 | 0.985 |
label_projection | Macro F1 score | -0.039 | 0.250 | 0.570 | 0.775 | 0.951 |
dimensionality_reduction | trustworthiness | 0.279 | 0.467 | 0.732 | 0.779 | 0.933 |
dimensionality_reduction | density preservation | -0.344 | 0.021 | 0.308 | 0.659 | 0.887 |
batch_integration_feature | HVG conservation | -0.875 | -0.476 | -0.062 | 0.184 | 0.429 |
cell_cell_communication_source_target | Precision-recall AUC | -0.019 | 0.005 | 0.022 | 0.036 | 0.499 |
multimodal_data_integration | Mean squared error | -0.024 | -0.010 | 0.020 | 0.084 | 0.417 |
multimodal_data_integration | kNN Area Under the Curve | -0.019 | 0.003 | 0.013 | 0.042 | 0.312 |
denoising | Mean-squared error | 0.030 | 0.165 | 0.184 | 0.276 | 0.304 |
cell_cell_communication_ligand_target | Precision-recall AUC | 0.028 | 0.043 | 0.049 | 0.054 | 0.182 |
cell_cell_communication_ligand_target | Odds Ratio | 0.000 | 0.000 | -0.000 | 0.000 | 0.000 |
cell_cell_communication_source_target | Odds Ratio | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
In general, I think it's a good idea to let benchmarking pipelines only return raw results (e.g. a TSV containing the task_id, method_id, dataset_id, metric_id, value). Adding the metadata (method labels, metric labels, dataset labels, etc) and scaling the values should only happen afterwards, so that we can make modifications to those downstream steps without having to rerun the whole pipeline.
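As an illustration of that separation (the file name and the simple min-max scaling are placeholders, not the actual pipeline):

```python
import pandas as pd

# Raw, unscaled results as emitted by the benchmarking pipeline.
# columns: task_id, method_id, dataset_id, metric_id, value
raw = pd.read_csv("results_raw.tsv", sep="\t")

# Scaling (and joining of labels/metadata) happens downstream, so it can be
# changed without rerunning the benchmarks. Min-max per task/dataset/metric:
grouped = raw.groupby(["task_id", "dataset_id", "metric_id"])["value"]
raw["scaled"] = (raw["value"] - grouped.transform("min")) / (
    grouped.transform("max") - grouped.transform("min")
)
```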
@scottgigante Could you store the raw results somewhere so we can get a better grasp on how to solve this? Could you also add entries for the control methods?
I don't have time to reply in full, but:
Thanks for the quick response, I'll take a look :)
I think these two plots shed some light on cases where the baselines are having issues.
- `isolated_labels_sil`, `silhouette`, `silhouette_batch`, `cc_score`
- `graph_connectivity`, `nmi`, `ari`, `isolated_labels_f1`
- `odds_ratio`: positive control has a pos inf score
- `odds_ratio`, `auprc`: (almost) none of the methods score between the [neg, pos] range
- `auprc`: (almost) none of the methods score between the [neg, pos] range
- `poisson`: at least one method performs much better than the positive control
- `qnn`, `qglobal`, `lcmc`, `continuity`, `qnn_auc`, `qlocal`: methods perform much better than the positive control
- `pearson_correlation`, `spearman_correlation`: methods perform much worse than the negative control

This is also what I ran into when running the NeurIPS2021 pilot: I couldn't find a positive control method for the embedding task (figure). More detailed results in the pdfs one directory higher.
I think it might be a good idea to annotate baseline methods to specify whether they are negative controls, positive controls, or only intended to be positive controls.
I also think that it will be impossible to define good positive controls for tasks in which there is no ground truth information available (such as denoising and anything to do with embedding). As such, scaling using the [min, max] of baseline results might never really work properly.
Summary of possible scaling approaches discussed so far:
- min-max scaling using the negative and positive control (baseline) results
- min-max scaling over the results of all methods
- z-scoring per metric and dataset followed by a sigmoid transformation
- rank-based aggregation

Others?
In case it's useful, for a benchmarking project I am working on we are going to use a default method as the positive control rather than trying to design a perfect synthetic method. I think this has a couple of advantages: 1) as we have seen, it's difficult to predict how metrics behave, so using a real method should be more similar to other real methods; 2) it makes the scores a bit more interpretable because you are comparing to something most people are hopefully familiar with. It does have some downsides though: 1) it's a bit subjective how to choose what the default method is (especially for newer/less common tasks); 2) you are still relying on the default method performing relatively well on all the metrics, which might not always be the case.
Yeah, I lean away from trying to find a single method that performs well on all metrics, especially in tasks like batch integration where some metrics are in direct contradiction with others.
Some of these issues are already resolved. New results available at https://deploy-preview-60--openproblems-sca.netlify.app/benchmarks/ (raw values at https://github.com/openproblems-bio/openproblems/actions/runs/3550251361)
Just catching up on discussions. Is this still an issue? How many adequate baselines are we missing?
In general, I quite like the baseline-based metric scaling. Method-distribution-based scaling relies on methods capturing the range of meaningful values, which they probably don't as they all try to solve the problem and don't have a poor performance baseline. However, if generating a perfect baseline is challenging, are we increasing the entry barrier for new tasks?
The dimensionality reduction and batch integration embedding tasks are still a major issue.
Other tasks have minor issues, see here. You can hover over the rows in the table to see more info.
The batch integration embedding issue is just PCR on ComBat-scaled values not working for some reason here. This is not terrible... we just get two NaNs. The isolated label silhouette positive baseline not being good enough is strange though, as I think @scottgigante-immunai had a one-hot encoding of the labels that should give perfect results there... it's not terrible though.
The dimensionality reduction had a fix for better baselines in #712. Has the benchmark been run since then?
It has. The cause of the problem is https://github.com/openproblems-bio/openproblems/issues/756. The solution to the benchmarks problem is in https://github.com/openproblems-bio/openproblems/pull/760.
Taking the average of ranks is advised against for biomedical image analysis competitions (https://doi.org/10.1038/s41467-018-07619-7). From the paper:
Originally posted by @LuckyMD in https://github.com/openproblems-bio/openproblems/issues/566#issuecomment-1246925162