openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License

[Dimensionality reduction] Metrics implicitly prefer specific normalizations #756

Closed · scottgigante-immunai closed this issue 1 year ago

scottgigante-immunai commented 1 year ago

The dimensionality reduction metrics implicitly pick winners among normalization methods. E.g. all of the nn_ranking.py metrics compute the ground truth on logCPM, 1kHVG data -- so any method that normalizes the data this way will perform better. Trustworthiness, RMSE and density reward preserving the structure of the unnormalized data.

Given that the metrics compute the ground truth from the input data, I think we might want to fix a single standard normalization in this task.
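
To make the concern concrete, here is a minimal sketch (illustrative only, not the actual nn_ranking.py code) of how a metric that rebuilds its reference from a fixed logCPM + 1k-HVG recipe ends up favouring methods that embed exactly that representation. The `counts` layer and `X_emb` slot are assumptions about where the raw counts and the embedding live.

```python
import scanpy as sc
from sklearn.manifold import trustworthiness


def ground_truth_matrix(adata):
    """Rebuild the fixed reference: logCPM on the top 1k HVGs (assumes raw counts in .layers["counts"])."""
    gt = adata.copy()
    gt.X = gt.layers["counts"].copy()
    sc.pp.normalize_total(gt, target_sum=1e6)                        # CPM
    sc.pp.log1p(gt)                                                  # logCPM
    sc.pp.highly_variable_genes(gt, n_top_genes=1000, subset=True)  # 1k HVGs
    return gt.X


def neighborhood_metric(adata, n_neighbors=15):
    # A method that embedded logCPM/1k-HVG data reproduces these
    # neighbourhoods almost by construction, so it scores higher.
    return trustworthiness(ground_truth_matrix(adata), adata.obsm["X_emb"], n_neighbors=n_neighbors)
```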

lazappi commented 1 year ago

I'm not quite clear on exactly what the cause of the problem here is. Is it that the metrics themselves are inherently biased towards particular normalisations? Or is it that, the way they are implemented now, one normalisation is used to compute the ground truth and therefore any method that uses that normalisation performs better? Maybe, rather than limiting methods to one normalisation, we should make sure that metrics are calculated on the same normalisation the method uses?

I think limiting to one normalisation for methods could be problematic in the long run. Apart from any biases there could be methods that require different inputs.

The metrics for this task are also fairly limited at the moment; expanding them would go some way to helping as well (but doesn't solve this problem).

scottgigante-immunai commented 1 year ago

The issue is the former: most of these metrics pick a normalisation on which they define the ground truth, so they implicitly bias towards that same normalisation.

I like the idea of using the same normalisation the method uses. If we do that, we should add a .layers["normalized"] to the API.
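
A rough sketch of what that could look like (hypothetical function names; `.layers["normalized"]` and `.obsm["X_emb"]` are the proposed conventions, not current API): the method records the matrix it actually embedded, and the metric uses that layer as its ground truth instead of a fixed recipe.

```python
import scanpy as sc
from sklearn.manifold import trustworthiness


def method_logcpm_pca(adata):
    # Example method: logCPM normalization followed by 2D PCA.
    sc.pp.normalize_total(adata, target_sum=1e6)
    sc.pp.log1p(adata)
    adata.layers["normalized"] = adata.X.copy()   # record what was embedded
    sc.pp.pca(adata, n_comps=2)
    adata.obsm["X_emb"] = adata.obsm["X_pca"]
    return adata


def metric_on_method_normalization(adata, n_neighbors=15):
    # The ground truth is whatever representation the method itself embedded.
    return trustworthiness(adata.layers["normalized"], adata.obsm["X_emb"], n_neighbors=n_neighbors)
```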

scottgigante-immunai commented 1 year ago

An issue with this that I'm not quite sure how to deal with: if a method uses HVGs, it is reducing the dimension of the "ground truth" data, which intrinsically makes the problem easier (e.g. 1000 -> 2 dimensions instead of 20000 -> 2.) A method could give itself trivially good results by reducing to fewer genes before embedding.
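
Under the "use the method's own normalisation" scheme sketched above, the loophole would look roughly like this (a hypothetical method, for illustration only): aggressive gene subsetting shrinks the reference space, which inflates neighbourhood-preservation scores without the embedding getting any better.

```python
import scanpy as sc


def method_gamed(adata):
    sc.pp.normalize_total(adata, target_sum=1e6)
    sc.pp.log1p(adata)
    # Extreme gene subsetting: the 50-dimensional "ground truth" left behind
    # is trivially easy for a 2D embedding to preserve.
    sc.pp.highly_variable_genes(adata, n_top_genes=50, subset=True)
    adata.layers["normalized"] = adata.X.copy()
    sc.pp.pca(adata, n_comps=2)
    adata.obsm["X_emb"] = adata.obsm["X_pca"]
    return adata
```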

@rcannood discussed including biological information as metrics. I'm starting to think that is the right call. E.g. how much does a method keep cell types together, how well is cell cycle preserved. That way, if a method trivially boosts its scores by removing informative genes, it would suffer in biological scores. Could share a lot of ideas with batch_integration_embed.
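
For example, a label-based metric along those lines could score how often a cell's nearest neighbours in the embedding share its cell type. A minimal sketch (assumes annotations in `adata.obs["cell_type"]`; not an existing openproblems metric):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def celltype_neighborhood_purity(adata, n_neighbors=15):
    labels = adata.obs["cell_type"].to_numpy()
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(adata.obsm["X_emb"])
    _, idx = nn.kneighbors(adata.obsm["X_emb"])
    neighbor_labels = labels[idx[:, 1:]]            # drop each cell itself
    return float(np.mean(neighbor_labels == labels[:, None]))
```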

scottgigante-immunai commented 1 year ago

Apart from any biases there could be methods that require different inputs.

As in batch_integration, we would still provide the raw counts and allow devs to use them if they think they can do better. But the justification from batch integration:

WARNING: unlike most tasks, adata.X should contain log-normalized data. This is the case as we are comparing the results of integration on the features pre- and post-integration and the data comes from different technologies. In this subtask, we are computing a pre-integration embedding on the normalized features. For UMI data, the data is scran-normalized; full-length data is TPM-normalized.

I think we are in a similar situation here -- we are comparing pre- and post-dimensionality reduction neighborhoods / variances / distances, which implies a "correct" normalization. Allowing devs to change the ground truth seems dangerous.
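
Concretely, a distance-based pre/post metric looks something like the sketch below (generic, not the existing RMSE implementation): whichever normalization produced the "before" matrix it is handed is baked into the score as the ground truth.

```python
import numpy as np
from scipy.spatial.distance import pdist


def distance_rmse(X_high, X_emb):
    """RMSE between pairwise distances before and after reduction (dense inputs assumed)."""
    d_high = pdist(X_high)
    d_emb = pdist(X_emb)
    # Scale the embedding distances onto the high-dimensional scale before comparing.
    scale = np.sum(d_high * d_emb) / np.sum(d_emb ** 2)
    return float(np.sqrt(np.mean((d_high - scale * d_emb) ** 2)))
```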

scottgigante-immunai commented 1 year ago

Copying @danielStrobl and @LuckyMD for visibility.

lazappi commented 1 year ago

An issue with this that I'm not quite sure how to deal with: if a method uses HVGs, it is reducing the dimension of the "ground truth" data, which intrinsically makes the problem easier (e.g. 1000 -> 2 dimensions instead of 20000 -> 2.) A method could give itself trivially good results by reducing to fewer genes before embedding.

Yes, that is a good point. I have seen that this is definitely the case for some metrics in other settings.

@rcannood discussed including biological information as metrics. I'm starting to think that is the right call. E.g. how much does a method keep cell types together, how well is cell cycle preserved. That way, if a method trivially boosts its scores by removing informative genes, it would suffer in biological scores. Could share a lot of ideas with batch_integration_embed.

Yeah, I agree this would be a good idea. It came up a bit in the jamboree but at the time we decided not to require cell labels to make selecting datasets easier (I don't think that's a major limitation though). Just need to make sure we keep this a bit separate from the integration task (but again should be doable).

I think we are in a similar situation here -- we are comparing pre- and post-dimensionality reduction neighborhoods / variances / distances, which implies a "correct" normalization. Allowing devs to change the ground truth seems dangerous.

I think that's a good point. There are some reasons why I don't like this, which I will list below, but I think it might be necessary in order to preserve the "integrity" of the benchmark. The alternatives have already been mentioned and have their own problems/limitations: use the same normalisation for the ground truth as the method, (maybe) use raw counts for the ground truth, or avoid the problem by only using metrics that don't rely on before/after comparisons. I quite like the last one in theory, but it might be quite limiting in the kinds of metrics you could use.

Reasons for not using a standard normalisation:

Whatever is decided, I think this is something we should try to keep consistent across tasks.

LuckyMD commented 1 year ago

Thanks for roping me in @scottgigante-immunai. I think the best way forward would be to use normalization layers in the dataset as in batch integration, as Scott suggested. One could then also add multiple datasets with different "ground truth" normalizations or even use the same dataset twice like this. That would enable comparisons as well, but still keep the integrity of the benchmark.

In general, it is important to have this normalization issue fixed before we submit the manuscript. Would you be able to add normalization layers per dataset in the near future @lazappi? I guess, as you're only back next week, the earliest would be the end of that week? This would affect the results and would then require revisiting the text as well.

lazappi commented 1 year ago

I can maybe get to it next week (if no one else does before then). I would like to confirm that 1) there is general agreement that making normalisation part of the data loader is the way to go and 2) this will be consistent across tasks (or there is a good reason why it can't be). Possibly you discussed this in a meeting already but wanted to confirm.

I knew the results weren't stable so there isn't anything in the text yet anyway (in the small bit I wrote at least).

scottgigante-immunai commented 1 year ago

I think the rule we can be consistent on is "if there exist metrics that are computed on normalized data, then dataset functions should provide this normalized data". Otherwise, I think the principle of not wanting to prescribe normalization methods should stand.
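
A sketch of a dataset function honouring that rule (the layer names follow the `.layers` convention discussed above and are assumptions, not current openproblems code):

```python
import scanpy as sc


def load_dataset_with_normalization(adata_raw):
    adata = adata_raw.copy()
    adata.layers["counts"] = adata.X.copy()        # raw counts stay available to methods
    sc.pp.normalize_total(adata, target_sum=1e6)   # CPM
    sc.pp.log1p(adata)                             # logCPM as the task's reference normalization
    adata.layers["normalized"] = adata.X.copy()    # what normalization-aware metrics use
    adata.X = adata.layers["counts"].copy()        # methods still start from raw counts
    return adata
```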

LuckyMD commented 1 year ago

if there exist metrics that are computed on normalized data, then dataset functions should provide this normalized data

I agree with this, and it is broadly in line with what we discussed at the meeting yesterday.