openproblems-bio / openproblems

Formalizing and benchmarking open problems in single-cell genomics
MIT License
290 stars 77 forks source link

Add cell cycle score baseline #706

Closed scottgigante-immunai closed 1 year ago

scottgigante-immunai commented 1 year ago

cc_score was not well captured by existing baselines. This method should get a perfect score.

LuckyMD commented 1 year ago

Not sure this will get a perfect score, as the metric checks whether the variance contribution has changes per batch from unintegrated data. This will have more variance contribution and therefore perform poorly again. I suggest we just add "unintegrated" as a baseline.

scottgigante-immunai commented 1 year ago

Valid point, but unintegrated is already there and performs poorly. I'll look at the paper and figure out what this should look like.

Re: loaders, I don't know if we want to enforce this on all loaders yet (but we could!)

On Mon, 28 Nov 2022, 7:22 am MalteDLuecken, @.***> wrote:

@.**** requested changes on this pull request.

As mentioned, I don't think this will get a score of 1 as it increases CC variance contribution (I think we also penalize for that). Also, should organism be added on the level of data loader or dataset function?

— Reply to this email directly, view it on GitHub https://protect.checkpoint.com/v2/___https://github.com/openproblems-bio/openproblems/pull/706%23pullrequestreview-1195712548___.YzJlOmltbXVuYWk6YzpnOjE4NDEwOTAxZjE2NzM5YzhlZGU0NWM2NDhlY2ZmMWMyOjY6NTM3OTplZmZlM2JjY2IzNzc5YTE5MjA1NzMxM2E4NTE0ZTM5MTJmMjUxYTI4MzczMDQyM2U5Y2ExZjYyMjU4ZDI2ZDAyOmg6VA, or unsubscribe https://protect.checkpoint.com/v2/___https://github.com/notifications/unsubscribe-auth/AUHCMAS62PWC56GOJR7ZN5DWKSPWFANCNFSM6AAAAAASME4SSU___.YzJlOmltbXVuYWk6YzpnOjE4NDEwOTAxZjE2NzM5YzhlZGU0NWM2NDhlY2ZmMWMyOjY6ODU0OToyOGFkYzQ0MDNhODU1NTJkMjlkYzI2ZjMxYTNiMDU4MWI3ODE3ODRhNGNlYjA4YmU5MmVmNjE5MjQwYTYzMWEzOmg6VA . You are receiving this because you authored the thread.Message ID: @.***>

-- PLEASE NOTE: The information contained in this message is privileged and confidential, and is intended only for the use of the individual to whom it is addressed and others who have been specifically authorized to receive it. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, or if any problems occur with the transmission, please contact the sender.

LuckyMD commented 1 year ago

Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

LuckyMD commented 1 year ago

Re: loaders, I don't know if we want to enforce this on all loaders yet (but we could!)

We don't have to enforce on all loaders yet, but it does seem to me that it's something that is part of adding a dataset to open problems. Ideally, if someone adds a dataset, then it should be re-usable by other tasks without someone having to research what this actually is. If we use CELLxGENE more frequently in future, this field will also exist there. We could even borrow from their schema here if we want. Looks like i'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

scottgigante-immunai commented 1 year ago

Okay, unintegrated shouldn't be bad... by definition this should work. Maybe @danielStrobl can take a look at this?

Ah, I see. by definition here:

CCconservation=1−|Varafter−Varbefore|/Varbefore

unintegrated should get a perfect score... if you look at https://openproblems-nbt2022-reproducibility.netlify.app/results/batch_integration_embed/ it gets a score of 0.75, compared to scanorama, harmony, combat that score closer to 0.9.

scottgigante-immunai commented 1 year ago

Looks like i'm arguing myself into the position of "it should be used by all data loaders"... hmm... maybe do this in steps? First for these datasets, then for the rest in future.

I agree. Ultimately it should be something that every dataset defines, but we don't need it right now.

scottgigante-immunai commented 1 year ago

I think something is wrong with the cc_score metric.

>>> import openproblems
>>> adata = openproblems.tasks.batch_integration_embed.datasets.immune_batch()
>>> adata_combat = openproblems.tasks.batch_integration_embed.methods.combat_hvg_unscaled(adata.copy())
>>> adata_baseline = openproblems.tasks.batch_integration_embed.methods.no_integration(adata.copy())
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_combat)
0.8884892448780356
>>> openproblems.tasks.batch_integration_embed.metrics.cc_score(adata_baseline)
0.7510949842852523

if I call directly from scib:

>>> from scib.metrics import cell_cycle
>>> adata_baseline.obsm["X_pca"] = adata_baseline.obsm["X_uni_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

or if I compute the PCA myself:

>>> import scanpy as sc
>>> sc.pp.pca(adata)
>>> adata.obsm["X_emb"] = adata.obsm["X_pca"]
>>> cell_cycle(adata_baseline, adata_baseline, "batch", embed="X_emb", organism="human")
0.7510949842852523

cell_cycle recomputes PCA (see https://github.com/theislab/scib/blob/f0be8267256427e307c5979f4d20dc3e5dc33d04/scib/metrics/cell_cycle.py#L175) even if PCA is already computed. Could this be the cause?

LuckyMD commented 1 year ago

I don't see why recomputing should be the issue tbh... As long as it's using the embedding. I think we decided to recompute X_pca from X_emb as sometimes X_pca is a remnant of the unintegrated embedding (e.g., in FastMNN or Scanorama), so it makes sense to recompute as you don't know where the existing X_pca comes from.

scottgigante-immunai commented 1 year ago

It scores 1.0 if you don't give it an embedding (i.e., it computes PCA on adata.X for both pre and post.) If you give it X_emb == X_uni_pca, it scores 0.75. This is definitely a bug.

scottgigante-immunai commented 1 year ago

New approach -- simply passing the raw data gives a perfect score, so let's just do that. In theory passing X_uni_pca should be right, but pending a fix from scIB, this will work.

scottgigante-immunai commented 1 year ago

Tests passing at https://tower.nf/orgs/openproblems-bio/workspaces/openproblems-bio/watch/5djcm9XQYGrhkJ

scottgigante-immunai commented 1 year ago

@danielStrobl I assume Malte is gone by now. Mind reviewing this for me?

codecov[bot] commented 1 year ago

Codecov Report

Base: 95.04% // Head: 95.04% // Decreases project coverage by -0.00% :warning:

Coverage data is based on head (d981d42) compared to base (aa22537). Patch coverage: 86.95% of modified lines in pull request are covered.

:exclamation: Current head d981d42 differs from pull request most recent head a69f851. Consider uploading reports for the commit a69f851 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #706 +/- ## ========================================== - Coverage 95.04% 95.04% -0.01% ========================================== Files 154 154 Lines 4073 4093 +20 Branches 207 207 ========================================== + Hits 3871 3890 +19 - Misses 131 132 +1 Partials 71 71 ``` | Flag | Coverage Δ | | |---|---|---| | unittests | `95.04% <86.95%> (-0.01%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/openproblems-bio/openproblems/pull/706?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio) | Coverage Δ | | |---|---|---| | [...egration/batch\_integration\_embed/metrics/\_utils.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9lbWJlZC9tZXRyaWNzL191dGlscy5weQ==) | `100.00% <ø> (+33.33%)` | :arrow_up: | | [...egration/batch\_integration\_graph/methods/\_utils.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9ncmFwaC9tZXRob2RzL191dGlscy5weQ==) | `61.11% <50.00%> (-3.18%)` | :arrow_down: | | [...ration/batch\_integration\_embed/methods/baseline.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9lbWJlZC9tZXRob2RzL2Jhc2VsaW5lLnB5) | `97.29% <87.50%> (-2.71%)` | :arrow_down: | | [.../\_batch\_integration/batch\_integration\_embed/api.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9lbWJlZC9hcGkucHk=) | `100.00% <100.00%> (ø)` | | | [...ration/batch\_integration\_embed/metrics/cc\_score.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9lbWJlZC9tZXRyaWNzL2NjX3Njb3JlLnB5) | `100.00% <100.00%> (ø)` | | | [...gration/batch\_integration\_graph/datasets/immune.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9ncmFwaC9kYXRhc2V0cy9pbW11bmUucHk=) | `100.00% <100.00%> (ø)` | | | [...ation/batch\_integration\_graph/datasets/pancreas.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9ncmFwaC9kYXRhc2V0cy9wYW5jcmVhcy5weQ==) | `100.00% <100.00%> (ø)` | | | [...integration/batch\_integration\_graph/methods/mnn.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9ncmFwaC9tZXRob2RzL21ubi5weQ==) | `100.00% <100.00%> (ø)` | | | [...ation/batch\_integration\_graph/methods/scanorama.py](https://codecov.io/gh/openproblems-bio/openproblems/pull/706/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio#diff-b3BlbnByb2JsZW1zL3Rhc2tzL19iYXRjaF9pbnRlZ3JhdGlvbi9iYXRjaF9pbnRlZ3JhdGlvbl9ncmFwaC9tZXRob2RzL3NjYW5vcmFtYS5weQ==) | `100.00% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=openproblems-bio)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

scottgigante-immunai commented 1 year ago

Achieves perfect performance per https://github.com/theislab/scib/issues/351#issuecomment-1335424554