Closed: scottgigante-immunai closed this 1 year ago
Do you think it's fine to randomly select points or do we need to use some fancier kind of subsampling to make sure we cover the whole embedding?
Two options here -- either subsample enough cells randomly that we feel confident the space is covered, or get fancy and use approximate nearest neighbors to do a density-based subsampling. I'd be happy with either.
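To make the second option concrete, here's a rough sketch of what density-based subsampling could look like: estimate local density from the distance to the k-th nearest neighbor, then sample with probability inversely proportional to density so sparse regions of the embedding stay covered. This uses brute-force pairwise distances for simplicity; at scale you'd swap in an approximate nearest neighbors library. The function name and defaults are illustrative, not anything we have implemented.

```python
import numpy as np

def density_subsample(X, n_samples, k=10, seed=0):
    """Sample row indices of X with probability ~ 1/density.

    Density is estimated from the distance to the k-th nearest
    neighbor: points in sparse regions have a large k-NN radius and
    therefore a higher chance of being kept, so the tails of the
    embedding are covered even with a small subsample.
    """
    rng = np.random.default_rng(seed)
    # Brute-force pairwise squared distances -- fine for a sketch,
    # replace with approximate nearest neighbors for large datasets.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Distance to the k-th neighbor (column 0 is the point itself).
    radius = np.sqrt(np.sort(d2, axis=1)[:, k])
    # Volume of the k-NN ball scales like radius^d, i.e. ~ 1/density.
    weights = radius ** X.shape[1]
    weights /= weights.sum()
    return rng.choice(X.shape[0], size=n_samples, replace=False, p=weights)
```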
Or add new methods for subsampling like sphetcher or geometric sketching.
Honestly I don't think it really matters which method we use, so long as it's memory efficient and relatively fast.
This is kinda what I was thinking. I would probably be happy with random sampling as long as the number of cells is still reasonably high, but we could probably get away with fewer cells with fancier sampling. I guess it depends on the tradeoff between slower/more complex sampling with fewer cells vs. simple sampling with more cells and longer metric times (assuming the score accuracy is ~the same).
imho I think 10k cells is probably enough for the zebrafish dataset for now, and this is a better solution than what we currently have. @lazappi are you comfortable approving this for now and opening an issue to propose improving the subsampling?
Patch coverage: 96.45% and project coverage change: -0.15% :warning:
Comparison is base (9d16650) 95.60% compared to head (988279b) 95.45%.
We don't need to compute co-ranking on the full dataset -- once it's embedded, we can just use a subsample.
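For the simple option, subsampling for the co-ranking metric just needs the same random rows taken from both the original data and its embedding so the two rankings stay matched. A minimal sketch, assuming `X_high` (cells x features) and `X_low` (cells x embedding dims) share row order; the names and the 10k default are illustrative:

```python
import numpy as np

def subsample_pair(X_high, X_low, n_samples=10_000, seed=0):
    """Take the same random subset of rows from the high-dimensional
    data and its low-dimensional embedding.

    If the dataset is already smaller than n_samples, return it
    unchanged.
    """
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    if n <= n_samples:
        return X_high, X_low
    idx = rng.choice(n, size=n_samples, replace=False)
    return X_high[idx], X_low[idx]
```

The co-ranking matrix would then be computed on the returned pair instead of the full dataset.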