Closed: scottgigante-immunai closed this 1 year ago
Do you think it's fine to randomly select points or do we need to use some fancier kind of subsampling to make sure we cover the whole embedding?
Two options here -- either subsample enough cells randomly that we feel confident the space is covered, or get fancy and use approximate nearest neighbors to do a density-based subsampling. I'd be happy with either.
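To make the second option concrete, here's a rough sketch of what density-based subsampling could look like: estimate local density from the distance to the k-th nearest neighbor, then sample with probability inversely proportional to density so sparse regions of the embedding stay covered. This uses brute-force pairwise distances for simplicity; at scale you'd swap in an approximate nearest neighbors library. The function name and defaults are illustrative, not anything we have implemented.

```python
import numpy as np

def density_subsample(X, n_samples, k=10, seed=0):
    """Sample row indices of X with probability ~ 1/density.

    Density is estimated from the distance to the k-th nearest
    neighbor: points in sparse regions have a large k-NN radius and
    therefore a higher chance of being kept, so the tails of the
    embedding are covered even with a small subsample.
    """
    rng = np.random.default_rng(seed)
    # Brute-force pairwise squared distances -- fine for a sketch,
    # replace with approximate nearest neighbors for large datasets.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Distance to the k-th neighbor (column 0 is the point itself).
    radius = np.sqrt(np.sort(d2, axis=1)[:, k])
    # Volume of the k-NN ball scales like radius^d, i.e. ~ 1/density.
    weights = radius ** X.shape[1]
    weights /= weights.sum()
    return rng.choice(X.shape[0], size=n_samples, replace=False, p=weights)
```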
Or add new methods for subsampling like sphetcher or geometric sketching.
Honestly I don't think it really matters which method we use, so long as it's memory efficient and relatively fast.
This is kinda what I was thinking. I would probably be happy with random sampling as long as the number of cells is still reasonably high, but we could probably get away with fewer cells with fancier sampling. I guess it depends on the tradeoff between slower/more complex sampling with fewer cells vs. simple sampling with more cells and longer metric times (assuming the score accuracy is ~the same).
imho I think 10k cells is probably enough for the zebrafish dataset for now, and this is a better solution than what we currently have. @lazappi are you comfortable approving this for now and opening an issue to propose improving the subsampling?
Patch coverage: 96.45% and project coverage change: -0.15% :warning:
Comparison is base (9d16650) 95.60% compared to head (988279b) 95.45%.
We don't need to compute co-ranking on the full dataset -- once it's embedded, we can just use a subsample.
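For the simple option, subsampling for the co-ranking metric just needs the same random rows taken from both the original data and its embedding so the two rankings stay matched. A minimal sketch, assuming `X_high` (cells x features) and `X_low` (cells x embedding dims) share row order; the names and the 10k default are illustrative:

```python
import numpy as np

def subsample_pair(X_high, X_low, n_samples=10_000, seed=0):
    """Take the same random subset of rows from the high-dimensional
    data and its low-dimensional embedding.

    If the dataset is already smaller than n_samples, return it
    unchanged.
    """
    rng = np.random.default_rng(seed)
    n = X_high.shape[0]
    if n <= n_samples:
        return X_high, X_low
    idx = rng.choice(n, size=n_samples, replace=False)
    return X_high[idx], X_low[idx]
```

The co-ranking matrix would then be computed on the returned pair instead of the full dataset.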