Disable doublet analysis

schultzmattd commented 5 years ago

For one of our use cases, we use demuxlet to compare a single-cell RNA-seq data set to a large number of samples in a VCF. In this instance, we don't care about doublet assignments; we just want to find which cells are singlets and which sample each one most likely comes from. Unfortunately, we run into memory issues when demuxlet tries to find doublets, as there are so many pairs of possible samples. There doesn't seem to be an option to avoid this OOM crash (i.e., to skip the doublet search). If one doesn't exist, would it be possible to implement a feature like this? I am happy to try myself and submit a PR, but it's not clear to me where in the codebase such a change would go. Any other tips for VCF files that have a large number of samples would also be appreciated! Thanks in advance.

hyunminkang commented 5 years ago

How many cells, SNPs, and individuals are you considering? The doublet search may not be what is causing the memory errors, so I wanted to make sure.

Hyun.

Hyun Min Kang, Ph.D. Associate Professor of Biostatistics University of Michigan, Ann Arbor Email : hmkang@umich.edu

schultzmattd commented 5 years ago

Thanks so much for the quick reply, Hyun. I didn't realize a colleague of mine had already raised the same request in this issue (https://github.com/statgen/demuxlet/issues/37), where he listed how many cells and SNPs are involved:

~10k cells, ~50 samples (yes, much), ~500k SNPs in my case, memory is ~32 Gb.
(and it worked with ~10k SNPs flawlessly)

hyunminkang commented 5 years ago

Does it work with a smaller number of samples? I just wanted to make sure that the issue is doublet detection.

Also, https://github.com/statgen/popscle can run demuxlet too, and I suspect that this may result in a lower memory footprint, although the preprocessing step may consume quite a bit of memory.

Thanks, Hyun.

schultzmattd commented 5 years ago

Yep, we're able to run the workflow on smaller subsets of samples. Not sure exactly where the breakpoint is, but we've run it successfully with that SNP set on that number of cells with 6-8 individuals.
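As a rough illustration of why the number of samples matters here, the sketch below counts how many candidate identities a cell could be tested against if every unordered pair of samples is considered as a possible doublet. This is only arithmetic on the sample counts mentioned in this thread; it is an assumption that the search enumerates all pairs, and it says nothing about how much memory demuxlet actually allocates per pair. The function name `candidate_models` is hypothetical.

```python
from math import comb


def candidate_models(n_samples: int) -> dict:
    """Count singlet and doublet-pair hypotheses for n_samples genotyped samples.

    Assumes (for illustration only) that every unordered pair of distinct
    samples is a candidate doublet.
    """
    return {
        "singlets": n_samples,
        "doublet_pairs": comb(n_samples, 2),  # n * (n - 1) / 2
    }


# Sample counts taken from the thread: 6-8 individuals worked, ~50 did not.
for n in (6, 8, 50):
    print(n, candidate_models(n))
```

The pair count grows quadratically with the number of samples, so going from 8 to 50 samples multiplies the number of doublet hypotheses by well over an order of magnitude, while the number of singlet hypotheses grows only linearly.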

VincentGardeux commented 4 years ago

Fix #59 would resolve the memory issue. We tested it on ~50 genotypes and ~5M SNPs, and it runs without an OOM crash.

jamesnemesh commented 4 years ago

I'm interested in disabling doublet analysis for a different reason: errors in pool construction.

Let's say your lab has 200 available samples to pool, and you select a set of 50 for your next pool (we run pools of over 100 samples, so this is a pretty trivial number). You have the expected set of samples, but you'd like to re-identify all of the cells without prior bias, so that a contamination event or a label/plate swap can be detected. There's no need to identify doublets at this stage, as you only want to assess which samples are significant contributors to the pool, i.e., all samples that have more cells than the expected assignment error rate would explain.

Once you have that list, you can correct your sample list to the correct set of samples, and then run doublet detection on that set.
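A minimal sketch of that first pass, assuming you already have a singlet-only assignment of each cell barcode to its most likely sample. The input format, the function name `significant_samples`, and the 5% error rate are all hypothetical and chosen only for illustration; this is not demuxlet's output format or API.

```python
from collections import Counter


def significant_samples(assignments: dict, error_rate: float) -> list:
    """Return samples with more assigned cells than the error rate would explain.

    assignments: maps cell barcode -> most likely sample ID (hypothetical input).
    error_rate:  assumed fraction of cells that may be misassigned by chance.
    """
    counts = Counter(assignments.values())
    threshold = error_rate * len(assignments)
    return sorted(sample for sample, n in counts.items() if n > threshold)


# Toy data: 60 cells assigned to S1, 38 to S2, and 2 stray cells to S3.
toy = {f"cell{i}": "S1" for i in range(60)}
toy.update({f"cell{i}": "S2" for i in range(60, 98)})
toy.update({f"cell{i}": "S3" for i in range(98, 100)})

print(significant_samples(toy, error_rate=0.05))
```

With these toy numbers the threshold is 5 cells, so S3 is dropped as likely assignment noise. The surviving list could then be used as the corrected sample list for a second run that includes doublet detection.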

Is there a way to effectively split up processing? If not, do you detect sample swap errors by running without the --sm-list argument, and then doublet detection runs on all available sample pairs, even though some may not be in the pool? Thanks for your help.

hyunminkang commented 4 years ago

I think it is straightforward to disable doublet analysis, or to speed up the doublet detection process in some cases. We will do this in statgen/popscle, as that package should have demuxlet implemented and is more actively maintained.

Thanks, Hyun.

