statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0

terminate called after throwing an instance of 'std::bad_array_new_length' #73

Open vkartha opened 3 years ago

vkartha commented 3 years ago

Hi! I had run demuxlet successfully before, but am now encountering an error:

NOTICE [2020/10/19 20:10:39] - Processing 7470000 markers...
NOTICE [2020/10/19 20:10:39] - Processing 7480000 markers...
NOTICE [2020/10/19 20:10:39] - Processing 7490000 markers...
NOTICE [2020/10/19 20:10:39] - Processing 7500000 markers...
NOTICE [2020/10/19 20:10:39] - Processing 7510000 markers...
NOTICE [2020/10/19 20:10:39] - Processing 7520000 markers...
NOTICE [2020/10/19 20:10:39] - Identifying best-matching individual..
NOTICE [2020/10/19 20:10:39] - Processing 1000 droplets...
NOTICE [2020/10/19 20:10:39] - Finished processing 1153 droplets total
terminate called after throwing an instance of 'std::bad_array_new_length'
  what():  std::bad_array_new_length
Aborted (core dumped)

My call was as follows (the same one I used before, which worked for a different BAM/VCF combination):

demuxlet --sam ./sample.bam --tag-group DB --field GT --geno-error 0.1 --min-TD 0 --alpha 0.5 --vcf ./hg38_merged_final_filtered.vcf_sorted.vcf --out ./test_demuxlet.out

I haven't seen this error before, and noticed another (perhaps related?) issue that suggested something about memory. Does this point to something similar, or is it different? After running it, I see all 3 output files (.best, .sing2, and .single), but the .best and .sing2 files are empty, presumably because the run was terminated.

Any help would be greatly appreciated!

vkartha commented 3 years ago

Sorry, as a follow-up, here are the QC logs from just before the markers were processed and the error was thrown:

NOTICE [2020/10/19 20:10:31] - Finished reading 7527981 markers from the VCF file
NOTICE [2020/10/19 20:10:31] - Total number input reads : 12252483
NOTICE [2020/10/19 20:10:31] - Total number valid droplets observed : 1153
NOTICE [2020/10/19 20:10:31] - Total number valid SNPs observed : 7527981
NOTICE [2020/10/19 20:10:31] - Total number of read-QC-passed reads : 12214806
NOTICE [2020/10/19 20:10:31] - Total number of skipped reads with ignored barcodes : 0
NOTICE [2020/10/19 20:10:31] - Total number of non-skipped reads with considered barcodes : 11926820
NOTICE [2020/10/19 20:10:31] - Total number of gapped/noninformative reads : 10590479
NOTICE [2020/10/19 20:10:31] - Total number of base-QC-failed reads : 0
NOTICE [2020/10/19 20:10:31] - Total number of redundant reads : 196126
NOTICE [2020/10/19 20:10:31] - Total number of pass-filtered reads : 1140215
NOTICE [2020/10/19 20:10:31] - Total number of pass-filtered reads overlapping with multiple SNPs : 108279
NOTICE [2020/10/19 20:10:31] - Starting to prune out cells with too few reads...
NOTICE [2020/10/19 20:10:31] - Finishing pruning out 0 cells with too few reads...
NOTICE [2020/10/19 20:10:36] - Starting to identify best matching individual IDs

I was testing a pool of 2 samples against a joint reference consisting of 97 samples.

rhart604 commented 3 years ago

I'm also seeing this error, with output identical to vkartha's above. I tried recompiling htslib and demuxlet in case it was an issue with newer compilers, but I get the same error.
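
For reference, recompiling both from source looks roughly like this (a sketch based on the autotools flow in each repo's README; configure options and the expected htslib location may differ on your system):

# build htslib first, since demuxlet links against it
git clone https://github.com/samtools/htslib.git
cd htslib && autoreconf -i && ./configure && make && cd ..
# then build demuxlet itself
git clone https://github.com/statgen/demuxlet.git
cd demuxlet && autoreconf -vfi && ./configure && make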

Any thoughts?

hyunminkang commented 3 years ago

This seems like a memory-related issue. How many variants and individuals are you using?

Hyun.

rhart604 commented 3 years ago

It turns out to be caused by the large number of SNPs in the VCF file. I originally had 9 million SNPs, and that crashed demuxlet. I filtered down to fewer than 2 million and it works now; more than about 2 million triggers the error. Could demuxlet be modified to handle larger VCF files/more SNPs?
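
For anyone hitting the same ceiling, a minimal filtering sketch with bcftools (not from this thread; the file names, the 5% MAF cutoff, and the biallelic-SNP restriction are illustrative assumptions, not a recommendation):

# rough count of sites (non-header lines) in a plain-text VCF
grep -vc '^#' input.vcf

# keep only biallelic SNPs with minor allele frequency >= 5%
bcftools view -m2 -M2 -v snps -q 0.05:minor input.vcf -Oz -o filtered.vcf.gz
bcftools index -t filtered.vcf.gz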

hyunminkang commented 3 years ago

I think it is possible, and I believe there was a pull request that I have not had a chance to merge yet. I cannot promise a timeline, though.

Thanks, Hyun.

VicenteFR commented 2 years ago

Has anyone found a clever way to overcome this issue? Since it is related to memory, I thought that downsampling the variants in the VCF file would help, but I can't seem to find an efficient and safe way to downsample VCF files. If anyone has found a good way to do this, would you please share it?

Thanks in advance!

bdferris642 commented 9 months ago

This was happening to me when I tried demultiplexing with 45M+ SNPs. Demuxlet succeeded with ~6M rows in the VCF. I played around with the numbers to find an upper bound below which demuxlet would not abort; I suspect the limit depends on the memory constraints of your machine, but I'm not sure.

@VicenteFR I don't know whether you're using imputed SNPs or genotyped ones, but a couple of principled ways to subset the VCF would be to filter on imputation R^2 or on minor allele frequency, if you have access to that information (it is sometimes included in the INFO field). If neither is available, you could assume that all SNPs are equally informative, in which case randomly downsampling the non-header rows of the VCF would be "safe". You could downsample with different seeds and compare the outputs; that is not ideal, but it would give you some idea of what fraction of SNPs is needed to achieve consistent results. A sketch of both routes follows.
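
A hypothetical bcftools/coreutils sketch of both approaches (the INFO/R2 tag name, the 0.8 threshold, and the 1,500,000 target count are all assumptions; imputation tools name the field differently, e.g. Beagle writes DR2):

# option A: keep imputed sites with imputation R^2 above a threshold
bcftools view -i 'INFO/R2>0.8' input.vcf.gz -Oz -o r2_filtered.vcf.gz

# option B: random downsampling -- keep the header, sample the body,
# then restore coordinate order before handing the file to demuxlet
grep '^#' input.vcf > downsampled.unsorted.vcf
grep -v '^#' input.vcf | shuf -n 1500000 >> downsampled.unsorted.vcf
bcftools sort downsampled.unsorted.vcf -Ov -o downsampled.vcf

Repeating option B with different shuffles and comparing the resulting .best files is a cheap way to run the consistency check suggested above.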

yimmieg commented 9 months ago

Do you need this many SNPs? We often filter for variants in 1000 Genomes...

~J

hyunminkang commented 9 months ago

If you are using it for scRNA-seq, filtering on 1000G exonic SNPs with MAF > 1% (usually ~300K sites) should be sufficient.
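
For reference, a hedged sketch of that filter with bcftools (1000G_exonic_sites.vcf.gz is a placeholder for whatever 1000 Genomes exonic subset you use; -R requires both files to be bgzipped and indexed):

bcftools index -t input.vcf.gz
bcftools view -R 1000G_exonic_sites.vcf.gz -q 0.01:minor input.vcf.gz -Oz -o demuxlet_sites.vcf.gz
bcftools index -t demuxlet_sites.vcf.gz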