statgen / demuxlet

Genetic multiplexing of barcoded single cell RNA-seq
Apache License 2.0
117 stars 25 forks source link

gpAB has a size limit (causes seg faults for large # snps and genotypes) #15

Closed wac closed 6 years ago

wac commented 6 years ago

Specifically, even on a 64-bit system there is a limit on the array sizes, for example on the usual intel systems if

scl.nsnps nv nv * 9

is greater than the max for an integer (2^32 or about 4.3 billion) it will not be able to allocate. For example, something like scl.nsnps=3M and nv=20 would cause segmentation faults.

yimmieg commented 6 years ago

hey wac, i'd recommend filtering the VCFs to only coding. that should also speed things up significantly.

danielseaton commented 6 years ago

Hi @yimmieg ,

I'm having the same issue. I have reduced my vcfs to common exonic variants from ~115 individuals (the individuals who I have data for). The computation is still fast, and I get some output in the 'output.single' file, but the .best and .sing2 files are empty.

The end of the output to console looks like this:

NOTICE [2018/06/02 07:59:10] - Identifying best-matching individual.. NOTICE [2018/06/02 07:59:10] - Processing 1000 droplets... NOTICE [2018/06/02 07:59:11] - Processing 2000 droplets... NOTICE [2018/06/02 07:59:11] - Finished processing 2645 droplets total ./run_demuxlet.sh: line 28: 7546 Segmentation fault (core dumped) /nfs/software/stegle/users/huangh/bin/bin/demuxlet --sam $INBAM --vcf $INVCF --doublet-prior 0.05 --group-list $INBARCODES --out $OUTFILE

I can get around this OK by subsetting the vcf for each individual sample, to the set of genotypes in that sample.

It would be nice if I could run all the genotypes together, though, as then I'd be able to identify sample swaps very easily.

hyunminkang commented 6 years ago

Does it report how many reads and SNPs were informative? Could you paste the following numbers?

Total number input reads: Total number valid droplets observed: Total number valid SNPs observed: Total number of pass-filtered reads:

Most likely this would be a memory problem. Unless you multiplexed all 115 individuals, try to use --sm-list option to specify which subset of samples you want to consider for demultiplexing.

Hyun.

On Sat, Jun 2, 2018 at 4:31 PM Daniel Seaton notifications@github.com wrote:

Hi @yimmieg https://github.com/yimmieg ,

I'm having the same issue. I have reduced my vcfs to common exonic variants from ~115 individuals (the individuals who I have data for). The computation is still fast, and I get some output in the 'output.single' file, but the .best and .sing2 files are empty.

The end of the output to console looks like this:

NOTICE [2018/06/02 07:59:10] - Identifying best-matching individual.. NOTICE [2018/06/02 07:59:10] - Processing 1000 droplets... NOTICE [2018/06/02 07:59:11] - Processing 2000 droplets... NOTICE [2018/06/02 07:59:11] - Finished processing 2645 droplets total ./run_demuxlet.sh: line 28: 7546 Segmentation fault (core dumped) /nfs/software/stegle/users/huangh/bin/bin/demuxlet --sam $INBAM --vcf $INVCF --doublet-prior 0.05 --group-list $INBARCODES --out $OUTFILE

I can get around this OK by subsetting the vcf for each individual sample, to the set of genotypes in that sample.

It would be nice if I could run all the genotypes together, though, as then I'd be able to identify sample swaps very easily.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/statgen/demuxlet/issues/15#issuecomment-394066417, or mute the thread https://github.com/notifications/unsubscribe-auth/AF-Ouci8c2yNFtQF8ol0VBojBTlkzw73ks5t4j9ggaJpZM4SY7zo .

danielseaton commented 6 years ago

Total number input reads: 424010307 Total number valid droplets observed: 2259 Total number valid SNPs observed: 86013 ('markers from the VCF files') Total number of pass-filtered reads: 682154

It uses about 12GB of memory.

Unless you multiplexed all 115 individuals, try to use --sm-list option to specify which subset of samples you want to consider for demultiplexing.

Great, thanks, I missed this option, I will do this. No, I don't have more than 30 in any one sample, and things worked fine when I was using those smaller VCFs.

wac commented 6 years ago

For runs involving large numbers of SNPs, I modified the code to compute gpAB on-demand rather than allocating and precomputing. Mainly, removing the allocation and computation of gpAB and replacing the call

p = gpAB[isnpnvnv9 + jnv9 + k9 + l*3 + m];

with

p = scl.snps[isnp].gps[j3+l] scl.snps[isnp].gps[k*3+m];

There's probably some threshold where it doesn't make sense to precompute the gpAB matrix and it's cheaper to just do it on-the-fly, or gpAB won't fit in memory.