medvedevgroup / vargeno

Towards fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics.
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty641/5056043
MIT License

Segmentation fault during geno step #5

Open ldenti opened 5 years ago

ldenti commented 5 years ago

Hi, when I try to run vargeno on the same data linked in my previous issue (https://github.com/medvedevgroup/vargeno/issues/2), it crashes during the geno step.

This is the output of vargeno index:

[BloomFilter constructBfFromGenomeseq] bit vector: 1130814221/9600000000
[BloomFilter constructBfFromGenomeseq] lite bit vector: 2131757218/18400000000
[BloomFilter constructBfFromVCF] bit vector: 68265608/1120000000
SNP Dictionary
Total k-mers:        2593345952
Unambig k-mers:      2367171409
Ambig unique k-mers: 37905369
Ambig total k-mers:  226174543
Ref Dictionary
Total k-mers:        2858648351
Unambig k-mers:      2488558606
Ambig unique k-mers: 61723937
Ambig total k-mers:  370089745

and these are the files produced during the index step:

4.0K    vargeno.RMNISTHS_30xdownsample.index.chrlens
1.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf
2.2G    vargeno.RMNISTHS_30xdownsample.index.ref.bf.lite.bf
34G     vargeno.RMNISTHS_30xdownsample.index.ref.dict
134M    vargeno.RMNISTHS_30xdownsample.index.snp.bf
39G     vargeno.RMNISTHS_30xdownsample.index.snp.dict

When running the geno step, vargeno prints "Processing..." and crashes shortly thereafter:

Initializing...
Processing...
Segmentation fault (core dumped)

\time reports that the process is terminated by signal 11, but I'm not sure where this happens. At first I thought it was due to RAM saturation (the machine used to test the tool has 256 GB of RAM), but the same behaviour occurs on a cluster with 1 TB of RAM.
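
In case it helps localize the crash, here is a minimal sketch of a SIGSEGV handler (glibc's backtrace/backtrace_symbols_fd) that I could temporarily graft into vargeno's main(); this is not vargeno code, just something I would try on a Linux/glibc build compiled with -g:

// Hypothetical debugging aid (not from vargeno): install a SIGSEGV handler
// that prints a raw backtrace to stderr, so the crash site is visible even
// without keeping the core dump. Linux/glibc only; build with -g.
#include <csignal>
#include <cstdlib>
#include <execinfo.h>
#include <unistd.h>

static void segv_handler(int sig) {
    void *frames[64];
    int n = backtrace(frames, 64);
    // backtrace_symbols_fd is async-signal-safe (unlike printf/malloc).
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

int main() {
    // Call backtrace() once up front so its lazy libgcc setup happens
    // outside the signal handler.
    void *dummy[1];
    backtrace(dummy, 1);
    std::signal(SIGSEGV, segv_handler);

    // ... the rest of main() as in the real program ...
    return 0;
}

The handler would at least show which function the segfault occurs in, even when the core dump is too large to keep around.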

Anyway, I also tried running vargeno on a smaller set of variants (I halved the input VCF), and in that case it completes the analysis successfully.

The complete VCF contains 84739838 variants and the sample consists of 696168435 reads. The whole (unzipped) dataset takes up ~240 GB of disk space. If you want to reproduce this behaviour on your machine, I can share the data with you.
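
One purely speculative guess: both dictionaries report more than 2^31 k-mers (2593345952 and 2858648351), and the .dict files are far larger than 4 GiB, so any counter, array index, or file offset held in a signed 32-bit type would overflow at this scale. I have not looked at vargeno's internals, so the snippet below only spells out the arithmetic behind the guess; the variable names are mine, not vargeno's:

// Hypothetical sanity check (not vargeno code): compare the k-mer totals
// reported by `vargeno index` against the limits of common integer types,
// to illustrate why I suspect a 32-bit overflow somewhere at this scale.
#include <cstdint>
#include <iostream>

int main() {
    const uint64_t snp_total_kmers = 2593345952ULL;  // "SNP Dictionary: Total k-mers"
    const uint64_t ref_total_kmers = 2858648351ULL;  // "Ref Dictionary: Total k-mers"

    std::cout << "INT32_MAX  = " << INT32_MAX << '\n'
              << "UINT32_MAX = " << UINT32_MAX << '\n'
              << "SNP k-mers = " << snp_total_kmers
              << (snp_total_kmers > INT32_MAX ? "  (> INT32_MAX)" : "") << '\n'
              << "Ref k-mers = " << ref_total_kmers
              << (ref_total_kmers > INT32_MAX ? "  (> INT32_MAX)" : "") << '\n';

    // Both totals still fit in an unsigned 32-bit value, but any signed
    // 32-bit counter or index has already wrapped around, and the 34G/39G
    // .dict files are far beyond what a 32-bit file offset can address.
    return 0;
}

Halving the VCF roughly halves the SNP dictionary, which would bring its k-mer count back under the signed 32-bit limit, so this would also be consistent with the smaller run succeeding.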

Luca

bbsunchen commented 5 years ago

Hi Luca, I am working on it and will let you know when I fix it.