iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Big memory usage on small VCF #110

Closed martinghunt closed 6 years ago

martinghunt commented 6 years ago

This VCF file (calls_vcf.txt) causes very high memory usage with gramtools build. The job died on the cluster when it hit the 40GB RAM limit.

Reference for that VCF file is here: https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3

Gramtools build report json file: build_report.json.txt

iqbal-lab commented 6 years ago

I guess it's because of this stuff

NC_000962.3 1480873 .   CGCTCGTG    CGCGAGGC,CGCGAGGG,CGCGAGTC,CGCGAGTG,CGCGATGC,CGCGATGG,CGCGATTC,CGCGATTG,CGCGCGGC,CGCGCGGG,CGCGCGTC,CGCGCGTG,CGCGCTGC,CGCGCTGG,CGCGCTTC,CGCGCTTG,CGCTAGGC,CGCTAGGG,CGCTAGTC,CGCTAGTG,CGCTATGC,CGCTATGG,CGCTATTC,CGCTATTG,CGCTCGGC,CGCTCGGG,CGCTCGTC,CGCTCTGC,CGCTCTGG,CGCTCTTC,CGCTCTTG,CGGGAGGC,CGGGAGGG,CGGGAGTC,CGGGAGTG,CGGGATGC,CGGGATGG,CGGGATTC,CGGGATTG,CGGGCGGC,CGGGCGGG,CGGGCGTC,CGGGCGTG,CGGGCTGC,CGGGCTGG,CGGGCTTC,CGGGCTTG,CGGTAGGC,CGGTAGGG,CGGTAGTC,CGGTAGTG,CGGTATGC,CGGTATGG,CGGTATTC,CGGTATTG,CGGTCGGC,CGGTCGGG,CGGTCGTC,CGGTCGTG,CGGTCTGC,CGGTCTGG,CGGTCTTC,CGGTCTTG,CTCGAGGC,CTCGAGGG,CTCGAGTC,CTCGAGTG,CTCGATGC,CTCGATGG,CTCGATTC,CTCGATTG,CTCGCGGC,CTCGCGGG,CTCGCGTC,CTCGCGTG,CTCGCTGC,CTCGCTGG,CTCGCTTC,CTCGCTTG,CTCTAGGC,CTCTAGGG,CTCTAGTC,CTCTAGTG,CTCTATGC,CTCTATGG,CTCTATTC,CTCTATTG,CTCTCGGC,CTCTCGGG,CTCTCGTC,CTCTCGTG,CTCTCTGC,CTCTCTGG,CTCTCTTC,CTCTCTTG,CTGGAGGC,CTGGAGGG,CTGGAGTC,CTGGAGTG,CTGGATGC,CTGGATGG,CTGGATTC,CTGGATTG,CTGGCGGC,CTGGCGGG,CTGGCGTC,CTGGCGTG,CTGGCTGC,CTGGCTGG,CTGGCTTC,CTGGCTTG,CTGTAGGC,CTGTAGGG,CTGTAGTC,CTGTAGTG,CTGTATGC,CTGTATGG,CTGTATTC,CTGTATTG,CTGTCGGC,CTGTCGGG,CTGTCGTC,CTGTCGTG,CTGTCTGC,CTGTCTGG,CTGTCTTC,CTGTCTTG,GGCGAGGC,GGCGAGGG,GGCGAGTC,GGCGAGTG,GGCGATGC,GGCGATGG,GGCGATTC,GGCGATTG,GGCGCGGC,GGCGCGGG,GGCGCGTC,GGCGCGTG,GGCGCTGC,GGCGCTGG,GGCGCTTC,GGCGCTTG,GGCTAGGC,GGCTAGGG,GGCTAGTC,GGCTAGTG,GGCTATGC,GGCTATGG,GGCTATTC,GGCTATTG,GGCTCGGC,GGCTCGGG,GGCTCGTC,GGCTCGTG,GGCTCTGC,GGCTCTGG,GGCTCTTC,GGCTCTTG,GGGGAGGC,GGGGAGGG,GGGGAGTC,GGGGAGTG,GGGGATGC,GGGGATGG,GGGGATTC,GGGGATTG,GGGGCGGC,GGGGCGGG,GGGGCGTC,GGGGCGTG,GGGGCTGC,GGGGCTGG,GGGGCTTC,GGGGCTTG,GGGTAGGC,GGGTAGGG,GGGTAGTC,GGGTAGTG,GGGTATGC,GGGTATGG,GGGTATTC,GGGTATTG,GGGTCGGC,GGGTCGGG,GGGTCGTC,GGGTCGTG,GGGTCTGC,GGGTCTGG,GGGTCTTC,GGGTCTTG,GTCGAGGC,GTCGAGGG,GTCGAGTC,GTCGAGTG,GTCGATGC,GTCGATGG,GTCGATTC,GTCGATTG,GTCGCGGC,GTCGCGGG,GTCGCGTC,GTCGCGTG,GTCGCTGC,GTCGCTGG,GTCGCTTC,GTCGCTTG,GTCTAGGC,GTCTAGGG,GTCTAGTC,GTCTAGTG,GTCTATGC,GTCTATGG,GTCTATTC,GTCTATTG,GTCTCGGC,GTCTCGGG,GTCTCGTC,GTCTCGTG,GTCTCTGC,GTCTCTGG,GTCTCTTC,GTCTCTTG,GTGGAGGC,GTGGAGGG,GTGGAGTC,GTGGAGTG,GTGGATGC,GTGGATGG,GTGGATTC,GTGGATTG,GTGGCGGC,GTGGCGGG,GTGGCGTC,GTGGCGTG,GTGGCTGC,GTGGCTGG,GTGGCTTC,GTGGCTTG,GTGTAGGC,GTGTAGGG,GTGTAGTC,GTGTAGTG,GTGTATGC,GTGTATGG,GTGTATTC,GTGTATTG,GTGTCGGC,GTGTCGGG,GTGTCGTC,GTGTCGTG,GTGTCTGC,GTGTCTGG,GTGTCTTC,GTGTCTTG  .   PASS    SVTYPE=COMPLEX

ffranr commented 6 years ago

This is what we've discussed off-line since opening this issue:

ffranr commented 6 years ago

@martinghunt has reported that reducing the kmer size to 5 solved the problem (as expected).

I'm convinced that this issue occurs whilst building the kmer index and due to the following circumstances:

The lowest cost solution is to reduce the kmer size and thus produce a smaller kmer index. Will open new issues to discuss the kmer size parameter further.

iqbal-lab commented 6 years ago

Agreed

martinghunt commented 6 years ago

I have another example where LSF kills gramtools build because it hits the memory limit of 60GB. I used a kmer length of 5, which means there can be at most 4^5=1024 kmers. VCF file: split.16.in.vcf.gz. Same reference (NC_000962.3) as above.
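
As a quick sanity check of the 4^5 bound quoted above, here is a minimal Python sketch. The helper names `max_distinct_kmers` and `all_kmers` are made up for illustration and are not part of gramtools. Note that the bound only limits how many distinct kmers can exist; it says nothing about the work or memory needed to find them in a graph with many alleles.

```python
from itertools import product


def max_distinct_kmers(k, alphabet="ACGT"):
    """Upper bound on the number of distinct kmers of length k."""
    return len(alphabet) ** k


def all_kmers(k, alphabet="ACGT"):
    """Lazily generate every possible kmer of length k."""
    return ("".join(p) for p in product(alphabet, repeat=k))


if __name__ == "__main__":
    k = 5
    print(max_distinct_kmers(k))          # 4**5
    print(sum(1 for _ in all_kmers(k)))   # same count, by enumeration
```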

iqbal-lab commented 6 years ago

Reopening. Martin, let's find a workaround for now; Robyn is on holiday all week.

iqbal-lab commented 6 years ago

After looking at an example here, I think there has to be some kind of memory bug. Martin has examples with small VCFs covering a kb and ~1 or 2 sites with a lot of alleles, but using k=5 they still use >30GB of RAM.

iqbal-lab commented 6 years ago

Robyn, is it possible that you enumerate all kmers across adjacent heavily multiallelic sites and only remove duplicates after the enumeration?
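
To make the question concrete, here is a toy Python sketch of the two strategies. It is an assumption-laden illustration, not gramtools' actual implementation: the function names, the representation of sites as plain lists of allele strings, and the two 256-allele sites are all invented for the example. Both versions walk every allele combination; the difference is that the first holds every kmer occurrence in memory before deduplicating, while the second never holds more than the distinct set.

```python
from itertools import product
from math import prod


def kmers_of(seq, k):
    """Yield every kmer of length k in seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def enumerate_then_dedup(sites, k):
    """Materialise every kmer from every allele path, then deduplicate."""
    every_kmer = [kmer
                  for path in product(*sites)
                  for kmer in kmers_of("".join(path), k)]
    return len(every_kmer), set(every_kmer)


def dedup_while_enumerating(sites, k):
    """Keep only the distinct kmers while walking the allele paths."""
    seen = set()
    for path in product(*sites):
        seen.update(kmers_of("".join(path), k))
    return seen


# Toy input (hypothetical): two adjacent sites, each carrying all 256
# possible 4-base alleles.
alleles = ["".join(p) for p in product("ACGT", repeat=4)]
sites = [alleles, alleles]
k = 5

n_paths = prod(len(site) for site in sites)
total, distinct = enumerate_then_dedup(sites, k)
streamed = dedup_while_enumerating(sites, k)

if __name__ == "__main__":
    print(n_paths, total, len(distinct), len(streamed))
```

In this toy case the intermediate list is a few hundred times larger than the final set, and the gap grows with the product of the allele counts of the sites a kmer can span.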

ffranr commented 6 years ago

The kmer size was long enough to connect sites which had thousands of alleles. This meant that extracting the minimal number of kmers whilst building the kmer index was costly.

I've made two changes as a result of this issue:

Will leave this issue open until we're happy that no further changes need to be made in the short term.

iqbal-lab commented 6 years ago

Memory use of gramtools is down significantly now. P. falciparum was >100GB and possibly >600GB, and is now 35GB.