Closed martinghunt closed 6 years ago
I guess it's because of this stuff
NC_000962.3 1480873 . CGCTCGTG CGCGAGGC,CGCGAGGG,CGCGAGTC,CGCGAGTG,CGCGATGC,CGCGATGG,CGCGATTC,CGCGATTG,CGCGCGGC,CGCGCGGG,CGCGCGTC,CGCGCGTG,CGCGCTGC,CGCGCTGG,CGCGCTTC,CGCGCTTG,CGCTAGGC,CGCTAGGG,CGCTAGTC,CGCTAGTG,CGCTATGC,CGCTATGG,CGCTATTC,CGCTATTG,CGCTCGGC,CGCTCGGG,CGCTCGTC,CGCTCTGC,CGCTCTGG,CGCTCTTC,CGCTCTTG,CGGGAGGC,CGGGAGGG,CGGGAGTC,CGGGAGTG,CGGGATGC,CGGGATGG,CGGGATTC,CGGGATTG,CGGGCGGC,CGGGCGGG,CGGGCGTC,CGGGCGTG,CGGGCTGC,CGGGCTGG,CGGGCTTC,CGGGCTTG,CGGTAGGC,CGGTAGGG,CGGTAGTC,CGGTAGTG,CGGTATGC,CGGTATGG,CGGTATTC,CGGTATTG,CGGTCGGC,CGGTCGGG,CGGTCGTC,CGGTCGTG,CGGTCTGC,CGGTCTGG,CGGTCTTC,CGGTCTTG,CTCGAGGC,CTCGAGGG,CTCGAGTC,CTCGAGTG,CTCGATGC,CTCGATGG,CTCGATTC,CTCGATTG,CTCGCGGC,CTCGCGGG,CTCGCGTC,CTCGCGTG,CTCGCTGC,CTCGCTGG,CTCGCTTC,CTCGCTTG,CTCTAGGC,CTCTAGGG,CTCTAGTC,CTCTAGTG,CTCTATGC,CTCTATGG,CTCTATTC,CTCTATTG,CTCTCGGC,CTCTCGGG,CTCTCGTC,CTCTCGTG,CTCTCTGC,CTCTCTGG,CTCTCTTC,CTCTCTTG,CTGGAGGC,CTGGAGGG,CTGGAGTC,CTGGAGTG,CTGGATGC,CTGGATGG,CTGGATTC,CTGGATTG,CTGGCGGC,CTGGCGGG,CTGGCGTC,CTGGCGTG,CTGGCTGC,CTGGCTGG,CTGGCTTC,CTGGCTTG,CTGTAGGC,CTGTAGGG,CTGTAGTC,CTGTAGTG,CTGTATGC,CTGTATGG,CTGTATTC,CTGTATTG,CTGTCGGC,CTGTCGGG,CTGTCGTC,CTGTCGTG,CTGTCTGC,CTGTCTGG,CTGTCTTC,CTGTCTTG,GGCGAGGC,GGCGAGGG,GGCGAGTC,GGCGAGTG,GGCGATGC,GGCGATGG,GGCGATTC,GGCGATTG,GGCGCGGC,GGCGCGGG,GGCGCGTC,GGCGCGTG,GGCGCTGC,GGCGCTGG,GGCGCTTC,GGCGCTTG,GGCTAGGC,GGCTAGGG,GGCTAGTC,GGCTAGTG,GGCTATGC,GGCTATGG,GGCTATTC,GGCTATTG,GGCTCGGC,GGCTCGGG,GGCTCGTC,GGCTCGTG,GGCTCTGC,GGCTCTGG,GGCTCTTC,GGCTCTTG,GGGGAGGC,GGGGAGGG,GGGGAGTC,GGGGAGTG,GGGGATGC,GGGGATGG,GGGGATTC,GGGGATTG,GGGGCGGC,GGGGCGGG,GGGGCGTC,GGGGCGTG,GGGGCTGC,GGGGCTGG,GGGGCTTC,GGGGCTTG,GGGTAGGC,GGGTAGGG,GGGTAGTC,GGGTAGTG,GGGTATGC,GGGTATGG,GGGTATTC,GGGTATTG,GGGTCGGC,GGGTCGGG,GGGTCGTC,GGGTCGTG,GGGTCTGC,GGGTCTGG,GGGTCTTC,GGGTCTTG,GTCGAGGC,GTCGAGGG,GTCGAGTC,GTCGAGTG,GTCGATGC,GTCGATGG,GTCGATTC,GTCGATTG,GTCGCGGC,GTCGCGGG,GTCGCGTC,GTCGCGTG,GTCGCTGC,GTCGCTGG,GTCGCTTC,GTCGCTTG,GTCTAGGC,GTCTAGGG,GTCTAGTC,GTCTAGTG,GTCTATGC,GTCTATGG,GTCTATTC,GTCTATTG,GTCTCGGC,GTCTCGGG,GTCTCGTC,GTCTCGT
G,GTCTCTGC,GTCTCTGG,GTCTCTTC,GTCTCTTG,GTGGAGGC,GTGGAGGG,GTGGAGTC,GTGGAGTG,GTGGATGC,GTGGATGG,GTGGATTC,GTGGATTG,GTGGCGGC,GTGGCGGG,GTGGCGTC,GTGGCGTG,GTGGCTGC,GTGGCTGG,GTGGCTTC,GTGGCTTG,GTGTAGGC,GTGTAGGG,GTGTAGTC,GTGTAGTG,GTGTATGC,GTGTATGG,GTGTATTC,GTGTATTG,GTGTCGGC,GTGTCGGG,GTGTCGTC,GTGTCGTG,GTGTCTGC,GTGTCTGG,GTGTCTTC,GTGTCTTG . PASS SVTYPE=COMPLEX
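To find records like this one in a larger VCF, a rough sketch in Python that counts the ALT alleles on a single record line. The function name is illustrative and not part of gramtools, and it assumes a whitespace-separated record with ALT in the fifth column:

```python
def alt_allele_count(vcf_line):
    """Return the number of ALT alleles on one VCF record line.

    Assumes columns are whitespace-separated (tabs in a real VCF) and
    that ALT is the fifth column; "." means no ALT allele.
    """
    fields = vcf_line.split()
    alt = fields[4]
    return 0 if alt == "." else len(alt.split(","))


# Toy record, not taken from the attached VCF files.
toy = "chr1 100 . A C,G,T . PASS SVTYPE=COMPLEX"
print(alt_allele_count(toy))
```

Running this over every non-header line and sorting by the count would point at the sites most likely to blow up the kmer index.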
This is what we've discussed off-line since opening this issue:
@martinghunt has reported that reducing the kmer size to 5 solved the problem (as expected).
I'm convinced that this issue occurs whilst building the kmer index and due to the following circumstances:
The lowest-cost solution is to reduce the kmer size and thus produce a smaller kmer index. I will open new issues to discuss the kmer size parameter further.
Agreed
I have another example where LSF kills gramtools build because it hits the memory limit of 60GB. I used a kmer length of 5, which means there can be at most 4^5 = 1024 kmers. VCF file: split.16.in.vcf.gz. It uses the same reference, NC_000962.3, as above.
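The bound on the number of kmers can be checked directly. A minimal sketch in Python, assuming the plain ACGT alphabet (the function name is illustrative, not part of gramtools):

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Yield every possible kmer of length k over the given alphabet."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


kmers = list(all_kmers(5))
# 4 choices at each of 5 positions gives 4**5 distinct kmers.
assert len(kmers) == 4 ** 5 == 1024
```

With so few distinct kmers possible at k=5, the final set of kmers on its own seems unlikely to account for tens of GB, which would point at intermediate data built during indexing rather than the index itself.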
Reopening. Martin, let's find a workaround; Robyn is on holiday all week.
After looking at an example here, I think there has to be some kind of memory bug. Martin has examples with small VCFs covering about a kb, with ~1 or 2 sites that have a lot of alleles, but even with k=5 they still use >30GB of RAM.
Robyn, is it possible that you enumerate all kmers across adjacent heavily multiallelic sites and only remove duplicates after the enumeration has finished?
The kmer size was long enough to connect sites which had thousands of alleles. This meant that extracting the minimal number of kmers whilst building the kmer index was costly.
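To illustrate why this gets expensive, here is a toy sketch in Python. It is not gramtools' actual implementation: it models each site as a plain list of allele strings, ignores the flanking reference sequence, and uses hypothetical function names. The number of allele paths across adjacent sites is the product of their allele counts, while the number of distinct kmers can be far smaller, so deduplicating during enumeration keeps memory bounded by the distinct kmers rather than by the paths.

```python
from itertools import product


def path_count(sites):
    """Number of allele combinations across adjacent sites."""
    total = 1
    for alleles in sites:
        total *= len(alleles)
    return total


def distinct_kmers(sites, k):
    """Collect the distinct kmers over all allele paths.

    Paths are generated lazily, one at a time, and kmers are added to a
    set as they are seen, so the full list of paths is never stored.
    """
    seen = set()
    for combo in product(*sites):
        seq = "".join(combo)
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    return seen


# Toy data: two adjacent sites, each with all 64 possible 3-mers as alleles.
three_mers = ["".join(p) for p in product("ACGT", repeat=3)]
sites = [three_mers, three_mers]

print(path_count(sites))              # 64 * 64 allele paths
print(len(distinct_kmers(sites, 5)))  # distinct 5-mers among them
```

This only bounds memory; the running time is still proportional to the number of paths, which is why a shorter kmer that spans fewer sites also helps.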
I've made two changes as a result of this issue:
Will leave this issue open until we're happy that no further changes need to be made in the short term.
Memory use of gramtools is down significantly now. P. falciparum was >100GB and possibly >600GB, and is now 35GB.
This VCF file: calls_vcf.txt causes big memory usage with gramtools build. It died on the cluster when it hit the 40GB RAM limit.
Reference for that VCF file is here: https://www.ncbi.nlm.nih.gov/nuccore/NC_000962.3
Gramtools build report JSON file: build_report.json.txt