iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License
92 stars 15 forks source link

Something funny with kmer indexing Pf #113

Closed iqbal-lab closed 6 years ago

iqbal-lab commented 6 years ago

I'm building the PRG of the New P falciparum PRG. At k=5, the construction works fine, and completes in 6 hours on 1 core

` python3 -m venv gramtools_tip_commit442 source gramtools_tip_commit442/bin/activate pip3 install --upgrade git+https://github.com/iqbal-lab-org/gramtools@442ddfe7d68fca86ad97a123792673edff815b04

gramtools --version { "version_number": "0.5.0", "last_git_commit_hash": "442ddfe7d68fca86ad97a123792673edff815b04", "truncated_git_commits": [ "442ddfe - Robyn Ffrancon, 1525969734 : enhancement: per allele base coverage counter capped at uint16_t max", "ff3e2d6 - Robyn Ffrancon, 1525871127 : enhancement: if kmer size 8 or less, generate all possible kmers and build kmer index", "8db0f15 - Robyn Ffrancon, 1525860767 : enhancement: paths no longer memorised during building kmer index; cleanup", "c806726 - Robyn Ffrancon, 1524587729 : added docker config file", "63c175a - Robyn Ffrancon, 1523897681 : fix: per base coverage recorded when allele length greater than read length" ] }`

That's the basic setup.

Then this is what happened for k=5 - I went a bit over the top and asked for 240Gb of RAM:

bsub.py 240 pf5 gramtools build --gram-directory ./gramk5 --vcf prg_construction/data/pf3k_and_DBPMSPS1and2.vcf --reference reference/Pfalciparum.genome.fasta --max-read-length 150 --kmer-size 5 \ --debug

This worked! Used 35Gb of RAM

` Finished printing linear PRG. Final number in alphabet is 3216844

maximum thread count: 1 Executing build command Generating integer encoded PRG Number of charecters in integer encoded linear PRG: 47471952 Generating FM-Index Generating PRG masks Building kmer index Getting all kmers Getting kmer prefix diffs Indexing kmers Total number of unique kmers: 1024

Timer report: seconds Encoded PRG 2.37 Generate FM-Index 129.52 Generating PRG masks 177.22 Building kmer index 21015.5

Total elapsed time: anf th21324.6

`

This is fine. Now I try k=15

bsub.py 200 pf15 gramtools build --gram-directory ./gramk15 --vcf prg_construction/data/pf3k_and_DBPMSPS1and2.vcf --reference reference/Pfalciparum.genome.fasta --max-read-length 150 --kmer-size 15 --debug

I launch that around May 11 14:33 (ie over 48 hours ago).

and this is what I see

` Finished printing linear PRG. Final number in alphabet is 3216844

maximum thread count: 1 Executing build command Generating integer encoded PRG Number of charecters in integer encoded linear PRG: 47471952 Generating FM-Index Generating PRG masks Building kmer index Getting all kmers Processed paths: 1000000, total expected: 4194304 Processed paths: 2000000, total expected: 4194304 Processed paths: 3000000, total expected: 4194304 Processed paths: 4000000, total expected: 4194304 Processed paths: 1000000, total expected: 2097152 Processed paths: 2000000, total expected: 2097152 Processed paths: 1000000, total expected: 4194304 Processed paths: 2000000, total expected: 4194304 Processed paths: 3000000, total expected: 4194304 Processed paths: 4000000, total expected: 4194304 Processed paths: 1000000, total expected: 2097152 Processed paths: 2000000, total expected: 2097152 Processed paths: 1000000, total expected: 33554432 Processed paths: 2000000, total expected: 33554432 .... Processed paths: 1826000000, total expected: 4398046511104 Processed paths: 1827000000, total expected: 4398046511104 Processed paths: 1828000000, total expected: 4398046511104 Processed paths: 1829000000, total expected: 4398046511104 Processed paths: 1830000000, total expected: 4398046511104 Processed paths: 1831000000, total expected: 4398046511104 Processed paths: 1832000000, total expected: 4398046511104 Processed paths: 1833000000, total expected: 4398046511104 `

So - what is going on - expected 4 x10^12 paths?

ffranr commented 6 years ago

@iqbal-lab I'm not surprised by a very large number of paths in this instance. A kmer size of 15 probably means that several sites must be considered simultaneously when evaluating all of the possible paths through a particular part of the PRG. All of the possible paths must be investigated to ensure kmers are not missed.

The number of paths scales as the product of the number of alleles considered across a set of sites.

iqbal-lab commented 6 years ago

Ah I see what the paths are.

iqbal-lab commented 6 years ago

How are they counted though? Is that denominator an estimate? Current status:

2806000000, total expected: 4398046511104

That does suggest that waiting for build to complete on 1 core is not going to work.

ffranr commented 6 years ago

The count is not an estimate. It is calculated from the number of alleles and number of sites that must be considered given a kmers size of 15.

iqbal-lab commented 6 years ago

OK, in that case any solution must involve parallelising I guess. I'll see how long the various k's take

iqbal-lab commented 6 years ago

Closing this then, there is no bug