Benchmarking Plasmodium: build and quasimap

iqbal-lab commented 6 years ago

Benchmarking build+quasimap of the new Plasmodium PRG built from 2.4 million variants + DBLMSP 1+2. Calling this dataset pf_wg_and_dblmsps_v1. PRG here: /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/

I'm raising this now and will fill in the tables over the next day or two; then we can close or follow up. Update: rerun with a new commit, now using a dedicated, non-shared server.

gramtools --version
{
    "version_number": "0.5.0",
    "last_git_commit_hash": "12ab776c2452aacd6c8a2248c834907b62b93f66",
    "truncated_git_commits": [
        "12ab776 - Robyn Ffrancon, 1528306963 : enhancement: added kmer size to stdout status message",
        "42018b4 - Robyn Ffrancon, 1528305111 : fix: when generating all kmers, correct ordering assured; increased kmer size threshold: <= 10",
        "72786ae - Robyn Ffrancon, 1528213271 : fix: travis build log within memory limit",
        "e440cc1 - Robyn Ffrancon, 1528202882 : enhancement: note added to stdout to explain that read counts include reverse complement",
        "58608bf - Robyn Ffrancon, 1528201130 : enhancement: infer command can produce vcf output if build used vcf + reference"
    ]
}

Build benchmarks

example command line

gramtools build --gram-directory ./gramk6 --vcf prg_construction/data/pf3k_and_DBPMSPS1and2.vcf --reference reference/Pfalciparum.genome.fasta --max-read-length 150 --kmer-size 6 --debug

kmer	CPUs	encode PRG (s)	gen. FM index (s)	masks (s)	kmer index (s)	max RAM
7	1	2	127	169	2259 (44 mins)	52 GB
9	1	2	127	168	3810 (1 h 11 mins)	79 GB
10	1	2	128	170	4704 (1 h 24 mins)	94 GB

Quasimapping benchmark

example command

gramtools quasimap --gram-directory gramk5/ --reads temp/out.fq.gz --kmer-size 5 --debug --max-threads 10

We are mapping 33,557,648 reads (quality trimmed, length <= 76 bp) sequenced from the GB4 strain.

Reads:

/nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/temp/GB4.fastq.trimmed.gz

On old commit, d8a3082a921579e65081fa1932c42c4f2fb7953a

kmer	CPUs	Load data (sec)	Quasimap (sec)	Human experienced time	Reads/sec/CPU
5	1	37	17784 (5 hrs)	5 hrs	1887
5	8	37	29117 (8 hrs)	1 hr 45 mins	?
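
The Reads/sec/CPU value in the single-CPU row can be reproduced from the read count and the quasimap time. A minimal sketch in Python: the function name is hypothetical, and it assumes the "Quasimap (sec)" column is CPU time summed over all threads, which the thread does not state. For that reason only the single-CPU row is computed, and the "?" entries are left alone.

```python
def reads_per_sec_per_cpu(n_reads: int, quasimap_cpu_seconds: float) -> float:
    """Per-CPU throughput, assuming the time given is total CPU seconds."""
    return n_reads / quasimap_cpu_seconds


N_READS = 33_557_648  # trimmed GB4 reads, as stated above

# Single-CPU run on the old commit: 17784 s of quasimapping.
single_cpu = reads_per_sec_per_cpu(N_READS, 17784)
print(round(single_cpu))  # 1887
```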

On current commit 12ab776c2452aacd6c8a2248c834907b62b93f66

kmer	CPUs	Load data (sec)	Quasimap (sec)	Human experienced time	Reads/sec/CPU
10	8	43	139639 (38.8 hrs)	7 hrs 10 mins	?

iqbal-lab commented 6 years ago

Rerunning the above with new tip (12ab776c2452aacd6c8a2248c834907b62b93f66) - will replace all the above numbers

bricoletc commented 5 years ago

Benchmarking now from 8b46a86a

Build (using --all-kmers)

Data in (yoda cluster):
VCF: /nfs/leia/research/iqbal/bletcher/Pf_benchmark/ref_data/pf3k_and_DBPMSPS1and2.vcf
Ref: /nfs/leia/research/iqbal/bletcher/Pf_benchmark/ref_data/Pfalciparum.genome.fasta

kmer	num_kmers	CPUs	encode PRG (s)	gen. FM index (s)	masks (s)	kmer index (s)	max RAM
11	4194304	1	2	131	182	5927 (1 h 38 mins)	101.4 GB
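
The num_kmers value in this row equals the number of possible kmers of length 11 over the four DNA bases. That is consistent with --all-kmers indexing every possible kmer, though this is an inference from the number, not something confirmed in this thread. A small Python check (the helper name is hypothetical):

```python
from itertools import product


def num_all_kmers(k: int, alphabet: str = "ACGT") -> int:
    """Number of distinct strings of length k over the given alphabet."""
    return len(alphabet) ** k


print(num_all_kmers(11))  # 4194304

# Sanity check by explicit enumeration for a small k.
assert sum(1 for _ in product("ACGT", repeat=3)) == num_all_kmers(3)
```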

bricoletc commented 4 years ago

I will now close this issue and open a fresh one.

The reason is that we used to build the PRG string using zam's Perl module, which skipped any overlapping variants in the VCF.

Now we use cluster_vcf_records, which does handle them.

The consequence is that the constructed PRG string is much bigger.

In 8b46a86 the PRG string for this benchmarking dataset contains 79,687,317 distinct integers (DNA + variant markers), whereas in a6e9094, which introduces the change, there are 457,513,813. This is 5.7 times bigger.
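
The size ratio quoted above follows directly from the two counts. A quick Python check (the variable names are illustrative only):

```python
old_size = 79_687_317    # PRG string size in 8b46a86
new_size = 457_513_813   # PRG string size in a6e9094

ratio = new_size / old_size
print(f"{ratio:.1f}x")  # 5.7x
```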

I'll now run benchmarks on this

iqbal-lab commented 4 years ago

Go Brice!
