iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Benchmarking Plasmodium: build and quasimap #117

Closed iqbal-lab closed 4 years ago

iqbal-lab commented 6 years ago

Benchmarking build+quasimap of the new plasmodium PRG built from 2.4 million variants + DBLMSP 1+2. Calling this dataset pf_wg_and_dblmsps_v1. PRG here: /nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/

I'm raising this now, will fill in the table over the next day or two, and then we can close or follow up. Updated with the new commit, and now using a dedicated, non-shared server.

gramtools --version
{
    "version_number": "0.5.0",
    "last_git_commit_hash": "12ab776c2452aacd6c8a2248c834907b62b93f66",
    "truncated_git_commits": [
        "12ab776 - Robyn Ffrancon, 1528306963 : enhancement: added kmer size to stdout status message",
        "42018b4 - Robyn Ffrancon, 1528305111 : fix: when generating all kmers, correct ordering assured; increased kmer size threshold: <= 10",
        "72786ae - Robyn Ffrancon, 1528213271 : fix: travis build log within memory limit",
        "e440cc1 - Robyn Ffrancon, 1528202882 : enhancement: note added to stdout to explain that read counts include reverse complement",
        "58608bf - Robyn Ffrancon, 1528201130 : enhancement: infer command can produce vcf output if build used vcf + reference"
    ]
}
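For anyone recording which build a benchmark ran against, the version output above can be parsed mechanically. This is a minimal sketch, assuming each `truncated_git_commits` entry has the layout `<short hash> - <author>, <number> : <message>` and that the number is a Unix timestamp in seconds; both are inferred from the output shown here, not from gramtools documentation. The name `parse_commits` is hypothetical.

```python
import json
import re
from datetime import datetime, timezone

# Abridged copy of the `gramtools --version` output shown above.
VERSION_JSON = """
{
    "version_number": "0.5.0",
    "last_git_commit_hash": "12ab776c2452aacd6c8a2248c834907b62b93f66",
    "truncated_git_commits": [
        "12ab776 - Robyn Ffrancon, 1528306963 : enhancement: added kmer size to stdout status message",
        "42018b4 - Robyn Ffrancon, 1528305111 : fix: when generating all kmers, correct ordering assured; increased kmer size threshold: <= 10"
    ]
}
"""

# Assumed entry layout: "<short hash> - <author>, <unix timestamp> : <message>".
ENTRY = re.compile(r"^(?P<sha>[0-9a-f]+) - (?P<author>.+?), (?P<ts>\d+) : (?P<msg>.*)$")


def parse_commits(version_json):
    """Return (version_number, [(sha, author, utc_datetime, message), ...])."""
    info = json.loads(version_json)
    commits = []
    for line in info["truncated_git_commits"]:
        m = ENTRY.match(line)
        if m is None:
            continue  # skip entries that do not follow the assumed layout
        # Assumption: the number is seconds since the Unix epoch.
        when = datetime.fromtimestamp(int(m["ts"]), tz=timezone.utc)
        commits.append((m["sha"], m["author"], when, m["msg"]))
    return info["version_number"], commits


version, commits = parse_commits(VERSION_JSON)
for sha, author, when, msg in commits:
    print(version, sha, when.isoformat(), msg)
```

Under the timestamp assumption, the tip commit 12ab776 dates to 6 June 2018 (UTC).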

Build benchmarks

example command line

gramtools build --gram-directory ./gramk6 --vcf prg_construction/data/pf3k_and_DBPMSPS1and2.vcf --reference reference/Pfalciparum.genome.fasta --max-read-length 150 --kmer-size 6 --debug

| kmer | CPUs | encode PRG (s) | gen. FM index (s) | masks (s) | kmer index (s) | max RAM |
|---|---|---|---|---|---|---|
| 7 | 1 | 2 | 127 | 169 | 2259 (44 mins) | 52 Gb |
| 9 | 1 | 2 | 127 | 168 | 3810 (1h 11 mins) | 79 Gb |
| 10 | 1 | 2 | 128 | 170 | 4704 (1h 24 mins) | 94 Gb |
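A sweep over k-mer sizes like the one tabulated above can be scripted. The sketch below is a hypothetical harness, not how these numbers were collected: it reuses the flags and paths from the example command, and the helper names `build_command` and `run_and_measure` are made up for illustration. It measures wall-clock time and peak resident memory of child processes using Python's Unix-only `resource` module; the per-stage timings in the table would still have to come from the `--debug` output.

```python
import resource
import subprocess
import sys
import time


def build_command(kmer_size):
    """Assemble the example `gramtools build` call for one k-mer size."""
    return [
        "gramtools", "build",
        "--gram-directory", f"./gramk{kmer_size}",
        "--vcf", "prg_construction/data/pf3k_and_DBPMSPS1and2.vcf",
        "--reference", "reference/Pfalciparum.genome.fasta",
        "--max-read-length", "150",
        "--kmer-size", str(kmer_size),
        "--debug",
    ]


def run_and_measure(cmd):
    """Run cmd and report its exit code, wall time and peak child memory.

    ru_maxrss is in kilobytes on Linux (bytes on macOS). With RUSAGE_CHILDREN
    it is the largest value over all children waited for so far, so it is only
    a per-run figure if each run uses more memory than the previous ones.
    """
    start = time.monotonic()
    completed = subprocess.run(cmd, check=False)
    elapsed = time.monotonic() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return {
        "returncode": completed.returncode,
        "wall_seconds": elapsed,
        "max_rss": peak,
    }


# Harmless demonstration with a trivial child process instead of gramtools.
demo = run_and_measure([sys.executable, "-c", "pass"])
print(demo["returncode"], round(demo["wall_seconds"], 3))
```

To reproduce the sweep, one would call `run_and_measure(build_command(k))` for each k in 7, 9 and 10, ideally in ascending order so the running memory peak matches the current run.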

Quasimapping benchmark

example command

gramtools quasimap --gram-directory gramk5/ --reads temp/out.fq.gz --kmer-size 5 --debug --max-threads 10

We are mapping 33,557,648 reads (quality trimmed, length <= 76 bp) sequenced from the GB4 strain.

Reads:

/nfs/research1/zi/projects/gramtools/standard_datasets/pfalciparum/pf3k_release3_cortex_plus_dblmsps/temp/GB4.fastq.trimmed.gz

On old commit, d8a3082a921579e65081fa1932c42c4f2fb7953a

| kmer | CPUs | Load data (sec) | Quasimap (sec) | Human experienced time | Reads/sec/CPU |
|---|---|---|---|---|---|
| 5 | 1 | 37 | 17784 (5 hrs) | 5 hrs | 1887 |
| 5 | 8 | 37 | 29117 (8 hrs) | 1 hr 45 mins | ? |
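The Reads/sec/CPU figure for the single-CPU run is the read count divided by the quasimap time. The sketch below reproduces it and shows how the missing 8-CPU entry could be estimated. It rests on an assumption not stated in the issue: since the 8-CPU quasimap time (29117 s) exceeds the wall-clock time (about 1 hr 45 mins), it is treated here as CPU time summed over threads. The function names are hypothetical.

```python
READS = 33_557_648  # reads mapped, from the text above


def reads_per_sec_per_cpu(n_reads, cpu_seconds):
    """Throughput per CPU, taking cpu_seconds as time summed over all CPUs."""
    return n_reads / cpu_seconds


def speedup(wall_seconds_single, wall_seconds_parallel):
    """Wall-clock speedup of the parallel run over the single-CPU run."""
    return wall_seconds_single / wall_seconds_parallel


# 1 CPU: matches the tabulated value.
one_cpu = reads_per_sec_per_cpu(READS, 17784)
print(round(one_cpu))  # 1887

# 8 CPUs: only meaningful if 29117 s really is summed CPU time (assumption).
eight_cpu = reads_per_sec_per_cpu(READS, 29117)

# Wall-clock comparison: 17784 s on 1 CPU vs roughly 1 hr 45 mins on 8 CPUs.
wall_8 = 1 * 3600 + 45 * 60
s = speedup(17784, wall_8)
efficiency = s / 8  # fraction of ideal 8x scaling

print(f"8-CPU estimate: {eight_cpu:.0f} reads/sec/CPU")
print(f"speedup {s:.2f}x, parallel efficiency {efficiency:.2f}")
```

If the assumption about the Quasimap column is wrong, the per-CPU estimate for the 8-CPU run does not hold; the wall-clock speedup depends only on the "Human experienced time" column, which is itself approximate.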

On current commit 12ab776c2452aacd6c8a2248c834907b62b93f66

| kmer | CPUs | Load data (sec) | Quasimap (sec) | Human experienced time | Reads/sec/CPU |
|---|---|---|---|---|---|
| 10 | 8 | 43 | 139639 (38.8 hrs) | 7 hrs 10 mins | ? |

iqbal-lab commented 6 years ago

Rerunning the above with new tip (12ab776c2452aacd6c8a2248c834907b62b93f66) - will replace all the above numbers

bricoletc commented 5 years ago

Benchmarking now from 8b46a86a

Build (using --all-kmers)

Data in (yoda cluster):

Vcf: /nfs/leia/research/iqbal/bletcher/Pf_benchmark/ref_data/pf3k_and_DBPMSPS1and2.vcf

Ref: /nfs/leia/research/iqbal/bletcher/Pf_benchmark/ref_data/Pfalciparum.genome.fasta

| kmer | num_kmers | CPUs | encode PRG (s) | gen. FM index (s) | masks (s) | kmer index (s) | max RAM |
|---|---|---|---|---|---|---|---|
| 11 | 4194304 | 1 | 2 | 131 | 182 | 5927 (1h 38 mins) | 101.4 Gb |

bricoletc commented 4 years ago

I will now close this issue and open a fresh one.

The reason is that we used to build the PRG string using Zam's Perl module, which skipped any overlapping variants in the VCF.

Now we use cluster_vcf_records which does deal with them.

The consequence is that the constructed PRG string is much bigger.

In 8b46a86 the PRG string for this benchmarking dataset contains 79,687,317 distinct integers (DNA + variant markers), whereas in a6e9094, which introduces the change, there are 457,513,813. This is 5.7 times bigger.
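The growth factor follows directly from the two counts quoted above; a quick check:

```python
# PRG string sizes quoted above, in integers (DNA + variant markers).
size_8b46a86 = 79_687_317   # overlapping variants skipped
size_a6e9094 = 457_513_813  # overlapping variants handled by cluster_vcf_records

ratio = size_a6e9094 / size_8b46a86
print(f"{ratio:.2f}")  # 5.74
```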

I'll now run benchmarks on this

iqbal-lab commented 4 years ago

Go Brice!