iqbal-lab-org / gramtools

Genome inference from a population reference graph
MIT License

Benchmarking basic MHC (8haplos): build and quasimap #125

Open iqbal-lab opened 6 years ago

iqbal-lab commented 6 years ago

The human genome is 3 billion bases and the MHC is 5 Mb long. This PRG contains just the 8 reference MHC haplotypes and no other variation, so 99.8% of the genome has no variation. The final number in the PRG alphabet is 23690.
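As a quick sanity check of the 99.8% figure, here is a minimal sketch using only the round numbers quoted above (3 billion bases, 5 Mb of MHC). The constant and function names are illustrative and are not part of gramtools.

```python
# Round figures quoted in this issue; both are approximations.
GENOME_BASES = 3_000_000_000  # human genome size
MHC_BASES = 5_000_000         # MHC region length


def invariant_fraction(genome_bases: int, variable_bases: int) -> float:
    """Fraction of the genome lying outside the variable region."""
    return 1 - variable_bases / genome_bases


fraction = invariant_fraction(GENOME_BASES, MHC_BASES)
print(f"{fraction:.1%} of the genome has no variation in this PRG")
```

The same fraction is roughly the share of reads one would expect to miss the kmer index, assuming reads are spread evenly over the genome.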

gramtools version

{
    "version_number": "0.5.0",
    "last_git_commit_hash": "d8a3082a921579e65081fa1932c42c4f2fb7953a",
    "truncated_git_commits": [
        "d8a3082 - Robyn Ffrancon, 1527688551 : enhancement: build command optionally skips building PRG",
        "2dac562 - Robyn Ffrancon, 1527601335 : enhancement: quasimap commands ensures that build command executed successfully",
        "760b759 - Robyn Ffrancon, 1527599820 : enhancement: build stops and returns non-zero if no variants sites found in prg",
        "f3b8cff - Robyn Ffrancon, 1527597315 : enhancment: removed unused skip optimisation code",
        "e22cd4f - Robyn Ffrancon, 1527590325 : fix: SA indexes associated with correct site-allele paths for allele encapsulated mappings"
    ]
}

Build benchmarks

I'll start on the cluster, which means using shared machines, since my benchmarking machine is fully occupied with P. falciparum benchmarks.

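| kmer | CPUs | encode PRG (sec) | generate FM index (sec) | masks (sec) | kmer index (sec) | Total human experienced time | max RAM |
|---|---|---|---|---|---|---|---|
| 5 | 1 | 1.4 | 105.5 | 74 | 20 | 3 mins | 350Mb |
| 7 | 1 | 4 | 144 | 85 | 350 | 10 mins | 374Mb |
| 9 | 1 | 1 | 109 | 71 | 45 | 4 mins | 400Mb |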
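The total build time should be roughly the sum of the per-stage timings. A small sketch to check that, assuming the four per-stage values in each row are encode PRG, FM index, masks and kmer index, in that order (the dictionary and function names are made up for this check):

```python
# Per-stage build timings in seconds, copied from the build table.
# Assumed order: encode PRG, generate FM index, masks, kmer index.
BUILD_STAGES_SEC = {
    5: (1.4, 105.5, 74, 20),
    7: (4, 144, 85, 350),
    9: (1, 109, 71, 45),
}


def total_minutes(stage_seconds):
    """Sum the per-stage timings and convert seconds to minutes."""
    return sum(stage_seconds) / 60


build_minutes = {kmer: total_minutes(secs) for kmer, secs in BUILD_STAGES_SEC.items()}
for kmer, minutes in build_minutes.items():
    print(f"k={kmer}: {minutes:.1f} min")
```

Under that assumption the sums round to 3, 10 and 4 minutes, which agrees with the reported wall-clock times. The kmer index step is the part that varies most between kmer sizes.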

Quasimap benchmarks

The vast majority of reads (99.8%) are irrelevant and will be discarded immediately because they don't hit the kmer index. The input is a huge FASTQ of NA12878 reads, ~747.5 million reads in total.

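| kmer | CPUs | Load data (sec) | Quasimap (sec) | Human exp time | Reads/sec/CPU | Mapped reads | Mem | Comments |
|---|---|---|---|---|---|---|---|---|
| 5 | 1 | 37 | ? | ? | ? | ? | ? | ? |
| 7 | 1 | 37 | ? | ? | ? | ? | ? | ? |
| 9 | 1 | 28 | 30279 | 8 hrs 43 mins | 24686 | 154141 | 1.8Mb | untrimmed reads |
| 9 | 1 | 39 | 104358 | 29 hours | 7162 | | 1.8Mb | trimmed reads |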
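For reference, the throughput column can be reproduced from the read count and the quasimap time. This sketch assumes Reads/sec/CPU means total reads divided by (quasimap seconds × CPUs), and that the trimmed run used the same ~747.5 million reads; neither is stated explicitly above, and the names are illustrative only.

```python
TOTAL_READS = 747_500_000  # approximate read count quoted above


def reads_per_sec_per_cpu(total_reads, quasimap_sec, cpus=1):
    """Throughput, assuming it is defined as reads / (seconds * CPUs)."""
    return total_reads / (quasimap_sec * cpus)


# Quasimap timings for the two k=9 runs in the quasimap table.
untrimmed = reads_per_sec_per_cpu(TOTAL_READS, 30279)
trimmed = reads_per_sec_per_cpu(TOTAL_READS, 104358)

# Share of all reads that mapped in the untrimmed run.
mapped_fraction = 154141 / TOTAL_READS

print(f"untrimmed: {untrimmed:.0f} reads/sec/CPU")
print(f"trimmed:   {trimmed:.0f} reads/sec/CPU")
print(f"mapped:    {mapped_fraction:.4%} of reads")
```

Both throughput values come out within a read or two per second of the table, so the 747.5 million figure and the timings are consistent with each other.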