benchmark mehari

xiamaz commented 3 months ago

Is your feature request related to a problem? Please describe.

We need an idea of how mehari compares performance-wise to other tools. This is also useful for avoiding performance regressions when refactoring or changing the existing code.

Describe the solution you'd like

We need to define some test data files and a realistic database, and extract performance numbers. These steps should be easily repeatable for new versions and for changes to the databases. The input files to be annotated do not need to change.

Describe alternatives you've considered

None

Additional context

Any tools or scripts developed should be stored in this repo. Anything too large to be stored directly should be put in a public, easily accessible place.

Nirvana, at least, comments on performance issues when accessing DBs via docker volumes. This needs to be addressed by:

1. benchmarking all tools without docker or additional virtualization, and
2. analyzing performance issues under virtualization and whether these differ significantly between the most important tools (mehari, nirvana, vep).

Most tools provide native output formats (tsv, json) in addition to VCF. Whether the output format causes significant performance differences needs to be evaluated.

As a baseline, mehari should at least be benchmarked with both hg37- and hg38-based databases.
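The repeatability requirement above could be met with a small timing wrapper kept in this repo. The following is only a sketch under assumptions: `bench` and `results.tsv` are hypothetical names, and it records wall-clock seconds via `date +%s`, which may differ from the CPU time reported in the tables of this thread.

```shell
# Hypothetical helper: run a command N times and append one TSV row per run
# (label, run number, elapsed wall-clock seconds) to an output file.
bench() {
    label=$1; runs=$2; out=$3
    shift 3                      # the remaining arguments are the command to time
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s)
        "$@"                     # run the annotator exactly as given
        end=$(date +%s)
        printf '%s\t%s\t%s\n' "$label" "$i" "$((end - start))" >> "$out"
        i=$((i + 1))
    done
}

# Example (command taken from the jannovar run in this thread):
# bench jannovar 3 results.tsv jannovar annotate-vcf \
#     -d data/refseq_curated_105_hg19.ser -i NA-12878.hg19.norm.vcf.gz -o test.vcf
```

Appending to a plain TSV keeps results from different tool versions and database releases in one place, so regressions can be spotted by comparing rows.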

TODOs

[x] rerun jannovar with gnomad annotation

xiamaz commented 3 months ago

Mehari results

dataset	variants	time (total cpu time)	execution type
NA-12878	4,921,200	3:29min	hg37, vcf.gz output, effect+gnomad

This is approx 23546 variants/s.
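For reference, the throughput figure follows from dividing the variant count by the runtime in seconds (3:29 min = 209 s). A minimal sketch of that arithmetic, where `variants_per_second` is a hypothetical helper name and the result is truncated to a whole number:

```shell
# Convert [h:]mm:ss[.fff] to seconds and divide the variant count by it.
variants_per_second() {
    variants=$1; elapsed=$2
    awk -v n="$variants" -v t="$elapsed" 'BEGIN {
        k = split(t, p, ":")
        s = 0
        for (j = 1; j <= k; j++) s = s * 60 + p[j]   # fold h:m:s into seconds
        printf "%d\n", n / s
    }'
}

variants_per_second 4921200 3:29   # prints 23546
```

The same helper accepts the fractional timings reported for the other tools, for example `39:37.68`.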

Test system

Other benchmarks use the same system unless otherwise specified.

CPU: i9-12900K
RAM: 128GB DDR4
Storage: Samsung SSD 970 EVO Plus 2TB (mehari db) + Samsung SSD 990 Pro 2TB (mehari)

xiamaz commented 3 months ago

VEP Results

dataset	variants	time (total cpu time)	execution type
NA-12878	4,921,200	39:37.68 min	hg37, tsv output

Executed command

time ./vep --cache --dir ../vep-data --offline --refseq -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o test.tsv

Setup notes

Use VEP from git. Download the indexed VEP cache for GRCh37:

$ curl -O https://ftp.ensembl.org/pub/release-111/variation/indexed_vep_cache/homo_sapiens_refseq_vep_111_GRCh37.tar.gz

Download the FASTA for GRCh37:

$ wget https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

xiamaz commented 3 months ago

Jannovar/Varfish annotator results

dataset	variants	time (total cpu time)	execution type
NA-12878	4,921,200	0:52.368 min	hg37, vcf output, effect only
NA-12878	4,921,200	3:45.14 min	hg37, gts output, gnomad+effect using varfish-annotator

Executed command

time jannovar annotate-vcf -d data/refseq_curated_105_hg19.ser -i ~/Git/mehari/misc/benchmark/NA-12878.hg19.norm.vcf.gz -o test.vcf

varfish-annotator \
    annotate \
    -XX:MaxHeapSize=10g \
    -XX:+UseG1GC \
    --release GRCh37 \
    --self-test-chr1-only \
    --ref-path ../vep/vep-data/homo_sapiens_refseq/111_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa \
    --db-path ./prod_data/20210728/varfish-annotator-db-20210728b-grch37.h2.db \
    --refseq-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/refseq_curated_105_hg19.ser \
    --ensembl-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/ensembl_87_hg19.ser \
    --input-vcf ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz \
    --output-db-info testvarfish.info \
    --output-gts testvarfish.gts

Setup notes

Use conda version of precompiled jannovar:

$ micromamba create -n jannovar jannovar-cli varfish-annotator-cli
$ micromamba activate jannovar

Download serialized database (curated refseq):

$ wget "https://zenodo.org/record/5410367/files/refseq_curated_105_hg19.ser?download=1"

The prebuilt jannovar database does not seem to work with GRCh37-style chromosome names (1, 2, ..., MT).

Use the following commands to rename the chromosomes to hg19-style names (only required for jannovar):

$ cat > chr_map.txt <<- EOF
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
X chrX
Y chrY
MT chrM
EOF
$ bcftools annotate --rename-chrs chr_map.txt NA-12878.norm.vcf.gz -o NA-12878.hg19.norm.vcf.gz -Oz9
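As an alternative to typing out the mapping by hand, the same `chr_map.txt` could be generated with a short loop. This is only a sketch; `make_chr_map` is a hypothetical helper name, and the output is intended to match the here-document above line for line.

```shell
# Emit "GRCh37-name hg19-name" pairs for chromosomes 1-22, X, Y and MT.
make_chr_map() {
    i=1
    while [ "$i" -le 22 ]; do
        printf '%s chr%s\n' "$i" "$i"
        i=$((i + 1))
    done
    # MT is the one contig whose hg19 name is not simply "chr" + name.
    printf 'X chrX\nY chrY\nMT chrM\n'
}

make_chr_map > chr_map.txt
```

The resulting file can then be passed to `bcftools annotate --rename-chrs` exactly as in the command above.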

xiamaz commented 3 months ago

open-cravat results

dataset	variants	time (total cpu time)	execution type
NA-12878	4,921,200	14:20.80 min	hg37, effect+gnomad

Executed command

oc run ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -l hg19 -a gnomad -d testout

Setup notes

Install open-cravat in a new conda env:

$ micromamba create -n open-cravat open-cravat
$ micromamba activate open-cravat
$ oc module install-base
$ oc module install gnomad

xiamaz commented 3 months ago

Nirvana results

dataset	variants	time (total cpu time)	execution type
NA-12878	4,921,200	3:56.91 min	hg37, json.gz output, effect+gnomad

Executed command

$ time dotnet Nirvana-v3.18.1/Nirvana.dll -c Data/Cache/GRCh37/Both --sd Data/SupplementaryAnnotation/GRCh37 -r Data/References/Homo_sapiens.GRCh37.Nirvana.dat -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o NA-12878

Setup notes

Use the precompiled binaries with .NET 6.0.

xiamaz commented 3 months ago

Initial run for all annotators finished.

More extensive comparison work will move to https://github.com/varfish-org/annotation-zoo.

varfish-org / mehari

benchmark mehari #422
