varfish-org / mehari

VEP-like tool for sequence ontology and HGVS annotation of VCF files
MIT License

benchmark mehari #422

Closed: xiamaz closed this issue 3 months ago

xiamaz commented 3 months ago

Is your feature request related to a problem? Please describe.

We need an idea of how mehari compares performance-wise to other tools. This is also useful for avoiding performance regressions when refactoring or otherwise changing the existing code.

Describe the solution you'd like

We need to define some test data files and a realistic database, and then extract performance numbers. These steps should be easily repeatable for new versions and for changes to the databases. The input files to be annotated do not need to change.

Describe alternatives you've considered

None

Additional context

Any tools/scripts developed should be stored in this repo. Anything too large to be stored directly should be put in a public, easily accessible place.

Nirvana, at least, comments on performance issues when accessing DBs via docker volumes. This needs to be addressed by:

1. benchmarking all tools without docker or additional virtualization, and
2. analyzing the performance impact of virtualization and whether it differs significantly between the most important tools (mehari, nirvana, vep).
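One way to compare native and containerised runs is to time the same annotation command under both setups. The following is only a minimal sketch and not part of this repository: `time_cmd` is a hypothetical helper, it measures wall-clock time at one-second resolution, and the image name and mount paths in the commented docker call are placeholders.

```sh
# time_cmd LABEL CMD [ARGS...]
# Hypothetical helper: runs CMD once, discards its output, and prints
# "LABEL<TAB>elapsed seconds<TAB>exit status" (wall-clock, 1 s resolution).
time_cmd() {
    label=$1
    shift
    start=$(date +%s)
    # Capture the exit status without aborting when the shell runs with `set -e`.
    "$@" >/dev/null 2>&1 && status=0 || status=$?
    end=$(date +%s)
    printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status"
}

# Example with a stand-in command; replace `sleep 1` with the real annotator call.
time_cmd native sleep 1

# A containerised run would be timed the same way.
# IMAGE and the mount paths are placeholders, not real names:
# time_cmd docker docker run --rm -v "$PWD/db:/db" IMAGE <annotator command>
```

Running each variant several times and comparing the recorded times would show whether the volume overhead differs between tools.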

Most tools provide native output formats (tsv, json) in addition to VCF. Whether the choice of output format causes significant performance differences needs to be evaluated.

As a baseline, mehari should at least be benchmarked with both hg37- and hg38-based databases.

TODOs

xiamaz commented 3 months ago

Mehari results

| dataset | variants | time (total cpu time) | execution type |
|---|---|---|---|
| NA-12878 | 4,921,200 | 3:29 min | hg37, vcf.gz output, effect+gnomad |

This is approximately 23,546 variants/s (4,921,200 variants in 209 s).
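The throughput figure can be reproduced from the variant count and the reported time. This is a small sketch using `awk`; the function names `to_seconds` and `variants_per_second` are made up for illustration and are not part of any tool mentioned here.

```sh
# to_seconds DURATION: convert "[h:]m:s" (seconds may be fractional) to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        n = split(t, part, ":")
        total = 0
        for (i = 1; i <= n; i++) total = total * 60 + part[i]
        printf "%.2f\n", total
    }'
}

# variants_per_second COUNT DURATION: whole variants processed per second.
variants_per_second() {
    awk -v count="$1" -v secs="$(to_seconds "$2")" \
        'BEGIN { printf "%d\n", count / secs }'
}

variants_per_second 4921200 3:29   # prints 23546
```

The same function accepts the fractional times reported by the other tools, for example `variants_per_second 4921200 39:37.68`.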

Test system

The other benchmarks use the same system unless otherwise specified.

CPU: i9-12900K
RAM: 128GB DDR4
Storage: Samsung SSD 970 EVO Plus 2TB (mehari db) + Samsung SSD 990 Pro 2TB (mehari)

xiamaz commented 3 months ago

VEP results

| dataset | variants | time (total cpu time) | execution type |
|---|---|---|---|
| NA-12878 | 4,921,200 | 39:37.68 min | hg37, tsv output |

Executed command

$ time ./vep --cache --dir ../vep-data --offline --refseq -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o test.tsv

Setup notes

Use VEP from its git repository. Download the indexed VEP cache for GRCh37:

$ curl -O https://ftp.ensembl.org/pub/release-111/variation/indexed_vep_cache/homo_sapiens_refseq_vep_111_GRCh37.tar.gz

Download the FASTA reference for GRCh37:

$ wget https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz

xiamaz commented 3 months ago

Jannovar/Varfish annotator results

| dataset | variants | time (total cpu time) | execution type |
|---|---|---|---|
| NA-12878 | 4,921,200 | 0:52.368 min | hg37, vcf output, effect only |
| NA-12878 | 4,921,200 | 3:45.14 min | hg37, gts output, gnomad+effect using varfish-annotator |

Executed command

$ time jannovar annotate-vcf -d data/refseq_curated_105_hg19.ser -i ~/Git/mehari/misc/benchmark/NA-12878.hg19.norm.vcf.gz -o test.vcf
$ varfish-annotator \
    annotate \
    -XX:MaxHeapSize=10g \
    -XX:+UseG1GC \
    --release GRCh37 \
    --self-test-chr1-only \
    --ref-path ../vep/vep-data/homo_sapiens_refseq/111_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa \
    --db-path ./prod_data/20210728/varfish-annotator-db-20210728b-grch37.h2.db \
    --refseq-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/refseq_curated_105_hg19.ser \
    --ensembl-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/ensembl_87_hg19.ser \
    --input-vcf ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz \
    --output-db-info testvarfish.info \
    --output-gts testvarfish.gts

Setup notes

Use the precompiled conda packages for jannovar and varfish-annotator:

$ micromamba create -n jannovar jannovar-cli varfish-annotator-cli
$ micromamba activate jannovar

Download the serialized database (curated RefSeq):

$ wget "https://zenodo.org/record/5410367/files/refseq_curated_105_hg19.ser?download=1"

The prebuilt jannovar database does not seem to work with GRCh37-style chromosome names (1, 2, ..., MT).

Use the following commands to rename the chromosomes in our input to chr-prefixed names (only required for jannovar):

$ cat > chr_map.txt <<- EOF
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
X chrX
Y chrY
MT chrM
EOF
$ bcftools annotate --rename-chrs chr_map.txt NA-12878.norm.vcf.gz -o NA-12878.hg19.norm.vcf.gz -Oz9
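The mapping file can also be generated rather than typed out. A minimal sketch, assuming the same 25 contigs as listed above; `make_chr_map` is a hypothetical helper name:

```sh
# make_chr_map: print the chromosome name mapping, one "old new" pair
# per line, in the format read by `bcftools annotate --rename-chrs`.
make_chr_map() {
    i=1
    while [ "$i" -le 22 ]; do
        printf '%s chr%s\n' "$i" "$i"
        i=$((i + 1))
    done
    printf 'X chrX\nY chrY\n'
    # The mitochondrial contig is the one irregular case: MT becomes chrM.
    printf 'MT chrM\n'
}

make_chr_map > chr_map.txt
```

The resulting `chr_map.txt` has the same content as the heredoc version and can be passed to the same `bcftools annotate` command.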

xiamaz commented 3 months ago

open-cravat results

| dataset | variants | time (total cpu time) | execution type |
|---|---|---|---|
| NA-12878 | 4,921,200 | 14:20.80 min | hg37, effect+gnomad |

Executed command

$ oc run ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -l hg19 -a gnomad -d testout

Setup notes

Install open-cravat in a new conda env:

$ micromamba create -n open-cravat open-cravat
$ micromamba activate open-cravat
$ oc module install-base
$ oc module install gnomad

xiamaz commented 3 months ago

Nirvana results

| dataset | variants | time (total cpu time) | execution type |
|---|---|---|---|
| NA-12878 | 4,921,200 | 3:56.91 min | hg37, json.gz output, effect+gnomad |

Executed command

$ time dotnet Nirvana-v3.18.1/Nirvana.dll -c Data/Cache/GRCh37/Both --sd Data/SupplementaryAnnotation/GRCh37 -r Data/References/Homo_sapiens.GRCh37.Nirvana.dat -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o NA-12878

Setup notes

Use the precompiled binaries with .NET 6.0.

xiamaz commented 3 months ago

Initial run for all annotators finished.

More extensive comparison work will move to https://github.com/varfish-org/annotation-zoo.