xiamaz closed 3 months ago
dataset | variants | time (total cpu time) | execution type |
---|---|---|---|
NA-12878 | 4,921,200 | 3:29 min | hg37, vcf.gz output, effect+gnomad |
This is approx 23546 variants/s.
Other runs use the same setup unless otherwise specified.
CPU: i9-12900K RAM: 128GB DDR4 Storage: Samsung SSD 970 EVO Plus 2TB (mehari db) + Samsung SSD 990 Pro 2TB (mehari)
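The throughput figure above can be reproduced from the table with a few lines of shell. This is only a sketch: it assumes the times in the tables are given as `minutes:seconds`, and the function names `to_seconds` and `throughput` are made up for illustration.

```shell
# Convert a "minutes:seconds" time (e.g. 3:29 or 39:37.68) into seconds.
to_seconds() {
  awk -v t="$1" 'BEGIN { split(t, p, ":"); printf "%.2f\n", p[1] * 60 + p[2] }'
}

# Variants per second for a variant count and a "minutes:seconds" time,
# truncated to an integer.
throughput() {
  awk -v n="$1" -v s="$(to_seconds "$2")" 'BEGIN { printf "%d\n", n / s }'
}

throughput 4921200 3:29   # prints 23546
```

The same function can be applied to the times reported for the other tools to put all runs on a common variants/s scale.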
dataset | variants | time (total cpu time) | execution type |
---|---|---|---|
NA-12878 | 4,921,200 | 39:37.68 min | hg37, tsv output |
time ./vep --cache --dir ../vep-data --offline --refseq -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o test.tsv
Use VEP from git. Download the indexed VEP cache for GRCh37:
$ curl -O https://ftp.ensembl.org/pub/release-111/variation/indexed_vep_cache/homo_sapiens_refseq_vep_111_GRCh37.tar.gz
Download the FASTA for GRCh37:
$ wget https://ftp.ensembl.org/pub/grch37/current/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz
dataset | variants | time (total cpu time) | execution type |
---|---|---|---|
NA-12878 | 4,921,200 | 0:52.368 min | hg37, vcf output, effect only |
NA-12878 | 4,921,200 | 3:45.14 min | hg37, gts output, gnomad+effect using varfish-annotator |
time jannovar annotate-vcf -d data/refseq_curated_105_hg19.ser -i ~/Git/mehari/misc/benchmark/NA-12878.hg19.norm.vcf.gz -o test.vcf
varfish-annotator \
annotate \
-XX:MaxHeapSize=10g \
-XX:+UseG1GC \
--release GRCh37 \
--self-test-chr1-only \
--ref-path ../vep/vep-data/homo_sapiens_refseq/111_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa \
--db-path ./prod_data/20210728/varfish-annotator-db-20210728b-grch37.h2.db \
--refseq-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/refseq_curated_105_hg19.ser \
--ensembl-ser-path ./prod_data/20210728/jannovar-db-20210728-grch37/ensembl_87_hg19.ser \
--input-vcf ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz \
--output-db-info testvarfish.info \
--output-gts testvarfish.gts
Use the precompiled conda packages of jannovar and varfish-annotator:
$ micromamba create -n jannovar jannovar-cli varfish-annotator-cli
$ micromamba activate jannovar
Download the serialized database (curated RefSeq):
$ wget "https://zenodo.org/record/5410367/files/refseq_curated_105_hg19.ser?download=1"
The prebuilt jannovar database does not seem to work with GRCh37-style chromosome names (1, 2, ..., MT).
Use the following commands to rename the chromosomes to hg19-style names (only required for jannovar):
$ cat > chr_map.txt <<- EOF
1 chr1
2 chr2
3 chr3
4 chr4
5 chr5
6 chr6
7 chr7
8 chr8
9 chr9
10 chr10
11 chr11
12 chr12
13 chr13
14 chr14
15 chr15
16 chr16
17 chr17
18 chr18
19 chr19
20 chr20
21 chr21
22 chr22
X chrX
Y chrY
MT chrM
EOF
$ bcftools annotate --rename-chrs chr_map.txt NA-12878.norm.vcf.gz -o NA-12878.hg19.norm.vcf.gz -Oz9
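As an alternative to typing out the heredoc, the same mapping file can be generated with a short loop. This is a sketch; the function name `make_chr_map` is hypothetical, and the output is meant to match the `chr_map.txt` contents listed above.

```shell
# Emit "<GRCh37 name> <hg19 name>" pairs for chromosomes 1-22, X and Y,
# then the mitochondrial contig, which is named differently (MT vs. chrM).
make_chr_map() {
  for c in $(seq 1 22) X Y; do
    printf '%s chr%s\n' "$c" "$c"
  done
  printf 'MT chrM\n'
}

make_chr_map > chr_map.txt
```

The resulting file can then be passed to `bcftools annotate --rename-chrs` exactly as in the command above.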
dataset | variants | time (total cpu time) | execution type |
---|---|---|---|
NA-12878 | 4,921,200 | 14:20.80 min | hg37, effect+gnomad |
oc run ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -l hg19 -a gnomad -d testout
Install open-cravat in a new conda env:
$ micromamba create -n open-cravat open-cravat
$ oc module install-base
$ oc module install gnomad
dataset | variants | time (total cpu time) | execution type |
---|---|---|---|
NA-12878 | 4,921,200 | 3:56.91 min | hg37, json.gz output, effect+gnomad |
$ time dotnet Nirvana-v3.18.1/Nirvana.dll -c Data/Cache/GRCh37/Both --sd Data/SupplementaryAnnotation/GRCh37 -r Data/References/Homo_sapiens.GRCh37.Nirvana.dat -i ~/Git/mehari/misc/benchmark/NA-12878.norm.vcf.gz -o NA-12878
Use the precompiled binaries (dotnet 6.0).
Initial run for all annotators finished.
More extensive comparison work will move to https://github.com/varfish-org/annotation-zoo.
Is your feature request related to a problem? Please describe. We need to have an idea how mehari compares performance-wise to other tools. This is also useful to avoid performance regressions when refactoring/applying changes to the existing code.
Describe the solution you'd like We need to define some test data files and a realistic database, then extract performance numbers. These steps should be easily repeatable for new versions and for changes to the databases. The input files to be annotated do not need to change.
Describe alternatives you've considered None
Additional context Any tools/scripts developed should be stored in this repo. Anything too large to be stored directly should be put in a public, easily accessible place.
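One way to keep the timing steps repeatable is a small wrapper that runs a command and appends its elapsed time to a results file. The following is a minimal sketch, not an existing script in this repo: the names `bench`, `BENCH_RESULTS` and `bench_results.tsv` are assumptions, and `date +%s` only gives one-second resolution.

```shell
# Run a command and append "<label>\t<elapsed seconds>\t<exit status>"
# to the results file (hypothetical default: bench_results.tsv).
bench() {
  label="$1"; shift
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  status=$?
  end=$(date +%s)
  printf '%s\t%s\t%s\n' "$label" "$((end - start))" "$status" \
    >> "${BENCH_RESULTS:-bench_results.tsv}"
  return "$status"
}
```

For example, `bench jannovar jannovar annotate-vcf -d data/refseq_curated_105_hg19.ser -i NA-12878.hg19.norm.vcf.gz -o test.vcf` would record one row for the jannovar run; rerunning the same line after a version or database change adds a comparable row.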
Nirvana, at least, comments on performance issues when accessing DBs via docker volumes. This needs to be addressed by (1) benchmarking all tools without docker or additional virtualization and (2) analyzing the performance impact of virtualization and whether it differs significantly between the most important tools (mehari, nirvana, vep).
Most tools provide native output formats (tsv, json) in addition to VCF. Whether the output format causes significant performance differences needs to be evaluated.
As a baseline, mehari should at least be benchmarked with both hg37- and hg38-based databases.
TODOs