samtools / htslib

C library for high-throughput sequencing data formats

Version 1.18 of tabix exhibits poorer performance in querying locus information compared to version 1.16. #1661

Closed TuBieJun closed 10 months ago

TuBieJun commented 10 months ago

Hi, I have observed that tabix version 1.18 is slower at querying locus information than version 1.16. Here are my command and benchmark results:

image

Repeating this many times gives the same result: image

The command used to create the index is:

/data/users/liteng/my_software/bcftools-1.18/htslib-1.18/bin/tabix -s1 -b2 -e2 -C -f dbsnp_155_5cols_format_sorted.txt.gz

My file looks like this:
image

And the query region file looks like this: image

Is this due to improper usage on my part, or is there some unknown issue in version 1.18? The background for this benchmark is that we aim to use tabix and BCFtools to develop a cloud-based application that can efficiently retrieve user genotypes by rsID from BCF files, so we are somewhat sensitive to this performance difference.

jkbonfield commented 10 months ago

I've tried and cannot reproduce this. In fact, 1.18 is faster than 1.16 for me. Were all your binaries built with the same compiler and compiler version, and with the same options?

Also, are you using -T or -R for queries? Have you tested speeds on both? At some point -T becomes faster than -R (once the density of hits is sufficient). This feels wrong: -R ought to asymptotically approach the speed of -T rather than fall behind it. It implies the index-jumping option (-R) is unnecessarily decoding things multiple times. We fixed this in the multi-region iterator for SAM and BAM, but I guess tabix has its own iterators. That's a different issue though, and not related to 1.16 vs 1.18.
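
For anyone wanting to check both points above (1.16 vs 1.18, and -R vs -T), a small timing harness might look like the sketch below. It is only an illustration, not something taken from this thread: the function names are made up for this example, and it assumes each tabix build is invoked as `tabix -R regions data.gz` or `tabix -T regions data.gz` against an already indexed file. It reports the median wall-clock time over several runs, so a cold page cache on the first run has less influence.

```python
import statistics
import subprocess
import time


def build_query_cmd(tabix_bin, data_gz, regions_file, mode):
    """Return the argv for one tabix query.

    mode is "R" (index jumping) or "T" (streaming through the file).
    """
    if mode not in ("R", "T"):
        raise ValueError("mode must be 'R' or 'T'")
    return [tabix_bin, f"-{mode}", regions_file, data_gz]


def time_cmd(argv, repeats=5):
    """Run argv `repeats` times, discard its output, return seconds per run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        timings.append(time.perf_counter() - start)
    return timings


def compare(binaries, data_gz, regions_file, repeats=5):
    """Return {(label, mode): median seconds} for each binary and both modes.

    `binaries` maps a label such as "1.16" to the path of that tabix build.
    """
    results = {}
    for label, tabix_bin in binaries.items():
        for mode in ("R", "T"):
            argv = build_query_cmd(tabix_bin, data_gz, regions_file, mode)
            results[(label, mode)] = statistics.median(time_cmd(argv, repeats))
    return results
```

Calling `compare({"1.16": path_to_1_16_tabix, "1.18": path_to_1_18_tabix}, "dbsnp_155_5cols_format_sorted.txt.gz", regions_file)` with the real binary paths and region file would give one median per version and mode. Both binaries should be built with the same compiler and options for the comparison to mean anything.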

TuBieJun commented 10 months ago

Alright, I'll go and confirm the versions and parameters of the compilation tools for versions 1.16 and 1.18. Anyway, thx!