samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/

Can HTSJDK use a VCF index to quickly count total records in a VCF? #1586

Open bbimber opened 2 years ago

bbimber commented 2 years ago

Hello,

When working with a large VCF, iterating over all features to determine the total variant count is slow. Can HTSJDK use a VCF index to quickly count the total records in a VCF?

Thanks

cmnbroad commented 2 years ago

Someone else may have a more definitive answer, but I think the linear index part of a Tribble index (.idx) has that information, per-chromosome. I don't think tabix does.

lindenb commented 2 years ago

@cmnbroad well it should be possible as you can get this information with bcftools index -s in.vcf.gz

bbimber commented 2 years ago

Exactly. I also didn't know this was possible, but bcftools apparently can do it. It would be very useful to be able to get the variant count like this for big files.

yfarjoun commented 2 years ago

https://www.biostars.org/p/166414/#166448

bcf doesn't use a "pure" tabix index....

lindenb commented 2 years ago

@yfarjoun with a recent version of bcftools, I'm able to extract the number of variants/chrom with a tbi index and bcftools index -s.
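
Until HTSJDK exposes this directly, one workaround is to shell out to bcftools and sum its per-contig output. The following is a minimal sketch, assuming `bcftools` is on the PATH and that `bcftools index -s` prints tab-separated lines of contig name, contig length (`.` if unknown), and record count. The class and method names (`BcftoolsIndexStats`, `totalRecords`, `countWithBcftools`) are hypothetical and not part of HTSJDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

public class BcftoolsIndexStats {

    /** Sums the third column (record count) of `bcftools index -s` output lines. */
    public static long totalRecords(List<String> lines) {
        long total = 0;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // tolerate a trailing blank line
            }
            String[] fields = line.split("\t");
            if (fields.length < 3) {
                throw new IllegalArgumentException("Unexpected line: " + line);
            }
            total += Long.parseLong(fields[2].trim());
        }
        return total;
    }

    /** Runs `bcftools index -s <vcfPath>` (assumed to be installed) and returns the summed count. */
    public static long countWithBcftools(String vcfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("bcftools", "index", "-s", vcfPath)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // let bcftools errors reach the console
                .start();
        long total;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            total = totalRecords(reader.lines().collect(Collectors.toList()));
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("bcftools exited with code " + exitCode);
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countWithBcftools(args[0]));
    }
}
```

The parsing is kept separate from the process call so it can be checked without bcftools installed. Contigs with zero records are not listed by bcftools, which does not affect the sum.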

yfarjoun commented 2 years ago

I'm wondering: did you make the index with tabix or bcftools?

lindenb commented 2 years ago

@yfarjoun bcftools (but I think both tools now use the same C code for tbi).

lindenb commented 2 years ago

@yfarjoun the C code collecting metadata is here : https://github.com/samtools/htslib/blob/1d79f449cb3b02eda8fc151556395b7b50ccd730/hts.c#L2857

Indexes (both .tbi and .csi) made by tabix include extra data about the indexed file. The function returns a pointer to this data. Note that the data is stored exactly as it is in the index. Callers need to interpret the results themselves, including knowing what sort of data to expect, byte swapping, etc.
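
To illustrate where the per-contig counts come from, here is a rough Java sketch that reads them straight out of a .tbi index. It rests on assumptions not confirmed in this thread: that the index follows the published tabix layout, and that htslib stores a pseudo-bin numbered 37450 for each contig whose second chunk holds the record count in its first 64-bit field. The class name `TbiRecordCounts` and its method are hypothetical, and the input must already be BGZF-decompressed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class TbiRecordCounts {

    // Assumed pseudo-bin number used by htslib for per-contig metadata in .tbi files.
    static final int META_BIN = 37450;

    /** Returns contig name -> record count from the decompressed bytes of a .tbi index. */
    public static Map<String, Long> recordsPerContig(byte[] uncompressedTbi) {
        ByteBuffer buf = ByteBuffer.wrap(uncompressedTbi).order(ByteOrder.LITTLE_ENDIAN);

        byte[] magic = new byte[4];
        buf.get(magic);
        if (magic[0] != 'T' || magic[1] != 'B' || magic[2] != 'I' || magic[3] != 1) {
            throw new IllegalArgumentException("Not a .tbi index");
        }

        int nRef = buf.getInt();
        // Skip format, col_seq, col_beg, col_end, meta, skip (six int32 fields).
        buf.position(buf.position() + 6 * 4);

        // Contig names are NUL-terminated strings concatenated into one block.
        byte[] nameBytes = new byte[buf.getInt()];
        buf.get(nameBytes);
        String[] names = new String(nameBytes, StandardCharsets.US_ASCII).split("\0");

        Map<String, Long> counts = new LinkedHashMap<>();
        for (int ref = 0; ref < nRef; ref++) {
            long records = 0; // stays 0 if the index has no pseudo-bin for this contig
            int nBin = buf.getInt();
            for (int b = 0; b < nBin; b++) {
                int bin = buf.getInt();
                int nChunk = buf.getInt();
                for (int c = 0; c < nChunk; c++) {
                    long chunkBegin = buf.getLong();
                    buf.getLong(); // chunk end, unused here
                    if (bin == META_BIN && c == 1) {
                        records = chunkBegin; // second chunk of the pseudo-bin: record count
                    }
                }
            }
            // Skip the linear index: n_intv followed by that many 64-bit offsets.
            int nIntv = buf.getInt();
            buf.position(buf.position() + 8 * nIntv);
            counts.put(names[ref], records);
        }
        return counts;
    }
}
```

A .tbi file is BGZF-compressed on disk, so the bytes would need to be decompressed first, for example with HTSJDK's BlockCompressedInputStream. This sketch does not handle .csi indexes, which use a different layout, and an index written without the pseudo-bin would yield a count of 0 for that contig.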

bbimber commented 2 years ago

all of our indexes are made by tabix and have this info, which makes sense if bcftools/tabix share the same code