samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
278 stars 244 forks source link

Question: Tabix index to find min(POS) and max(POS) in each contig #1652

Open lindenb opened 1 year ago

lindenb commented 1 year ago

Hi all, just a question about the API. I want to scan a vcf.gz and it's tbi to extract the min/max variant.POS for each contig.

I'm not sure if I should use the Chunk, the LinearIndex etc..

So here is a snippet of my code so far where I use the first and last entries of the linearIndex.

  final TabixIndex tbi = new TabixIndex(...)
  final List<String> contigs = tbi.getSequenceNames();
  final BinningIndexContent[] binIndexContents = tbi.getIndices();
  (...)
  //loop over each contig
  for(int tid = 0; tid < contigs.size();tid++) {
    // name of the current chromosome
    final String contig = contigs.get(tid);
    // linear index for the current chromosome
    final LinearIndex linearIndex = binIndexContents[tid].getLinearIndex();

        // virtual offset for the first variant. Is it OK ? are the offset ordered ?
    long offset=linearIndex.get(0);
    blocCompressedInputStream.seek(offset);
        // extract first variant
    li = vcfCodec.makeSourceFromStream(blocCompressedInputStream);
    if(!li.hasNext()) continue;
    final VariantContext firstVariant = vcfCodec.decode(li);
    if(firstVariant==null) {
        System.err.println("No variant for "+contig);
        continue;
        }
     // virtual offset for the last variant.
    offset=linearIndex.get(linearIndex.size()-1);
    blocCompressedInputStream.seek(offset);
    li = vcfCodec.makeSourceFromStream(blocCompressedInputStream);
    VariantContext lastVariant = firstVariant;
    while(li.hasNext()) {
        final VariantContext ctx2 = vcfCodec.decode(li);
        if(ctx2==null) break;
        if(!ctx2.getContig().equals(contig)) break;
        lastVariant = ctx2;
        }
(...)

does it look ok ? or should I use another method ? another class ?