samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
276 stars 244 forks source link

Question: How to generate FAI files and TBI files #1665

Open Lys9527 opened 1 year ago

Lys9527 commented 1 year ago

I recently used htsjdk 3.0.5 to complete my own project, I would like to ask how htsjdk needs to be used to generate fai files and tbi files, thank you

lindenb commented 1 year ago

FAI:


https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/reference/FastaSequenceIndexCreator.java

TBI:

private static int copyTo(final Path invcf,final Path outvcfpath) throws IOException {
    try(VCFReader vcfReader1  = new VCFFileReader(invcf,false)) {
        final VCFHeader header1 = vcfReader1.getHeader();
        final VariantContextWriterBuilder vcb = new VariantContextWriterBuilder();
            final SAMSequenceDictionary dict1 = header1.getSequenceDictionary();
            vcb.setReferenceDictionary(header1.getSequenceDictionary());
            vcb.setIndexCreator(new TabixIndexCreator(dict1, TabixFormat.VCF));
            vcb.setOption(Options.INDEX_ON_THE_FLY);
            vcb.setOutputPath(outvcfpath);
            vcb.setCreateMD5(false);

            try(final VariantContextWriter w = vcb. build()) {
                w.writeHeader(header1);
                try(final CloseableIterator<VariantContext> iter1 = vcfReader1.iterator()) {
                    while(iter1.hasNext()) {
                        w.add(iter1.next());
                        }
                    }
                }
        }

    return count;
    }
Lys9527 commented 1 year ago

FAI:


https://github.com/samtools/htsjdk/blob/master/src/main/java/htsjdk/samtools/reference/FastaSequenceIndexCreator.java

TBI:

private static int copyTo(final Path invcf,final Path outvcfpath) throws IOException {
  try(VCFReader vcfReader1  = new VCFFileReader(invcf,false)) {
      final VCFHeader header1 = vcfReader1.getHeader();
      final VariantContextWriterBuilder vcb = new VariantContextWriterBuilder();
          final SAMSequenceDictionary dict1 = header1.getSequenceDictionary();
          vcb.setReferenceDictionary(header1.getSequenceDictionary());
          vcb.setIndexCreator(new TabixIndexCreator(dict1, TabixFormat.VCF));
          vcb.setOption(Options.INDEX_ON_THE_FLY);
          vcb.setOutputPath(outvcfpath);
          vcb.setCreateMD5(false);

          try(final VariantContextWriter w = vcb. build()) {
              w.writeHeader(header1);
              try(final CloseableIterator<VariantContext> iter1 = vcfReader1.iterator()) {
                  while(iter1.hasNext()) {
                      w.add(iter1.next());
                      }
                  }
              }
      }

  return count;
  }

Thank you very much for your answer, I have understood the generation of fai file, for tai file, do you need to limit the format of the input file here? For example, gff file, sorry I am not learning this major, but I need to use this function in my project, so the problem is a little rude, please forgive me

lindenb commented 1 year ago

for TBI +BED, you can use

            TabixIndexCreator indexCreator=new TabixIndexCreator(TabixFormat.BED);
            writer = new BlockCompressedOutputStream(this.outputFile);

                long filePosition= writer.getFilePointer();
                while(in.hasNext())
                    {
                    String line = in.next();
                    BedLine bed = this.bedCodec.decode(line);
                    if(bed==null) continue;
                    writer.write(line.getBytes());
                    writer.write('\n');
                    indexCreator.addFeature(bed, filePosition);
                    filePosition = writer.getFilePointer();
                    }
                }
            writer.flush();
            Index index = indexCreator.finalizeIndex(writer.getFilePointer());

it's the same idea for GFF but used a gff3 codec: https://www.javadoc.io/static/com.github.samtools/htsjdk/2.24.1/htsjdk/tribble/gff/Gff3Codec.html

Lys9527 commented 1 year ago

for TBI +BED, you can use

          TabixIndexCreator indexCreator=new TabixIndexCreator(TabixFormat.BED);
          writer = new BlockCompressedOutputStream(this.outputFile);

              long filePosition= writer.getFilePointer();
              while(in.hasNext())
                  {
                  String line = in.next();
                  BedLine bed = this.bedCodec.decode(line);
                  if(bed==null) continue;
                  writer.write(line.getBytes());
                  writer.write('\n');
                  indexCreator.addFeature(bed, filePosition);
                  filePosition = writer.getFilePointer();
                  }
              }
          writer.flush();
          Index index = indexCreator.finalizeIndex(writer.getFilePointer());

it's the same idea for GFF but used a gff3 codec: https://www.javadoc.io/static/com.github.samtools/htsjdk/2.24.1/htsjdk/tribble/gff/Gff3Codec.html

Thank you very much for your detailed answer. If you can give me an example, for example, my file is /path/to/SCHZ.rRNA.gff.gz, how do I need to generate the “/path/to/SCHZ.rRNA.gff.gz.tbi” file?Sorry I'm new to the java building website project, very unproficient, bothering you!

gokalpcelik commented 1 year ago

Using codecs and IndexFactory should do fine I guess. Reiterating over the actual file may not be necessary

Here is my code from my repository

final VCFCodec codec = new VCFCodec();

        if (this.VCFFile.getAbsolutePath().endsWith(FileExtensions.VCF)) {
            idxpath = this.VCFFile.getAbsolutePath() + FileExtensions.TRIBBLE_INDEX;

            try {
                final Index idx = IndexFactory.createDynamicIndex(this.VCFFile, codec,
                        IndexBalanceApproach.FOR_SEEK_TIME);
                idx.write(new File(idxpath));
            } catch (final Exception e) {
                if (e instanceof UnsortedFileException) {
                    OverSeer.log(this.getClass().getSimpleName(),
                            "VCF File is not sorted. Sorting function is currently not implemented. Please sort your vcf using a proper tool.",
                            OverSeer.ERROR);
                } else {
                    e.printStackTrace();
                }

            }

        } else if (this.VCFFile.getAbsolutePath().endsWith(FileExtensions.COMPRESSED_VCF)) {
            idxpath = this.VCFFile.getAbsolutePath() + FileExtensions.TABIX_INDEX;

            try {
                final Index idx = IndexFactory.createIndex(this.VCFFile, codec, IndexType.TABIX);
                idx.write(new File(idxpath));
            } catch (final Exception e) {
                e.printStackTrace();
            }

        }

I bet it will work for GFF with suitable codec within htsjdk.