samtools / samtools

Tools (written in C using htslib) for manipulating next-generation sequencing data
http://htslib.org/

Automatically generate indexes after extracting sequences with faidx (--write-index) #2118

Open fgvieira opened 1 week ago

fgvieira commented 1 week ago

Is your feature request related to a problem? Please specify.

Right now, after extracting sequences with samtools faidx, one has to generate the FAI and GZI files for the output in a separate step.

Describe the solution you would like.

It would be nice if these files could be generated automatically from the FASTA output of samtools faidx (similar to samtools view --write-index):

samtools faidx -r targets.regions --write-index --output targets.fas.gz sequences.fas.gz

daviesrob commented 6 days ago

Thanks for the suggestion, we'll take a look at implementing this.

ASLeonard commented 1 day ago

It is not very generalised, but adding (behind a --write-index flag)

fai_build(output_file);

after exit1 around here https://github.com/samtools/samtools/blob/eb0992ff8a99b895364a7a861b418a1beb77d540/faidx.c#L515 seems to work fine for bgzip or normal output. Surprisingly, it didn't crash when writing to stdout, but maybe it already exits beforehand or fai_build knows to skip indexing stdout.

jkbonfield commented 22 hours ago

We'd come to a similar conclusion: the easy fix of doing a second pass that reads the output back and indexes it is probably the best starting point. It's unlikely that this needs to be highly performant or to avoid two passes (as we manage for BAM, VCF.gz, etc.), so the choice is driven more by a desire for simplicity.
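
To illustrate what a second indexing pass involves, here is a minimal sketch in plain C. It is not the samtools or htslib implementation. The function name index_fasta_second_pass and the fai_rec struct are hypothetical, it handles only uncompressed FASTA (so it produces no GZI file), and it does not check that line lengths are consistent within a record. In samtools itself the second pass would presumably just call htslib's fai_build on the output file, as suggested above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record type mirroring the five columns of a .fai file. */
typedef struct {
    char name[256];
    long length;     /* total number of bases in the sequence */
    long offset;     /* byte offset of the first base */
    long linebases;  /* bases on the first sequence line */
    long linewidth;  /* bytes on the first sequence line, newline included */
} fai_rec;

static void write_rec(FILE *out, const fai_rec *r)
{
    fprintf(out, "%s\t%ld\t%ld\t%ld\t%ld\n",
            r->name, r->length, r->offset, r->linebases, r->linewidth);
}

/* Re-read an already written, uncompressed FASTA file and write a .fai
 * style index for it. Returns the number of records indexed, or -1 if a
 * file could not be opened or closed. */
int index_fasta_second_pass(const char *fasta_path, const char *fai_path)
{
    FILE *in = fopen(fasta_path, "rb");
    if (!in) return -1;
    FILE *out = fopen(fai_path, "w");
    if (!out) { fclose(in); return -1; }

    fai_rec rec;
    memset(&rec, 0, sizeof rec);
    int have_rec = 0, nrec = 0, c;
    long pos = 0;  /* bytes consumed so far */

    /* Each iteration of the outer loop consumes exactly one line;
     * c holds the first byte of that line. */
    while ((c = getc(in)) != EOF) {
        int is_header = (c == '>'), in_name = 1;
        long bases = 0, width = 0;
        size_t n = 0;

        if (is_header) {
            if (have_rec) { write_rec(out, &rec); nrec++; }
            memset(&rec, 0, sizeof rec);
            have_rec = 1;
        }
        for (;;) {
            width++;
            if (c == '\n') break;
            if (is_header) {
                if (width > 1) {  /* skip the leading '>' */
                    /* The name ends at the first whitespace character. */
                    if (c == ' ' || c == '\t' || c == '\r') in_name = 0;
                    else if (in_name && n + 1 < sizeof rec.name)
                        rec.name[n++] = (char)c;
                }
            } else if (c != '\r') {
                bases++;
            }
            if ((c = getc(in)) == EOF) break;
        }
        pos += width;

        if (is_header) {
            rec.offset = pos;  /* sequence starts right after the header */
        } else if (have_rec) {
            rec.length += bases;
            if (rec.linebases == 0) {
                rec.linebases = bases;
                rec.linewidth = width;
            }
        }
    }
    if (have_rec) { write_rec(out, &rec); nrec++; }

    fclose(in);
    if (fclose(out) != 0) return -1;
    return nrec;
}
```

Calling index_fasta_second_pass("targets.fas", "targets.fas.fai") after the extraction has finished would write one index line per extracted sequence. Like any second-pass approach, this needs a real file to reopen, so it cannot index output written to stdout.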