mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

Feature request: get sequence length #176

Closed marco-mariotti closed 3 years ago

marco-mariotti commented 3 years ago

Hi again! In various settings, one may want to know the length of a sequence in the FASTA file without reading the sequence itself. For example, when manipulating genomic coordinate structures such as gene annotations, it is easy to run out of bounds when extending coordinates, so the chromosome length must be checked. Could you implement such a method (e.g. get_seq_len(self, name)) on the Fasta object? From a glance at the pyfaidx code, the Faidx object stores sequence lengths in memory, so it should be trivial to access them without fetching the sequences.

Thanks in advance!

mdshw5 commented 3 years ago

I'm glad to report that this feature already exists 😉. You can call len() on a FastaRecord to get the pre-calculated sequence length from the index. FastaRecord implements this here:

https://github.com/mdshw5/pyfaidx/blob/d35a73c8fb0617c3d679e6b3791e94b98c1446ad/pyfaidx/__init__.py#L869-L870

Instead of your proposed get_seq_len(self, name) method on the Fasta object, you can simply call len() on the record you get by looking up the sequence name:

from pyfaidx import Fasta

example = Fasta("example.fa")
example1_len = len(example["1"])  # length of sequence "1", taken from the index
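For the coordinate-extension use case from the original request, a bounds check could look roughly like the sketch below. The helper name extend_interval and the 0-based, half-open coordinate convention are assumptions made for illustration; they are not part of pyfaidx. The function only needs an object that supports genome[name] and len(), so a Fasta object should work, and so does a plain dict of strings.

```python
def extend_interval(genome, name, start, end, flank):
    """Extend the 0-based, half-open interval [start, end) on sequence `name`
    by `flank` bases on each side, clamped to the sequence bounds.

    `genome` is anything that supports genome[name] and len() on the result,
    e.g. a pyfaidx Fasta object (hypothetical helper, not a pyfaidx API).
    """
    # With pyfaidx, len() on the record is answered from the index.
    seq_len = len(genome[name])
    if not 0 <= start <= end <= seq_len:
        raise ValueError(
            f"interval [{start}, {end}) is outside {name!r} (length {seq_len})"
        )
    new_start = max(0, start - flank)
    new_end = min(seq_len, end + flank)
    return new_start, new_end


# Usage with pyfaidx (assumes example.fa contains a sequence named "1"):
#   from pyfaidx import Fasta
#   genome = Fasta("example.fa")
#   start, end = extend_interval(genome, "1", 100, 200, flank=500)
```

Passing the genome object in, rather than a precomputed length, keeps the lookup and the clamping in one place.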
marco-mariotti commented 3 years ago

Terrific, thanks!