mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

Parser breaks if FASTA file does not contain > #144

Closed Benjamin-Lee closed 6 years ago

Benjamin-Lee commented 6 years ago

Ok this is a weird one: I'm at a hackathon and they handed us "mystery genomes" that were FASTA files with the comment line removed. I tried to use pyfaidx (through squiggle) and got this error:

  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 990, in __init__
    build_index=build_index)
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 423, in __init__
    self.build_index()
  File "/Users/BenjaminLee/Desktop/Python/Research/hackseq18/env/lib/python3.6/site-packages/pyfaidx/__init__.py", line 573, in build_index
    rname, rlen, thisoffset, clen, blen))
TypeError: unsupported format string passed to NoneType.__format__
mdshw5 commented 6 years ago

Well, it's not a FASTA file without the description line. Are we talking about a file that starts with a semicolon (like this example)? In that case I could see adding support for FASTA comments.

If we're talking about file that just contain sequence and no comments or identifiers I doubt there's an indexing strategy for these, since a multi-FASTA file would have no record separator for multiple entries.

If you can provide a bit more detail about how you'd like this supported we can go from there. Thanks!

Benjamin-Lee commented 6 years ago

Ideally, it would parse it as normal. That being said, I understand if you don't think that supporting non properly formatted FASTA files is within the scope or even advisable for this project (we recently realized that Biopython doesn't support it either and fails silently). If so, could we maybe add a specific warning if no > is found rather than a generic error?

mdshw5 commented 6 years ago

Definitely adding better exceptions would be great. Can I have an example of the file format in question?

Benjamin-Lee commented 6 years ago

Sure! The exact file in question can be viewed here.

Basically instead of:

> description
ATGGACAGTA...
GATAGATACC...

it was getting passed:

ATGGACAGTA...
GATAGATACC...
mdshw5 commented 6 years ago

I've added a case for handling files with no valid description lines and pushed a new release (https://github.com/mdshw5/pyfaidx/releases/tag/v0.5.5.1) that should be on PyPI in a few minutes.

Benjamin-Lee commented 6 years ago

Thanks a ton!