mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

UnicodeDecodeError while reading file: how to explicitly set the decode format? #146

Closed LucaCappelletti94 closed 6 years ago

LucaCappelletti94 commented 6 years ago

Hello, I want to parse a FASTA file, but I can't find a way to pass a decoding format (encoding) to the Fasta class. How should I proceed? Thanks!

Traceback (most recent call last):
  File "fasta_editor.py", line 7, in <module>
    for i, gene in genes:
  File "/home/cappelletti/code/.virtualenvs/virtual-py36gpu/lib/python3.6/site-packages/pyfaidx/__init__.py", line 822, in __iter__
    yield self[start:end]
  File "/home/cappelletti/code/.virtualenvs/virtual-py36gpu/lib/python3.6/site-packages/pyfaidx/__init__.py", line 806, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/home/cappelletti/code/.virtualenvs/virtual-py36gpu/lib/python3.6/site-packages/pyfaidx/__init__.py", line 1032, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/home/cappelletti/code/.virtualenvs/virtual-py36gpu/lib/python3.6/site-packages/pyfaidx/__init__.py", line 624, in fetch
    seq = self.from_file(name, start, end)
  File "/home/cappelletti/code/.virtualenvs/virtual-py36gpu/lib/python3.6/site-packages/pyfaidx/__init__.py", line 676, in from_file
    seq = self.file.read(seq_blen).decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xec in position 6: invalid continuation byte

You can get the offending FASTA file by running the command below. Be warned that it is about 60 GB once extracted:

wget ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

My code is just the following:

from pyfaidx import Fasta

path = "Homo_sapiens.GRCh38.dna.toplevel.fa"
genes = Fasta(path)

for i, gene in genes:
    if i > 2:
        break
    print(gene)
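
A side note on the loop above, offered as an observation rather than a diagnosis: iterating over a Fasta object yields records, not (index, record) pairs, so `for i, gene in genes` tries to unpack each record. That matches the traceback, where the `for` line leads straight into the record's own `__iter__`. If the intent was to number the records and stop after the first three, a minimal sketch with `enumerate` and `itertools.islice` could look like this. The helper name `first_records` is hypothetical and not part of pyfaidx.

```python
from itertools import islice


def first_records(records, limit=3):
    """Return (index, record) pairs for the first `limit` items of any iterable.

    Hypothetical helper for illustration; it is not part of pyfaidx.
    """
    # islice stops after `limit` items, so the rest of the file is never read.
    return list(enumerate(islice(records, limit)))


# Assumed usage with pyfaidx (requires the package and the file to be present):
# from pyfaidx import Fasta
# for i, gene in first_records(Fasta("Homo_sapiens.GRCh38.dna.toplevel.fa")):
#     print(i, gene.name)
```

This does not explain the UnicodeDecodeError itself; it only avoids the unintended unpacking.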

mdshw5 commented 6 years ago

I’ll try to reproduce this and let you know what I find. Thanks!

LucaCappelletti94 commented 6 years ago

Somehow the same code now works on the very same file, so I believe the fault lay with some other variable. Sorry for possibly having wasted your time.

mdshw5 commented 6 years ago

No problem. These things happen. Probably some character was corrupted during the download and ended up outside the ASCII set. Glad you sorted it out!
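
For anyone who hits the same error, one way to test the corrupted-download hypothesis is to scan the file in binary mode for bytes outside the ASCII range and report their offsets. The sketch below uses only the standard library; the function name `find_non_ascii` and its parameters are made up for illustration and are not part of pyfaidx.

```python
import re

# Any byte with the high bit set is outside the 7-bit ASCII range.
NON_ASCII = re.compile(rb"[\x80-\xff]")


def find_non_ascii(path, limit=10, chunk_size=1 << 20):
    """Return up to `limit` (byte_offset, byte_value) pairs for non-ASCII bytes.

    Hypothetical helper for illustration. The file is read in chunks so that
    a very large FASTA file does not have to fit in memory.
    """
    hits = []
    offset = 0  # absolute position of the current chunk within the file
    with open(path, "rb") as handle:
        while len(hits) < limit:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            for match in NON_ASCII.finditer(chunk):
                hits.append((offset + match.start(), match.group()[0]))
                if len(hits) >= limit:
                    break
            offset += len(chunk)
    return hits
```

An empty result suggests the file itself is clean and the problem lies elsewhere; a non-empty one gives the offsets to inspect, and re-downloading the archive would be the obvious next step.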