mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx

getting UnicodeDecodeError when running faidx with a bed file input #217

Closed yonniejon closed 7 months ago

yonniejon commented 7 months ago

Hi!

I am running faidx version 0.7.2.1

I am running it with a bed file input like so: faidx hg19/genome.fa.gz -b tmp.bed.gz

where tmp.bed.gz looks like:

chr6 132891948 132892108
chr10 127585142 127585221

I get the following error:

Traceback (most recent call last):
  File "/cs/usr/jrosensk/.local/bin/faidx", line 8, in <module>
    sys.exit(main())
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 26, in write_sequence
    for region in regions_to_fetch:
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I assume the problem is that my bed file has the "chr" prefix? That would be a problem, because my genome file has the "chr" prefix as well. Is there a way around this, or do I need to change the reference .fa file?

yonniejon commented 7 months ago

So the problem was not the "chr" prefix. I removed the "chr" prefixes from both my bed file and my fasta reference file, and the problem persists.

mdshw5 commented 7 months ago

This means that you have a non-UTF-8 character at the beginning of your file. Did you by chance export this from MS Excel as UTF-16? If so, you need to convert your file to UTF-8 encoding. You can also export from Excel directly in UTF-8 encoding.

yonniejon commented 7 months ago

I did not. I ran nano tmp.bed and pasted the following contents exactly:

chr6 132891948 132892108
chr10 127585142 127585221

mdshw5 commented 7 months ago

Just to confirm - you have said:

where tmp.bed.gz looks like:

chr6 132891948 132892108
chr10 127585142 127585221

Do you mean that the tmp.bed file contains this, and you have also gzipped it? If so I think I understand the issue. The --bed option does not handle gzipped input. If you want to pass a gzipped file you could do:

$ faidx hg19/genome.fa.gz -b - < <(gzip -dc tmp.bed.gz)

The above would use a sub shell to decompress your bed file and send it to stdin, which can be read by the --bed argument using the "-" symbol. You could alternatively pass an uncompressed bed file.
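
For context, the byte 0x8b in the UnicodeDecodeError above is the second byte of the gzip magic number (0x1f 0x8b), which is why a gzipped bed file fails when it is read as UTF-8 text. The following is a minimal Python sketch of how a script could check for those bytes and decompress the bed file itself before handing regions to pyfaidx. The helper names `open_bed_text` and `read_regions` are hypothetical and are not part of pyfaidx.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip (and BGZF) stream


def open_bed_text(path):
    """Open a bed file as text, decompressing it if it starts with the gzip magic bytes.

    Hypothetical helper for illustration; not part of pyfaidx.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_regions(path):
    """Return a list of (chrom, start, end) tuples from a plain or gzipped bed file."""
    regions = []
    with open_bed_text(path) as bed:
        for line in bed:
            # Skip blank lines, comments, and UCSC header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            # A record with fewer than three columns raises ValueError here.
            chrom, start, end = line.split()[:3]
            regions.append((chrom, int(start), int(end)))
    return regions
```

With the two-line bed file from this thread, `read_regions("tmp.bed")` and `read_regions("tmp.bed.gz")` would both give the same two regions.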

yonniejon commented 7 months ago

"Do you mean that the tmp.bed file contains this, and you have also gzipped it?"

Yes you are correct. But I only gzipped it because when I ran it without gzip/bgzip I got the following error:

faidx genome.fa.gz -b tmp.bed

Traceback (most recent call last):
  File "/cs/usr/jjj/.local/bin/faidx", line 8, in <module>
    sys.exit(main())
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 202, in main
    write_sequence(args)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 53, in write_sequence
    for line in fetch_sequence(args, fasta, name, start, end):
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/cli.py", line 70, in fetch_sequence
    sequence = fasta[name][start:end]
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 920, in __getitem__
    return self._fa.get_seq(self.name, start + 1, stop)[::step]
  File "/cs/usr/jrosensk/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 1149, in get_seq
    seq = self.faidx.fetch(name, start, end)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 727, in fetch
    seq = self.from_file(name, start, end)
  File "/cs/usr/jjj/.local/lib/python3.9/site-packages/pyfaidx/__init__.py", line 769, in from_file
    self.file.seek(i.offset)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 650, in seek
    self._load_block(start_offset)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 611, in _load_block
    block_size, self._buffer = _load_bgzf_block(handle, self._text)
  File "/usr/lib/python3/dist-packages/Bio/bgzf.py", line 444, in _load_bgzf_block
    raise ValueError(
ValueError: A BGZF (e.g. a BAM file) block should start with b'\x1f\x8b\x08\x04', not b'\xea^\x8b\xb0'; handle.tell() now says 16541

mdshw5 commented 7 months ago

Ah I see. That error message is telling you that the FASTA file cannot be compressed with plain gzip, because random access needs the file to be split into BGZF blocks. You can however use block-gzip (bgzip) compression to compress the FASTA file. See https://www.htslib.org/doc/bgzip.html
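
To tell the two formats apart before running faidx, one option is to inspect the file header. The sketch below assumes the BGZF layout described in the SAM/BAM specification: a gzip header with the FEXTRA flag set and an extra subfield whose identifier is "BC" with a 2-byte payload. The function name `classify_compression` is hypothetical and is not part of pyfaidx or Biopython.

```python
import struct


def classify_compression(path):
    """Return "bgzf", "gzip", or "uncompressed" based on the file header.

    Hypothetical helper for illustration; it only inspects the first block.
    """
    with open(path, "rb") as handle:
        header = handle.read(12)
        # Every gzip member starts with the magic bytes 0x1f 0x8b.
        if len(header) < 2 or header[:2] != b"\x1f\x8b":
            return "uncompressed"
        # BGZF requires the deflate method (8) and the FEXTRA flag (bit 0x04).
        if len(header) < 12 or header[2] != 8 or not header[3] & 0x04:
            return "gzip"
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)

    # Walk the extra subfields looking for the BGZF marker: SI1='B', SI2='C', SLEN=2.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return "bgzf"
        pos += 4 + slen
    return "gzip"
```

A FASTA file that reports "gzip" here would need to be decompressed and recompressed with bgzip before it can be indexed for random access.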

yonniejon commented 7 months ago

Got it! Thanks. Sorry about the confusion!

mdshw5 commented 7 months ago

No worries - glad to help!