Non UTF-8 characters in your database creating parsing errors

carden24 commented 4 years ago

I ran into problems parsing DIAMOND alignments created with the latest version of SUPER-FOCUS (0.34, Apr 2, 2019):

Generating output...  [31.191s]
Traceback (most recent call last):
  File "superfocus_v2.py", line 602, in <module>
    main()
  File "superfocus_v2.py", line 568, in main
    del_alignments)
  File "superfocus_v2.py", line 177, in parse_alignments
    for row in alignment_reader:
  File "/home/erick/edge/edge_v1.5/thirdParty/Anaconda2/envs/superfocus/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 277: invalid start byte

The issue is that one of the sequence headers in the database FASTA files contains a byte that is not valid UTF-8.

I found them using this command:

grep -axv '.*' file.txt

The culprit is this sequence:

fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5MmpS5_efflux_system

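It looks fine at first glance, but if you display the non-printing characters it contains an odd one: M-^V, i.e. byte 0x96 (the high-bit form of ^V, SYN, synchronous idle), which matches the byte reported in the traceback. The trailing $ marks the end of the line.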

grep -axv '.*' 100_clusters.fasta

>fig|419947.9.peg.1104__1009__Mycobacterial_MmpL5_membrane_protein_cluster__Rv0678__MarR_family_transcriptional_regulator_associated_with_MmpL5M-^VMmpS5_efflux_system$

I found this problem in the 100_clusters.fasta file.
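For anyone without a UTF-8 locale for grep, the same check can be sketched in Python. This is a minimal illustration, not part of SUPER-FOCUS: the function name `find_invalid_utf8_lines` is made up for this example, and it only assumes the file can be read in binary mode.

```python
def find_invalid_utf8_lines(path):
    """Return (line_number, byte_offset, offending_byte) for each line
    of the file that cannot be decoded as UTF-8.

    Hypothetical helper for illustration; not a SUPER-FOCUS function.
    """
    bad_lines = []
    # Binary mode: nothing is decoded while reading, so reading never fails.
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                # err.start is the offset of the first undecodable byte in the line.
                bad_lines.append((line_number, err.start, raw[err.start]))
    return bad_lines
```

Calling `find_invalid_utf8_lines("100_clusters.fasta")` should list the header above, with the offending byte reported as 150 (0x96) if it is the same byte as in the traceback.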

This issue can be solved by adding the option " , encoding='ISO-8859-1' " to the open() call in the parse_alignments function of do_alignment.py. Ideally you should try to fix your database issue first.

Before: with open(alignment) as alignment_file:

After: with open(alignment, encoding='ISO-8859-1') as alignment_file:
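To show why this works, here is a small sketch of how the offending byte behaves under different decoders. The byte string is modelled on the header above (only the part around the bad byte), and the variable names are illustrative. The Windows-1252 line is a guess about where the byte came from, not something confirmed in this thread.

```python
# Fragment modelled on the reported header: 0x96 sits between MmpL5 and MmpS5.
raw = b"MmpL5\x96MmpS5_efflux_system"

# Strict UTF-8 decoding fails, as in the traceback.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# ISO-8859-1 maps every byte to exactly one code point, so it never raises.
# The bad byte becomes U+0096, an invisible control character.
latin1 = raw.decode("ISO-8859-1")

# Alternative: stay on UTF-8 but replace undecodable bytes with U+FFFD.
replaced = raw.decode("utf-8", errors="replace")

# Assumption: if the header was originally written as Windows-1252,
# 0x96 would be an en dash between the two gene names.
guessed = raw.decode("cp1252")
```

The same `errors="replace"` argument can be passed to `open()`, which keeps the file handling on UTF-8 and makes the damaged character visible instead of silently turning it into a control character. Either way the parser gets past the line; the database header itself stays unfixed.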

metageni commented 4 years ago

Thanks, @carden24. I will add the change into the next release.

Surprisingly, other SUPER-FOCUS users have formatted the same database file, and this is the first time I have seen this error.

Best

metageni commented 4 years ago

Fixed - Thanks

carden24 commented 4 years ago

It is a very unusual error indeed. You will only see it if you have a hit against that subject in the database. I do not know if it only shows up with my version of Python 3 (3.6.10) or csv (1.0). Thanks for the quick fix. Feel free to close the issue.

metageni commented 4 years ago

gotcha! thanks again.

metageni / SUPER-FOCUS

Non UTF-8 characters in your database creating parsing errors #62