db_download UnicodeDecodeError

rturba commented 1 year ago

Hello,

I am running crabs v.0.1.8 on a Linux HPC. I was trying to download sequences from BOLD using the following command:

(crabs) [rturba@n1935 eDNA]$ crabs-v8 db_download --source bold --database 'Chordata' --output CO1_bold.fasta --keep_original yes --marker 'COI-5P'

However, I'm receiving an encoding error. Has anyone encountered this before? The problem seems to be with the encoding of the downloaded data, maybe in some species name. Is there a way I can work around it on my end?

downloading sequences from BOLD
CRABS_bold_download     [ <=>                ] 493.28M   552KB/s    in 13m 8s
Traceback (most recent call last):
  File "/u/home/r/rturba/bin/crabs-v8", line 1462, in <module>
    main()
  File "/u/home/r/rturba/bin/crabs-v8", line 1459, in main
    args.func(args)
  File "/u/home/r/rturba/bin/crabs-v8", line 127, in db_download
    bold_file = bold_download(DATABASE, MARKER)
  File "/u/home/r/rturba/programs/reference_database_creator/function/module_db_download.py", line 296, in bold_download
    num_bold = len(list(SeqIO.parse(filename, 'fasta')))
  File "/u/home/r/rturba/.conda/envs/crabs/lib/python3.6/site-packages/Bio/SeqIO/Interfaces.py", line 73, in __next__
    return next(self.records)
  File "/u/home/r/rturba/.conda/envs/crabs/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 198, in iterate
    for title, sequence in SimpleFastaParser(handle):
  File "/u/home/r/rturba/.conda/envs/crabs/lib/python3.6/site-packages/Bio/SeqIO/FastaIO.py", line 60, in SimpleFastaParser
    for line in handle:
  File "/u/home/r/rturba/.conda/envs/crabs/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2223: ordinal not in range(128)
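
For context, here is a minimal sketch of what the last traceback frame reports. It assumes the downloaded file is read through a text handle that defaults to ASCII, which the `encodings/ascii.py` frame suggests. The byte 0xc2 is a common lead byte of two-byte UTF-8 sequences, so the file may well be valid UTF-8 that is being decoded as ASCII. The sample header is hypothetical and not taken from the BOLD download.

```python
# Hypothetical FASTA header; 0xC2 0xA0 is the UTF-8 encoding of a
# no-break space, used here only to illustrate the failure.
SAMPLE = b">ID1|COI-5P|Homo\xc2\xa0sapiens\nACGT\n"


def try_decode(data, encoding):
    """Return the decoded text, or a short message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed at byte {}: {}".format(exc.start, exc.reason)


print(try_decode(SAMPLE, "ascii"))        # fails on the 0xC2 byte
print(repr(try_decode(SAMPLE, "utf-8")))  # decodes without error
```

If the real file behaves like this sample, the data itself is probably fine and the decoding step is where things go wrong.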

rturba commented 1 year ago

Strangely, I did get an output CRABS file with 632378 lines. When I search the BOLD DB for Chordata, I get:

Found 594413 published records,
forming 44979 BINs (clusters),
with specimens from 240 countries,
deposited in 880 institutions.

Of these records, 520529 have species names, and represent 39476 species.

So I'm not sure my output is right: it seems to have more records than the DB. I'm also not sure how the error above was handled.
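
One way to compare line counts with record counts is to read the file as raw bytes, so the encoding problem cannot interfere. The sketch below is only an illustration: `count_fasta` is a hypothetical helper, not part of CRABS, and it assumes each record starts with a `>` header line as in standard FASTA.

```python
import os
import tempfile


def count_fasta(path):
    """Count total lines and header lines (starting with '>') in a file.

    The file is read in binary mode, so no text decoding is attempted.
    """
    lines = 0
    records = 0
    with open(path, "rb") as handle:
        for line in handle:
            lines += 1
            if line.startswith(b">"):
                records += 1
    return lines, records


# Self-check on a throwaway file that contains one non-ASCII header.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">a\nACGT\nACGT\n>b\xc2\xa0x\nGG\n")
demo_counts = count_fasta(tmp.name)
os.remove(tmp.name)
print(demo_counts)  # (total lines, header lines)
```

Calling `count_fasta` on the file in question would show whether 632378 refers to lines or to records.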

gjeunen commented 1 year ago

Hello @rturba,

It seems one of the entries you are downloading is not encoded in UTF-8, which is what CRABS expects. Might there have been a connection error? I've rerun your command but could not reproduce the error. Please see the output below:

crabs db_download --source bold --database 'Chordata' --output coi_bold.fasta --keep_original yes --marker 'COI-5P'

downloading sequences from BOLD
CRABS_bold_download.fasta                          [                                                                                 <=>                 ] 493.28M   589KB/s    in 13m 44s 
downloaded 632378 sequences from BOLD
formatting 632378 sequences to CRABS format
 94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏       | 486099367/517241376 [00:04<00:00, 101179175.80it/s]
found 9621 sequences with incorrect accession format
written 601327 sequences to coi_bold.fasta

It seems the error occurred right after downloading the final sequence, as the downloaded file size is identical between our outputs in the Terminal window.

With regard to the CRABS output file, please check that it is correctly formatted given the error you encountered; an incorrectly formatted file might cause problems downstream. Based on the command you ran, CRABS downloads 632,378 sequences from BOLD and, after removing sequences with incorrect formatting, writes 601,327 sequences to the output file. The difference between the online BOLD search (594,413 records) and the sequences CRABS downloaded could result from different settings between the online search tool and the ftp download function.

One bug I notice in the Terminal output is that the reported number of sequences filtered out for incorrect formatting does not match the number of sequences actually removed. This will not affect your results, but I will fix it in the next version.
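
The mismatch can be checked from the numbers in the Terminal output above. This is plain arithmetic on the three reported counts. Why the remaining sequences were dropped is not stated in this thread, so the variable `unexplained` is only a label for the gap.

```python
downloaded = 632378             # "downloaded 632378 sequences from BOLD"
written = 601327                # "written 601327 sequences to coi_bold.fasta"
reported_bad_accession = 9621   # "found 9621 sequences with incorrect accession format"

# Sequences that did not make it into the output file.
removed = downloaded - written

# Portion of the removed sequences not covered by the reported count.
unexplained = removed - reported_bad_accession

print(removed)      # 31051
print(unexplained)  # 21430
```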

Please let me know if you have any further questions.

Best, Gert-Jan

rturba commented 1 year ago

Damn, for some reason I cannot move forward with this. I've tried repeating the command several times with no luck. I've also checked with the help desk from the cluster, and strangely, the IT person was able to complete the download on their account :(

What version of python are you using? Mine is Python 3.6.15.

gjeunen commented 1 year ago

Since another account on the cluster managed to complete the download without issues, I doubt the python version will be the issue. I'm using 3.11.5.

Does this problem persist for other downloads as well, or only BOLD, or only BOLD + Chordata?
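
Since the same download succeeded from another account on the same cluster, it may still be worth comparing the locale settings of the two accounts. Under a C/POSIX locale, Python 3.6 typically decodes text files as ASCII, whereas Python 3.7 and later switch to UTF-8 in that situation (PEP 538 and PEP 540). This is a guess at a possible cause, not something confirmed in this thread. The sketch below prints the relevant settings; `encoding_report` is a hypothetical helper name.

```python
import locale
import os
import sys


def encoding_report():
    """Collect settings that influence how open() decodes text files."""
    return {
        "python": sys.version.split()[0],
        # Default encoding for open() when no encoding= argument is
        # passed, in the Python versions discussed here.
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        # Locale variables; None means the variable is not set.
        "LC_ALL": os.environ.get("LC_ALL"),
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "LANG": os.environ.get("LANG"),
    }


for key, value in encoding_report().items():
    print("{:>20}: {}".format(key, value))
```

If `preferred_encoding` shows an ASCII variant such as `ANSI_X3.4-1968`, exporting a UTF-8 locale before rerunning (for example `LC_ALL=en_US.UTF-8`, if the cluster provides it) may be worth trying.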

rturba commented 1 year ago

Before, I was able to run the NCBI download with no problems. Strangely, I did a test with BOLD using 'Mammalia' as the database, and I was able to download it with no errors 🤔
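
That 'Mammalia' works while 'Chordata' fails suggests that only some records carry non-ASCII bytes. A rough way to locate them is to scan the raw download in binary mode and list the lines that fail ASCII decoding. `find_non_ascii_lines` is a hypothetical helper, not part of CRABS, and the demo file is made up for illustration.

```python
import os
import tempfile


def find_non_ascii_lines(path, limit=10):
    """Return up to `limit` (line_number, raw_line) pairs that are not pure ASCII."""
    hits = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            try:
                line.decode("ascii")
            except UnicodeDecodeError:
                hits.append((number, line.rstrip(b"\r\n")))
                if len(hits) >= limit:
                    break
    return hits


# Self-check on a throwaway file with one non-ASCII header on line 3.
with tempfile.NamedTemporaryFile("wb", suffix=".fasta", delete=False) as tmp:
    tmp.write(b">a|Mus musculus\nACGT\n>b|Homo\xc2\xa0sapiens\nGGCC\n")
demo_hits = find_non_ascii_lines(tmp.name)
os.remove(tmp.name)
print(demo_hits)
```

Running this on the kept original download would show which headers contain the offending bytes, which may help narrow the problem down to particular taxa.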

gjeunen commented 1 year ago

Apologies, not sure what could cause the problem, as we're not able to recreate it. Since it is only that particular download, could you transfer the file from the IT person?

rturba commented 1 year ago

No worries! Thanks for the help, though. If I figure it out I'll let you know.

gjeunen / reference_database_creator

db_download UnicodeDecodeError #41