Closed by rturba 1 month ago
Strangely, I did get an output CRABS file with 632378 lines. When I check the BOLD DB and search for chordata, I get:
Found 594413 published records,
forming 44979 BINs (clusters),
with specimens from 240 countries,
deposited in 880 institutions.
Of these records, 520529 have species names, and represent 39476 species.
So, I'm not sure if my output is right. It seems to have more records than in the DB!? I'm not sure how the error above was handled.
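One way to compare these numbers is to count lines and records separately, since a line count only equals a record count when every record sits on exactly one line. The following is a minimal sketch, assuming the file is FASTA-like with header lines starting with `>`; the function name is hypothetical and not part of CRABS.

```python
def count_lines_and_headers(path):
    """Return (total_lines, header_lines) for a FASTA-like text file.

    Assumption: each record starts with a header line beginning with '>'.
    """
    total_lines = 0
    header_lines = 0
    # Read as bytes so a stray non-UTF-8 byte cannot interrupt the count.
    with open(path, "rb") as handle:
        for line in handle:
            total_lines += 1
            if line.startswith(b">"):
                header_lines += 1
    return total_lines, header_lines
```

Calling `count_lines_and_headers("CO1_bold.fasta")` would then give both figures side by side, which can be set against the numbers CRABS prints in the Terminal.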
Hello @rturba,
It seems one of the entries you are downloading is not encoded in UTF-8, which is the encoding CRABS expects. Might there have been a connection error? I've rerun your code but could not reproduce the error. Please see the output below:
crabs db_download --source bold --database 'Chordata' --output coi_bold.fasta --keep_original yes --marker 'COI-5P'
downloading sequences from BOLD
CRABS_bold_download.fasta [ <=> ] 493.28M 589KB/s in 13m 44s
downloaded 632378 sequences from BOLD
formatting 632378 sequences to CRABS format
94%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 486099367/517241376 [00:04<00:00, 101179175.80it/s]
found 9621 sequences with incorrect accession format
written 601327 sequences to coi_bold.fasta
It seems the error occurred right after downloading the final sequence, as the downloaded file size is identical between our outputs in the Terminal window.
With regard to the CRABS output file, please check that it is correctly formatted after the error you encountered; an incorrectly formatted file might cause problems downstream. Based on the code you ran, CRABS downloads 632,378 sequences from BOLD and, after removing sequences with incorrect formatting, writes 601,327 sequences to the output file. The difference between the online BOLD search (594,413 records) and the sequences downloaded by CRABS could result from different settings between the online search tool and the FTP download function.
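To locate an entry that is not valid UTF-8, the file can be scanned in binary mode and each line decoded on its own. This is a minimal sketch rather than anything CRABS itself provides; the function name and the `limit` parameter are hypothetical.

```python
def find_non_utf8_lines(path, limit=10):
    """Return up to `limit` (line_number, raw_bytes) pairs that fail UTF-8 decoding."""
    bad_lines = []
    # Binary mode avoids any implicit decoding while reading.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                # Keep the raw bytes so the offending entry can be inspected.
                bad_lines.append((number, raw.rstrip(b"\r\n")))
                if len(bad_lines) >= limit:
                    break
    return bad_lines
```

Running it on the file kept by `--keep_original yes` (shown as `CRABS_bold_download.fasta` in the output above) should point to the line numbers of any offending records, if there are any.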
One bug I notice in the Terminal output is that the reported number of sequences filtered for incorrect formatting (9,621) does not match the number actually removed (632,378 downloaded minus 601,327 written, i.e. 31,051). This will not affect your results, but I will fix it in the next version.
Please let me know if you have any further questions.
Best, Gert-Jan
Damn, for some reason I cannot move forward with this. I've tried repeating the command several times with no luck. I've also checked with the help desk from the cluster, and strangely, the IT person was able to complete the download on their account :(
What version of Python are you using? Mine is Python 3.6.15.
Since another account on the cluster managed to complete the download without issues, I doubt the Python version is the issue. I'm using 3.11.5.
Does this problem persist for other downloads as well, or only BOLD, or only BOLD + Chordata?
Before, I was able to run the NCBI download with no problems. Strangely, I did a test with BOLD using 'Mammalia' as the database, and I was able to download it with no errors 🤔
Apologies, not sure what could cause the problem, as we're not able to recreate it. Since it is only that particular download, could you transfer the file from the IT person?
No worries! Thanks for the help, though. If I figure it out I'll let you know.
Hello,
I am running crabs v.0.1.8 on a Linux HPC. I was trying to download sequences from BOLD using the following command:
(crabs) [rturba@n1935 eDNA]$ crabs-v8 db_download --source bold --database 'Chordata' --output CO1_bold.fasta --keep_original yes --marker 'COI-5P'
However, I'm receiving an encoding error. Has anyone encountered this issue before? It seems like the issue is with the encoding in the database, maybe in some species name. Is there a way I can work around this issue on my end?
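One possible workaround, sketched here under assumptions and not taken from CRABS, is to write a cleaned copy of the downloaded file in which any bytes that are not valid UTF-8 are replaced with the Unicode replacement character (U+FFFD). The function name and both paths are hypothetical, and whether CRABS can continue from such a cleaned file is not something this thread establishes.

```python
def write_utf8_copy(src_path, dst_path):
    """Copy src to dst as valid UTF-8; return the number of lines that were altered."""
    changed = 0
    # newline="" keeps the original line endings untouched when writing.
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Replace only the undecodable bytes with U+FFFD.
                text = raw.decode("utf-8", errors="replace")
                changed += 1
            dst.write(text)
    return changed
```

The return value shows how many lines needed repair, so a result of 0 would suggest the file was already valid UTF-8 and the error lies elsewhere.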