timz0605 closed this issue 1 month ago
To follow up, I am encountering issues with importing the database even though the fasta file is downloaded from BOLD itself...
crabs db_import -i BOLD_Public.26-Jan-2024.fasta -o BOLD_database.fasta -s BOLD
formatting BOLD downloaded sequences to CRABS format
85%|███████████████████████████████████████████▏ | 6189134698/7319208542 [03:04<00:33, 33549554.93it/s]
found 9734019 sequences in BOLD_Public.26-Jan-2024.fasta
found 9734019 sequences with incorrect format
written 0 sequences to BOLD_database.fasta
Hello @timz0605,
Thank you for using CRABS!
Can you please provide me with the format of the file you would like to import (BOLD_Public.26-Jan-2024.fasta)? The easiest way is to print the output of the head command in the Terminal window.
Best, Gert-Jan
Hello and thank you for the response! Below is the output of head -10:
>AAASF001-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None
AACATTATATTTTATTTTTGGAGCCTGAGCAGGAATAGTGGGAACATCTTTAAGAATTTTAATTCGAGCAGAATTAGGTCACCCCGGTGCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACAGCTCATGCATTTGTAATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAACTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCATTCCCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCCTCTCTTACTCTCCTTCTTACAAGAAGTATAGTTGAAACTGGGGCAGGAACAGGATGAACTGTTTATCCACCTCTTTCAAGAAATATTGCCCATAGAGGAGCTTCTGTTGATTTAGCAATTTTTTCCCTACATTTAGCCGGGATTTCATCTATTCTTGGAGCAGTAAATTTTATTACTACAGTTATTAATATACGATCTGCTGGAATTACATTAGATCGAATACCTTTATTTGTTTGATCTGTAATAATTACTGCGGTACTTCTATTATTATCATTACCTGTTTTAGCAGGTGCAATTACAATACTTCTAACTGATCGTAATCTAAATACTTCTTTTTTTGACCCTGCGGGAGGTGGGGATCCAATTTTATATCAACATTTATTT
>AAASF004-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGGGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT
>AAASF005-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAACACTTATTT
>AAASF006-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGGGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGGCTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGGGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT
>AAASF007-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT
Those were downloaded through this website: https://boldsystems.org/index.php/datapackages
Hello @timz0605,
No sequences are written to the output file because the ID in your headers contains commas (,). For sequence 1, the ID is identified as Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None. This should be reformatted to Lutzomyia cruciata for crabs db_import to work.
Best, Gert-Jan
Hi @gjeunen,
So, should I keep the BOLD accession number, gene region, and location info at the front, and only keep the species name at the end? For example, should I use a few commands to keep only the species name and delete Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia, for crabs db_import to work?
Besides that, I was wondering how crabs deals with multiple sequences for the same species, and with samples that do not have a low enough taxonomic identification (e.g., information only available at the family or order level).
Thank you!
Hello @timz0605,
Please only change Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None to Lutzomyia cruciata. The rest of the structure within the header can stay the same.
Best, Gert-Jan
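The header change described above can be sketched in Python. This is a minimal sketch, not part of CRABS: the function names simplify_bold_header and reformat_fasta are hypothetical, and it assumes every header follows the layout shown in the head output earlier in the thread (four |-separated fields, the last being a nine-level comma-separated lineage with the species name in the eighth position). Headers that do not match that layout are left unchanged so they can be inspected separately.

```python
def simplify_bold_header(header: str) -> str:
    """Replace the comma-separated lineage in a BOLD header with the species name.

    Assumed layout (taken from the example headers in this thread):
    >ID|marker|country|kingdom,phylum,class,order,family,subfamily,genus,species,subspecies
    """
    fields = header.rstrip("\n").split("|")
    lineage = fields[-1].split(",")
    # Only rewrite headers with the expected nine lineage levels;
    # anything else is returned unchanged.
    if len(lineage) == 9:
        fields[-1] = lineage[7]  # eighth level holds the species name
    return "|".join(fields)


def reformat_fasta(in_path: str, out_path: str) -> None:
    """Stream a FASTA file line by line, rewriting headers only."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(simplify_bold_header(line) + "\n")
            else:
                dst.write(line)  # sequence lines pass through untouched
```

Calling reformat_fasta("BOLD_Public.26-Jan-2024.fasta", "BOLD_reformatted.fasta") would stream the file without loading it into memory; the output filename here is only an example. The reformatted file would then be the input to crabs db_import.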
Hello! I am working on a metabarcoding project for marine samples, and I am trying to build my reference database for taxonomic assignment. I would like to download all COI sequences for eukaryotes from BOLD. However, I am having some difficulties trying that approach through the sample code provided. Could you tell me if that's something doable through crabs?
The code I was trying to use
I have to say that the search function on BOLD is difficult to navigate, as it does not recognize terms such as "eukaryote" or "animal/metazoa" the way GenBank does. Another approach I can think of is to download all available sequences from BOLD and do some filtering or subsetting on my end, e.g., filter out insect species since they do not live in the ocean. After that, I could try the import function in crabs to import the BOLD data. However, it seems that this code does not work either...
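The filtering step mentioned above (dropping insects before import) could be sketched as follows. This is a hedged illustration rather than a CRABS feature: read_fasta, lineage_of, and drop_taxa are hypothetical helper names, and the sketch assumes BOLD headers end in a |-separated field that holds a comma-separated lineage (for example ...,Arthropoda,Insecta,...). It has to run on the original headers, while the full lineage is still present.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def lineage_of(header: str) -> list:
    """Return the taxonomic levels from the last |-separated header field."""
    return header.split("|")[-1].split(",")


def drop_taxa(records: Iterable[Tuple[str, str]],
              excluded: Set[str]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose lineage contains none of the excluded names."""
    for header, seq in records:
        if not excluded.intersection(lineage_of(header)):
            yield header, seq
```

For the marine use case described here, passing excluded={"Insecta"} to drop_taxa would remove every record whose lineage lists Insecta; the surviving records can be written back out as header and sequence lines.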