gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

Download all eukaryote COI sequences from BOLD #47

Open timz0605 opened 7 months ago

timz0605 commented 7 months ago

Hello! I am working on a metabarcoding project for marine samples, and I am trying to build a reference database for taxonomic assignment. I would like to download all COI sequences for eukaryotes from BOLD. However, I am having difficulty doing that with the sample code provided. Could you tell me if that is doable through crabs?

The code I was trying to use

crabs db_download --source bold --output bold.fasta --keep_original yes --boldgap DISCARD --marker 'COI-5P'

I have to say that the search function on BOLD is difficult to navigate, as it does not recognize terms such as "eukaryote" or "animal/metazoa" the way GenBank does. Another approach I can think of is to download all available sequences from BOLD and do some filtering or subsetting on my end, e.g., filtering out insect species, since they do not live in the ocean. After that, I could use the import function in crabs to import the BOLD data. However, it seems that this code does not work either...

crabs db_import --input BOLD_COI.fasta --output BOLD_COI_crabs.fasta --seq_header BOLD

formatting BOLD downloaded sequences to CRABS format
 84%|███████████████████████████████████████████████████▌         | 1144645505/1354666231 [01:24<00:15, 13579838.83it/s]
found 1723408 sequences in BOLD_COI.fasta
found 1723408 sequences with incorrect format
written 0 sequences to BOLD_COI_crabs.fasta

timz0605 commented 7 months ago

To follow up, I am encountering issues with importing the database even though the fasta file is downloaded from BOLD itself...

crabs db_import -i BOLD_Public.26-Jan-2024.fasta -o BOLD_database.fasta -s BOLD

formatting BOLD downloaded sequences to CRABS format
 85%|███████████████████████████████████████████▏       | 6189134698/7319208542 [03:04<00:33, 33549554.93it/s]
found 9734019 sequences in BOLD_Public.26-Jan-2024.fasta
found 9734019 sequences with incorrect format
written 0 sequences to BOLD_database.fasta

gjeunen commented 7 months ago

Hello @timz0605,

Thank you for using CRABS!

Can you please provide me with the format of the file you would like to import (BOLD_Public.26-Jan-2024.fasta)? Easiest will be to print the output from the head command in the Terminal window.

Best, Gert-Jan

timz0605 commented 7 months ago

Hello and thank you for the response! Below is the output when I head -10:

>AAASF001-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None
AACATTATATTTTATTTTTGGAGCCTGAGCAGGAATAGTGGGAACATCTTTAAGAATTTTAATTCGAGCAGAATTAGGTCACCCCGGTGCTTTAATTGGTGATGATCAAATTTATAATGTTATTGTTACAGCTCATGCATTTGTAATAATTTTTTTTATAGTTATACCTATTATAATTGGAGGATTTGGTAACTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCATTCCCTCGAATAAATAATATAAGATTTTGACTTTTACCCCCCTCTCTTACTCTCCTTCTTACAAGAAGTATAGTTGAAACTGGGGCAGGAACAGGATGAACTGTTTATCCACCTCTTTCAAGAAATATTGCCCATAGAGGAGCTTCTGTTGATTTAGCAATTTTTTCCCTACATTTAGCCGGGATTTCATCTATTCTTGGAGCAGTAAATTTTATTACTACAGTTATTAATATACGATCTGCTGGAATTACATTAGATCGAATACCTTTATTTGTTTGATCTGTAATAATTACTGCGGTACTTCTATTATTATCATTACCTGTTTTAGCAGGTGCAATTACAATACTTCTAACTGATCGTAATCTAAATACTTCTTTTTTTGACCCTGCGGGAGGTGGGGATCCAATTTTATATCAACATTTATTT
>AAASF004-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGGGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT
>AAASF005-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAACACTTATTT
>AAASF006-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGGGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGGCTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGGGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT
>AAASF007-17|COI-5P|Mexico|Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia longipalpis,None
GACTTTATATTTTATTTTCGGGGCTTGATCTGGAATAGTAGGGACATCCTTAAGAATTTTAATTCGAGCTGAACTCGGGCATCCTGGAGCATTAATTGGTGATGATCAAATTTATAATGTAATTGTTACAGCCCATGCTTTTGTAATAATTTTTTTTATAGTAATACCTATCATAATTGGAGGATTCGGAAATTGATTAGTTCCTTTAATATTAGGAGCCCCTGATATAGCTTTTCCTCGAATAAATAATATAAGATTCTGACTTTTACCTCCATCTTTAACTTTATTATTAACTAGAAGTATAGTAGAAACTGGAGCAGGAACAGGTTGAACTGTCTACCCACCTTTATCTAGAAATATTGCCCATAGAGGAGCTTCAGTTGATTTAGCAATTTTTTCCCTTCATTTAGCTGGAATTTCATCTATTTTAGGAGCAGTAAATTTTATTACTACAGTAATTAATATGCGATCAACAGGAATTACTTTAGACCGAATACCATTATTTGTCTGATCTGTCGTAATTACTGCAGTTCTTTTATTATTATCTCTCCCTGTTCTAGCAGGAGCTATTACTATACTTTTAACTGATCGAAATCTAAATACTTCTTTTTTTGATCCTGCTGGAGGTGGTGACCCCATTTTATACCAGCACTTATTT

Those were downloaded through this website: https://boldsystems.org/index.php/datapackages
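With the header layout visible, the filtering idea mentioned earlier in the thread (dropping groups such as insects before import) could be sketched roughly as below. This is not a CRABS feature; `filter_fasta_by_taxon` and the file names are hypothetical, and the sketch assumes every header ends with a `|`-separated field holding the comma-separated taxonomy, as in the five records above.

```python
def filter_fasta_by_taxon(in_path, out_path, excluded=("Insecta",)):
    """Copy FASTA records whose taxonomy contains none of the excluded names.

    Assumes headers look like '>ID|marker|country|Kingdom,Phylum,...' (the
    layout in the head output above). Returns (kept, dropped) record counts.
    """
    excluded = set(excluded)
    kept = dropped = 0
    keep_record = False
    # Stream line by line so multi-gigabyte files never sit in memory.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The last '|' field is the comma-separated taxonomy.
                taxonomy = line.strip().split("|")[-1].split(",")
                keep_record = excluded.isdisjoint(taxonomy)
                if keep_record:
                    kept += 1
                else:
                    dropped += 1
            if keep_record:
                dst.write(line)
    return kept, dropped
```

For example, `filter_fasta_by_taxon("BOLD_Public.26-Jan-2024.fasta", "BOLD_no_insects.fasta")` would drop every record whose taxonomy lists `Insecta`. Matching is on exact names, so other terrestrial groups would each need to be added to `excluded`.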

gjeunen commented 7 months ago

Hello @timz0605,

No sequences are written to the output file because the ID in your headers contains commas. For sequence 1, the ID is read as Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None. This should be reformatted to Lutzomyia cruciata for crabs db_import to work.

Best, Gert-Jan

timz0605 commented 7 months ago

Hi @gjeunen,

So, should I keep the BOLD accession number, gene region, and location info at the front, and keep only the species name at the end? For example, use a few commands to delete Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia, and keep only the species name, so that crabs db_import works?

Besides that, I was wondering how crabs deals with multiple sequences for the same species, and with samples that do not have a low enough taxonomic identification (e.g., information only available at the family or order level).
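To get a feel for how common both cases are in a download, the headers can be tallied before import. The sketch below does not show how CRABS itself handles them; `summarize_species` is a hypothetical helper, and it assumes the species name is the second-to-last entry of the comma-separated taxonomy, with `None` marking a missing rank, as in the `head` output above.

```python
from collections import Counter


def summarize_species(in_path):
    """Count sequences per species and records lacking a species-level name.

    Assumes headers look like '>ID|marker|country|...,Genus,Species,None',
    where the second-to-last taxonomy entry is the species name.
    """
    per_species = Counter()
    without_species = 0
    with open(in_path) as src:
        for line in src:
            if not line.startswith(">"):
                continue  # sequence lines carry no taxonomy
            taxonomy = line.strip().split("|")[-1].split(",")
            species = taxonomy[-2].strip() if len(taxonomy) >= 2 else ""
            if species in ("", "None"):
                without_species += 1
            else:
                per_species[species] += 1
    return per_species, without_species
```

On the five records shown above, this would count one sequence for Lutzomyia cruciata, four for Lutzomyia longipalpis, and no records without a species name.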

Thank you!

gjeunen commented 7 months ago

Hello @timz0605,

Please only change Animalia,Arthropoda,Insecta,Diptera,Psychodidae,Phlebotominae,Lutzomyia,Lutzomyia cruciata,None to Lutzomyia cruciata. The rest of the structure within the header can stay the same.
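A minimal sketch of that header rewrite in Python, outside of CRABS itself. It assumes every header has exactly four `|`-separated fields with the comma-separated taxonomy last, and that the species name is the second-to-last taxonomy entry, as in the `head` output above. The function names and file names are hypothetical, and skipping records that have no species name is a choice made for this sketch, not documented CRABS behaviour.

```python
def species_only_header(header):
    """Reduce the taxonomy field of a BOLD data-package header to the species.

    Returns None if the header does not match the assumed four-field layout
    or carries no species-level name.
    """
    fields = header.strip().split("|")
    if len(fields) != 4:
        return None
    taxonomy = fields[3].split(",")
    if len(taxonomy) < 2:
        return None
    species = taxonomy[-2].strip()  # last entry is 'None' in the examples
    if species in ("", "None"):
        return None
    return "|".join(fields[:3] + [species])


def rewrite_fasta(in_path, out_path):
    """Stream in_path to out_path with rewritten headers.

    Records without a usable species name are skipped. Returns (kept, skipped).
    """
    kept = skipped = 0
    keep_record = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                new_header = species_only_header(line)
                keep_record = new_header is not None
                if keep_record:
                    dst.write(new_header + "\n")
                    kept += 1
                else:
                    skipped += 1
            elif keep_record:
                dst.write(line)  # sequence lines are copied unchanged
    return kept, skipped
```

For the first record above, `species_only_header` returns `>AAASF001-17|COI-5P|Mexico|Lutzomyia cruciata`. The file written by `rewrite_fasta` could then be passed to `crabs db_import` with the same options as before.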

Best, Gert-Jan