gjeunen / reference_database_creator

creating reference databases for amplicon sequencing
MIT License

Db completeness and taxonomy #17

Closed marymcelroy closed 9 months ago

marymcelroy commented 1 year ago

Hi Hugh and GJ, great to see this program was published recently - congratulations!

I'm having some trouble with the db completeness feature. Even though crabs successfully used my species.txt list for db_subset, it 'cannot find tax ids' for about 20% of the species in the same file when I run db_completeness. My species.txt file includes only accepted scientific names from WoRMS, generated with taxize. The error suggests either a spelling mistake or a synonym:

crabs visualization --method db_completeness --input 18s_pga_subset.tsv --output 18s_pga_dbcomplete.txt --species ca_species.txt --taxid nodes.dmp --name names.dmp

found 901 species of interest in ca_species.txt: ['Abarenicola_pacifica', 'Acanthina_paucilirata', 'Acanthina_punctulata', 'Acanthina_spirata'...

generating taxonomic lineage for 901 species
converting names.dmp to dictionary
converting nodes.dmp to dictionary
did not find a taxonomic ID for Acanthina_paucilirata, please check for spelling mistakes, or synonym names.
did not find a taxonomic ID for Acanthina_punctulata, please check for spelling mistakes, or synonym names...

gathering data for 718 species

The resulting txt file only includes output for 718 of the 901 species. I have another species.txt file that includes all the synonyms from WoRMS matching my accepted species names (4000+ names), but I'm hesitant to use it because this step takes quite a while to run even with 900 species.

gjeunen commented 1 year ago

Hello @marymcelroy,

Thank you very much for your message!

The most likely explanation is that no tax ID exists for the species CRABS couldn't match. I'll change the print statement in the coming weeks to reflect this possibility. CRABS analysis is based on the NCBI taxonomy, and a species without any sequencing data most likely won't have an NCBI taxonomic ID. I checked the two species you mentioned above on the NCBI website and couldn't find a record for either (Acanthina_paucilirata, Acanthina_punctulata). I could, however, find a record for Abarenicola_pacifica, which does not appear in the omission messages printed above. I'm not sure how new tax IDs are created on NCBI, but most likely they need to be associated with a sequence on GenBank?

An alternative for CRABS would be to implement and use the WoRMS taxonomy. However, this would not work for other organisms, as I understand this database is limited to marine species? I'm happy to take suggestions on other data sources that are available and more complete than NCBI. I'll see if those can be implemented in CRABS in a future update.

I'm currently working on more efficient code for this function to reduce computational time, and I hope to implement it in the coming weeks. Based on initial testing, this should allow you to run the 4000+ names without issue.

Best regards, Gert-Jan

marymcelroy commented 1 year ago

Oh, okay! That makes sense, thank you for clarifying. No problem - I just wanted to make sure I hadn't done something wrong. I'll just interpret the output as there being no sequence information for those species (at least by the accepted scientific name I'm using) in NCBI. I guess it's possible there are NCBI sequences attached to a synonym, but that would take some time to figure out with so many potential synonyms. I'd love to try the new function when it's ready!

And yes - WoRMS is limited to marine species, but with taxize there are lots of different taxonomy sources (including ncbi). I don't use the other databases enough to know how complete they are relative to NCBI, but this seems like a good place to start if you're looking for ideas.
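
For readers who want to check whether a missing name is present in NCBI only as a synonym, here is a minimal sketch that reads rows in the names.dmp layout and sorts a species list into scientific-name hits, synonym hits, and misses. It is not how CRABS does its lookup. The function names `parse_names_dmp` and `classify_species`, and the sample rows with their names and tax IDs, are invented for illustration. It assumes the standard names.dmp layout (tax ID, name, unique name, name class, separated by tab-pipe-tab) and that the species list uses underscores where names.dmp uses spaces.

```python
import io

# Invented rows in the names.dmp layout; names and tax IDs are placeholders.
SAMPLE_NAMES_DMP = (
    "101\t|\tExamplea alpha\t|\t\t|\tscientific name\t|\n"
    "102\t|\tExamplea beta\t|\t\t|\tscientific name\t|\n"
    "102\t|\tOldgenus beta\t|\t\t|\tsynonym\t|\n"
    "102\t|\tbeta snail\t|\t\t|\tcommon name\t|\n"
)


def parse_names_dmp(handle, classes=("scientific name", "synonym")):
    """Map each name to (tax_id, name_class) for the requested name classes."""
    lookup = {}
    for line in handle:
        # Drop the trailing "\t|" terminator, then split the row on the pipes.
        fields = [f.strip() for f in line.rstrip("\n").rstrip("|").split("|")]
        if len(fields) < 4:
            continue  # skip malformed or empty rows
        tax_id, name, _unique_name, name_class = fields[:4]
        if name_class not in classes:
            continue
        # If a name occurs in both classes, keep the scientific-name entry.
        if name not in lookup or name_class == "scientific name":
            lookup[name] = (tax_id, name_class)
    return lookup


def classify_species(species, lookup):
    """Group underscore-separated species names by how they matched."""
    report = {"scientific name": [], "synonym": [], "missing": []}
    for entry in species:
        hit = lookup.get(entry.replace("_", " "))
        report[hit[1] if hit else "missing"].append(entry)
    return report


if __name__ == "__main__":
    names = parse_names_dmp(io.StringIO(SAMPLE_NAMES_DMP))
    wanted = ["Examplea_alpha", "Oldgenus_beta", "Examplea_gamma"]
    print(classify_species(wanted, names))
```

With a real taxonomy dump, the `io.StringIO` object would be replaced by an open handle on names.dmp. Names that land in the "missing" group have no scientific-name or synonym entry in that file.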

gjeunen commented 1 year ago

Hello @marymcelroy,

Thanks for the suggestion! I'll have a look at these databases.

Another option would be to generate the info from the genus name, rather than the species name. This would reduce the number of taxa omitted. The only ones not included in the analysis would be the ones where no sequence data is available for any species within the genus. I'll try to see if this works in the coming weeks.

Best regards, Gert-Jan
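
The genus-level fallback described above could look roughly like the sketch below. It only illustrates the idea and is not the CRABS implementation. The function name `resolve_with_genus_fallback` and the `TAXIDS` dictionary, with its names and IDs, are invented. It assumes species names are written as Genus_species, so the genus is the part before the first underscore.

```python
# Invented name-to-tax-ID mapping; in practice this would be built from names.dmp.
TAXIDS = {
    "Examplea": "100",        # genus entry
    "Examplea alpha": "101",  # species entry
}


def resolve_with_genus_fallback(entry, taxids):
    """Return (tax_id, level), trying the species name first and then its genus."""
    name = entry.replace("_", " ")
    if name in taxids:
        return taxids[name], "species"
    genus = name.split(" ", 1)[0]
    if genus in taxids:
        return taxids[genus], "genus"
    # Neither the species nor its genus has an entry.
    return None, "missing"


if __name__ == "__main__":
    for species in ["Examplea_alpha", "Examplea_gamma", "Nogenus_delta"]:
        print(species, resolve_with_genus_fallback(species, TAXIDS))
```

Returning the level alongside the tax ID keeps genus-level matches distinguishable from true species-level matches in any downstream completeness table.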