Closed gaofeng-ni closed 3 years ago
Hello, we don't currently have a precompiled DIAMOND database for Archaea, but it should be relatively straightforward to assemble. You would just download the appropriate protein FASTA files (from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/archaea ), cat them together, and then use the diamond makedb command.
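A rough sketch of those steps, in case it helps. The helper name combine_fasta, the output names, and the RefSeq file name pattern in the commented usage are assumptions for illustration, so check the FTP listing before relying on them. The files are decompressed into one plain FASTA file before being handed to diamond makedb.

```shell
# Decompress one or more gzipped FASTA files into a single plain FASTA file.
# Usage: combine_fasta OUTPUT.faa INPUT1.faa.gz [INPUT2.faa.gz ...]
# (combine_fasta is a hypothetical helper, not part of DIAMOND or SAMSA2.)
combine_fasta() {
    out="$1"
    shift
    gzip -dc "$@" > "$out"
}

# Example run, commented out because it needs network access and DIAMOND.
# The file name pattern below is an assumption; verify it on the FTP site.
# wget "ftp://ftp.ncbi.nlm.nih.gov/refseq/release/archaea/archaea.*.protein.faa.gz"
# combine_fasta archaea.faa archaea.*.protein.faa.gz
# diamond makedb --in archaea.faa -d archaea
```

Passing files from more than one directory to combine_fasta gives a single combined FASTA file in the same way.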
You could even build a combined Archaea and Bacteria database by concatenating the files from both sources in the same way.
One challenge with the UniRef database is that its description line is in a different format, which means that parsing won't necessarily work in the downstream Python scripts. A possible enhancement could be to create a second set of scripts that can parse the UniRef headers.
Hi again,
Thanks for the tips. Just a follow-up: I had previously downloaded the NCBI nr database from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ , and built a DIAMOND db from the downloaded nr.gz file:
diamond makedb --in nr.gz -d nr
Can this file, nr.dmnd, work directly with SAMSA2? I believe nr is part of the refseq framework, right? https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/
Many thanks, Gaofeng
Hi Gaofeng,
Yes, this file should work if the sequence header lines follow the same format, which looks like:
>WP_057199767.1 MULTISPECIES: bifunctional aconitate hydratase 2/2-methylisocitrate dehydratase [Acidovorax]
This corresponds to:
>sequence_ID function_of_sequence [organism]
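One quick way to check a FASTA file against that layout before building a database is to count the header lines that do not fit it. This is only a sketch: count_bad_headers is a hypothetical helper, and the pattern is my reading of the format above (an ID, a space, a description, then a final bracketed organism), not the exact rule the SAMSA2 scripts apply.

```shell
# Print how many header lines in a FASTA file do NOT look like
#   >sequence_ID function_of_sequence [organism]
# Usage: count_bad_headers FILE   (use "-" to read from standard input)
# (count_bad_headers is a hypothetical helper, not part of SAMSA2.)
count_bad_headers() {
    awk '/^>/ && !/^>[^ ]+ .+ \[[^]]+\]$/ { bad++ } END { print bad + 0 }' "$1"
}

# Example: sample the start of a gzipped database without unpacking all of it.
# gzip -dc nr.gz | head -n 100000 | count_bad_headers -
```

A result of 0 on a sample suggests the headers fit the expected layout; headers without a trailing [organism] field are counted as mismatches.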
Hi there,
Just wondering if there's a pre-compiled DIAMOND db for Archaea in refseq? Maybe it's just a matter of downloading the entire refseq db and running diamond makedb? I'm working with a microbial community with both bacteria and archaea, and wondering about the best approach. I guess the UniRef databases (link) can work as well?
Thanks!!