Working with Archaea - Githubissues

transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.

GNU General Public License v3.0


Working with Archaea #59

Closed gaofeng-ni closed 3 years ago

gaofeng-ni commented 3 years ago

Hi there,

Just wondering if there's a pre-compiled DIAMOND database for Archaea in RefSeq? Or is it just a matter of downloading the entire RefSeq database and running diamond makedb?

I'm working with a microbial community that contains both bacteria and archaea, and I'm wondering about the best approach. I guess the UniRef databases could work as well?

Thanks!!

transcript commented 3 years ago

Hello, we don't currently have a precompiled DIAMOND database for Archaea, but it should be relatively straightforward to assemble. You would download the appropriate protein FASTA files (from ftp://ftp.ncbi.nlm.nih.gov/refseq/release/archaea), concatenate them with cat, and then run the diamond makedb command on the combined file.

You could even combine Archaea and Bacteria by combining the files from both sources using the above method.
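For anyone who would rather script the concatenation step than use cat, here is a minimal Python sketch. It is not part of SAMSA2: the function names `concat_fasta` and `makedb_command` are made up for illustration, and it assumes the protein FASTA files have already been downloaded locally (gzipped or plain). The only DIAMOND options used are `--in` and `-d`, as in the makedb command quoted in this thread.

```python
import gzip
from pathlib import Path


def concat_fasta(inputs, output):
    """Concatenate plain or gzipped FASTA files into one plain-text FASTA file."""
    with open(output, "wb") as out:
        for path in map(Path, inputs):
            # Hypothetical convention: files ending in .gz are gzip-compressed.
            opener = gzip.open if path.suffix == ".gz" else open
            last = b"\n"
            with opener(path, "rb") as src:
                for line in src:
                    out.write(line)
                    last = line
            # Guard against a file whose final line lacks a newline,
            # which would glue it to the next file's first header.
            if not last.endswith(b"\n"):
                out.write(b"\n")
    return output


def makedb_command(fasta, db_name):
    """Build the diamond makedb argument list; this function does not run it."""
    return ["diamond", "makedb", "--in", str(fasta), "-d", db_name]
```

To combine Archaea and Bacteria, pass the files from both download directories to `concat_fasta` in one list, then hand the result of `makedb_command` to `subprocess.run(..., check=True)` on a machine where DIAMOND is installed.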

One challenge with the UniRef databases is that the description line uses a different format, so the downstream Python scripts won't necessarily parse it correctly. A possible enhancement would be a second set of scripts that can parse the UniRef headers.

gaofeng-ni commented 3 years ago

Hi again,

Thanks for the tips. Just a follow-up: I previously downloaded the NCBI nr database from https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ and built a DIAMOND database from the downloaded nr.gz file.

diamond makedb --in nr.gz -d nr

Can this nr.dmnd file work directly with SAMSA2? I believe nr is part of the RefSeq framework, right? https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/

Many thanks, Gaofeng

transcript commented 3 years ago

Hi Gaofeng,

Yes, this file should work if the sequence header lines follow the same format, which looks like:

>WP_057199767.1 MULTISPECIES: bifunctional aconitate hydratase 2/2-methylisocitrate dehydratase [Acidovorax]

This corresponds to: >sequence_ID function_of_sequence [organism]
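One way to check whether a custom database's headers follow this layout before running the pipeline is a small parser like the sketch below. This is not SAMSA2's own parsing code. `parse_header` is a hypothetical helper, and the regular expression is an assumption that takes the last bracketed field as the organism.

```python
import re

# Assumed layout: >sequence_ID function_of_sequence [organism]
# The greedy middle group means the final [...] is taken as the organism,
# so brackets that appear inside the function text are tolerated.
HEADER_RE = re.compile(r"^>?(\S+)\s+(.*)\s+\[([^\[\]]+)\]\s*$")


def parse_header(line):
    """Return (sequence_id, function, organism), or None if the line does not fit."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


example = (">WP_057199767.1 MULTISPECIES: bifunctional aconitate hydratase 2/"
           "2-methylisocitrate dehydratase [Acidovorax]")
print(parse_header(example))
```

A header that returns `None` here, such as a UniRef-style line that ends in `Tax=... TaxID=... RepID=...` fields rather than a bracketed organism, is a hint that the downstream scripts may not parse that database as expected.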