transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

How does the RefSeq database supplied with the samsa2 scripts differ from the one available online? #47

Closed: mradz19 closed 2 years ago

mradz19 commented 4 years ago

I have run the pipeline with two RefSeq databases. One is the database downloaded by the samsa2 scripts: https://bioshare.bioinformatics.ucdavis.edu/bioshare/download/2c8s521xj9907hn/RefSeq_bac.fa

The other I downloaded from: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/bacteria/

I get different results with the two databases, and I noticed that the one from NCBI is twice the size. What is the difference between the two?

transcript commented 4 years ago

Hi Michael,

When you download from NCBI, are you downloading the non-redundant proteins database, or the full bacteria one? The one provided with SAMSA2 is the non-redundant proteins.

Best, Sam

-- Sam Westreich Microbiome Scientist, DNAnexus, http://www.mosaicbiome.com

mradz19 commented 4 years ago

Hi @transcript

I downloaded the entire database but used only the non-redundant .faa files to make the DIAMOND database.

Just curious: how old is the one provided with SAMSA2, and is it updated regularly?
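For anyone wanting to reproduce this step, here is a minimal sketch of building a DIAMOND database from only the non-redundant protein files. It is an illustration, not the SAMSA2 authors' procedure. It assumes the downloaded files follow the pattern `*nonredundant_protein*.faa.gz` (check the actual names in your download), that `diamond` is on the PATH, and the directory and output names in the usage note are placeholders. The helper names are made up for this example.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def find_nonredundant_faa(release_dir):
    """List files matching the ASSUMED naming pattern for non-redundant proteins."""
    return sorted(Path(release_dir).glob("*nonredundant_protein*.faa.gz"))


def concatenate_fasta(gz_files, out_path):
    """Decompress each .faa.gz file and append it to one plain FASTA file."""
    with open(out_path, "wb") as out:
        for gz_file in gz_files:
            with gzip.open(gz_file, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(out_path)


def makedb_command(fasta_path, db_prefix):
    """Build the argument list for `diamond makedb` (input FASTA, output prefix)."""
    return ["diamond", "makedb", "--in", str(fasta_path), "--db", str(db_prefix)]


def build_database(release_dir, fasta_out, db_prefix):
    """Concatenate the matching files, then run DIAMOND on the result."""
    files = find_nonredundant_faa(release_dir)
    if not files:
        raise FileNotFoundError(f"no non-redundant .faa.gz files in {release_dir}")
    fasta = concatenate_fasta(files, fasta_out)
    # check=True raises an error if diamond exits with a non-zero status.
    subprocess.run(makedb_command(fasta, db_prefix), check=True)
```

Calling `build_database("refseq_bacteria", "RefSeq_bac_new.fa", "RefSeq_bac_new")` would write the combined FASTA and then the DIAMOND database next to it. Keeping the combined FASTA on disk makes it easy to compare its size and contents with the file shipped with SAMSA2.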

transcript commented 4 years ago

Hi Michael,

Interesting - I know that the version distributed with SAMSA2 is from 2017 (as that's when we released it), but I suspect it's not being updated regularly. I should probably write up instructions for users to do so, in case I can't consistently update.

mradz19 commented 4 years ago

Hi Sam,

It's weird: I definitely used only the non-redundant files, yet the SAMSA2 .fa file is 28 GB whereas the downloaded file is 58 GB.
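Size on disk alone does not say whether the larger file holds more sequences, longer sequences, or duplicates, so one way to narrow it down is to compare the two FASTA files directly. The sketch below is illustrative only and not part of SAMSA2. The function names are made up for this example, and it assumes the first whitespace-delimited token of each header is the accession. Note that `compare_ids` keeps every ID in memory, which may be too much for files of this size without sampling or splitting.

```python
import gzip
from pathlib import Path


def open_fasta(path):
    """Open a FASTA file as text, handling .gz files transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def fasta_stats(path):
    """Return (record_count, total_residues), streaming the file line by line."""
    records = 0
    residues = 0
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                records += 1
            else:
                residues += len(line.strip())
    return records, residues


def header_ids(path):
    """Collect the first token of each header line (assumed to be the accession)."""
    ids = set()
    with open_fasta(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def compare_ids(path_a, path_b):
    """Count IDs shared between two FASTA files and IDs unique to each."""
    ids_a = header_ids(path_a)
    ids_b = header_ids(path_b)
    return {
        "shared": len(ids_a & ids_b),
        "only_a": len(ids_a - ids_b),
        "only_b": len(ids_b - ids_a),
    }
```

Running `fasta_stats` on both files shows whether the 58 GB file simply contains about twice as many records, and `compare_ids` shows how many accessions in the 2017 file still exist in the current release.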