Closed taltman closed 7 years ago
There are no files called refseq_protein.[01-N].tar.gz, as described in the FAQ under question 16: Which reference protein databases should I use? Where do I get these databases from?
Which protein files are recommended for use?
Hi there,
It looks like some of the naming conventions have changed since that write-up was made:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
The files named complete.xxx.protein.faa.gz represent the complete set of genomic proteins at the NCBI, while the complete.nonredundant_protein.xxx.protein.faa.gz files are a reduced “non-redundant” set, commonly referred to as “NR”. Read the README files on the NCBI FTP site for more information about the release.
There are a lot of files, so I recommend downloading them using ftp from the command line: http://tecadmin.net/download-upload-files-using-ftp-command-line/
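The following commands should suffice. Log in as anonymous when prompted; binary sets binary transfer mode, and prompt turns off the per-file confirmation for mget. The pattern also matches the complete.nonredundant_protein files.
ftp ftp.ncbi.nlm.nih.gov
cd refseq/release/complete
binary
prompt
mget complete.*.protein.faa.gz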
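If scripting the download is preferable to an interactive ftp session, the same idea can be sketched in Python with the standard-library ftplib module. This is only a rough sketch: the function names select_protein_files and download_protein_files are hypothetical, it assumes the server still accepts anonymous FTP logins, and it assumes the file naming described above. By default it skips the non-redundant set.

```python
import fnmatch
import os
from ftplib import FTP

HOST = "ftp.ncbi.nlm.nih.gov"
REMOTE_DIR = "refseq/release/complete"
PATTERN = "complete.*.protein.faa.gz"
NR_PREFIX = "complete.nonredundant_protein."


def select_protein_files(names, include_nonredundant=False):
    """Return the sorted protein FASTA archives from a directory listing.

    Hypothetical helper: the glob pattern also matches the non-redundant
    files, so those are filtered out unless explicitly requested.
    """
    selected = []
    for name in sorted(names):
        if not fnmatch.fnmatchcase(name, PATTERN):
            continue
        if name.startswith(NR_PREFIX) and not include_nonredundant:
            continue
        selected.append(name)
    return selected


def download_protein_files(dest_dir=".", include_nonredundant=False):
    """Download the selected files into dest_dir and return their names.

    Assumes anonymous FTP access to the NCBI server is available.
    """
    with FTP(HOST) as ftp:
        ftp.login()  # no arguments means an anonymous login
        ftp.cwd(REMOTE_DIR)
        names = select_protein_files(ftp.nlst(), include_nonredundant)
        for name in names:
            with open(os.path.join(dest_dir, name), "wb") as out:
                ftp.retrbinary("RETR " + name, out.write)
    return names


# Example call (performs the actual download, which is large):
# download_protein_files(dest_dir="refseq_download")
```

The filtering is kept in its own function so it can be checked against a listing without touching the network.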
MetaPathways is compatible with both of these databases, but we require that the name of your final input file of sequences contains “refseq” somewhere, so that MetaPathways knows to extract taxonomy information as well as functional information.
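As a rough illustration of the naming requirement, the downloaded archives could be decompressed and concatenated into a single FASTA file whose name contains “refseq”. The function name build_refseq_fasta and the output name refseq_protein.faa are hypothetical, and checking for the lowercase string is an assumption about how MetaPathways matches the name.

```python
import glob
import gzip
import os
import shutil


def build_refseq_fasta(src_dir, out_path, pattern="complete.*.protein.faa.gz"):
    """Concatenate gzipped FASTA parts from src_dir into one plain FASTA file.

    Hypothetical helper. It refuses an output name without "refseq",
    on the assumption that the match is on the lowercase string.
    """
    if "refseq" not in os.path.basename(out_path):
        raise ValueError("output file name must contain 'refseq'")
    parts = sorted(glob.glob(os.path.join(src_dir, pattern)))
    with open(out_path, "wb") as out:
        for part in parts:
            # Stream each archive so large files are never held in memory.
            with gzip.open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


# Example call, assuming the archives are in the current directory:
# build_refseq_fasta(".", "refseq_protein.faa")
```

To build the non-redundant set instead, pass pattern="complete.nonredundant_protein.*.protein.faa.gz".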
Niels
On Aug 24, 2015, at 11:22 AM, Gwilym Haynes wrote the question above.
Docs should point to this URL:
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
This should go before the docs explaining how to extract the files.