RefSeq FTP URL is not at right level

taltman commented 9 years ago

Docs should point to this URL:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

This should go before the docs explaining how to extract the files.

gwilymh commented 9 years ago

There are no files called refseq_protein.[01-N].tar.gz, as described in the FAQ under question 16: "Which reference protein databases should I use? Where do I get these databases from?"

Which protein files are recommended for use?

nielshanson commented 9 years ago

Hi there,

It looks like some of the naming conventions have changed since that writeup was made:

ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/

The complete.xxx.protein.faa.gz files represent the complete set of genomic proteins at the NCBI, while the complete.nonredundant_protein.xxx.protein.faa.gz files are a reduced “non-redundant” set, commonly referred to as “NR”. Read the README files on the NCBI FTP site for more information about the release.

There are a lot of files, so I recommend downloading them with ftp from the command line: http://tecadmin.net/download-upload-files-using-ftp-command-line/

The following commands should suffice (use mget to download; put would upload):

ftp ftp://ftp.ncbi.nlm.nih.gov/refseq/release/complete/
mget complete.*.protein.faa.gz
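If the interactive ftp client is awkward, the same download can be scripted. Below is a minimal sketch using Python's standard ftplib, assuming the server still accepts anonymous FTP and that the file names follow the pattern described above. The function names and the include_nonredundant switch are hypothetical and are not part of MetaPathways.

```python
from fnmatch import fnmatchcase
from ftplib import FTP
from pathlib import Path

HOST = "ftp.ncbi.nlm.nih.gov"
REMOTE_DIR = "/refseq/release/complete"
PATTERN = "complete.*.protein.faa.gz"
NR_PREFIX = "complete.nonredundant_protein."


def select_protein_files(names, include_nonredundant=False):
    """Return the sorted file names that match the protein FASTA pattern.

    The wildcard pattern also matches the nonredundant files, so they are
    filtered out by prefix unless include_nonredundant is True.
    """
    selected = []
    for name in names:
        if not fnmatchcase(name, PATTERN):
            continue
        if name.startswith(NR_PREFIX) and not include_nonredundant:
            continue
        selected.append(name)
    return sorted(selected)


def download_protein_files(dest_dir, include_nonredundant=False):
    """Download the matching files into dest_dir and return their names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()  # anonymous login (assumed to be allowed)
        ftp.cwd(REMOTE_DIR)
        names = select_protein_files(ftp.nlst(), include_nonredundant)
        for name in names:
            with open(dest / name, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
    return names
```

Calling download_protein_files("refseq_downloads") would fetch the complete protein set into that directory. The release is large, so expect the transfer to take a while.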

MetaPathways is compatible with both of these databases, but we require that you name your final input file of sequences with “refseq” somewhere in the name so that MetaPathways knows to extract taxonomy information as well as functional information.
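One way to produce that final input file is to decompress the downloaded pieces and concatenate them into a single FASTA file. The sketch below assumes the pieces are already in a local directory and that a lowercase "refseq" in the file name is what MetaPathways looks for. The function name and the output name refseq_protein.faa are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def build_refseq_fasta(source_dir, output_path, pattern="complete.*.protein.faa.gz"):
    """Concatenate gzipped FASTA pieces into one uncompressed FASTA file.

    Returns the number of pieces that were merged. Note that the default
    pattern also matches the nonredundant files if they are in source_dir.
    """
    output = Path(output_path)
    # Assumption: MetaPathways keys on "refseq" appearing in the file name.
    if "refseq" not in output.name:
        raise ValueError("output file name should contain 'refseq'")
    parts = sorted(Path(source_dir).glob(pattern))
    with open(output, "wb") as out:
        for part in parts:
            # Stream each piece so large files are not loaded into memory.
            with gzip.open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)
```

For example, build_refseq_fasta("refseq_downloads", "refseq_protein.faa") would merge every downloaded piece into refseq_protein.faa in the current directory.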

Niels


hallamlab / metapathways2

RefSeq FTP URL is not at right level #59