sandialabs / TIGER

Target / Integrative Genetic Element Retriever: precisely maps IGEs (a defined type of genomic island) in bacterial and archaeal genomes; package also includes orthogonal program Islander

refseq_genomic deprecated #2

Closed lxsteiner closed 4 years ago

lxsteiner commented 4 years ago

From the README (https://github.com/sandialabs/TIGER/blob/master/README.md#installation):

"User should have a reference genome blast database available, such as refseq_genomic"

However, refseq_genomic has been deprecated with the current database release:

"As of February 4, 2020, the BLAST databases on the FTP site are version 5 (v5)"

The current ones are:

+-----------------------------+------------------------------------------------+
 File Name                    | Content Description                           
+-----------------------------+------------------------------------------------+
README                        | README for this subdirectory (this file)
nr.*tar.gz                    | Non-redundant protein sequences from GenPept, 
                                Swissprot, PIR, PDF, PDB, and NCBI RefSeq
nt.*tar.gz                    | Partially non-redundant nucleotide sequences from 
                                all traditional divisions of GenBank, EMBL, and DDBJ 
                                excluding GSS,STS, PAT, EST, HTG, and WGS.
landmark.tar.gz               | Proteome of 27 model organisms, see 
                                https://blast.ncbi.nlm.nih.gov/smartblast/smartBlast.cgi?CMD=Web&PAGE_TYPE=BlastDocs#searchSets
16S_ribosomal_RNA             | 16S ribosomal RNA (Bacteria and Archaea type strains)
18S_fungal_sequences.tar.gz   | 18S ribosomal RNA sequences (SSU) from Fungi type and reference material (BioProject PRJNA39195) 
28S_fungal_sequences.tar.gz   | 28S ribosomal RNA sequences (LSU) from Fungi type and reference material (BioProject PRJNA51803)
ITS_RefSeq_Fungi.tar.gz       | Internal transcribed spacer region (ITS) from Fungi type and reference material (BioProject PRJNA177353)
ITS_eukaryote_sequences.tar.gz| Internal transcribed spacer region (ITS) for eukaryotic sequences
LSU_eukaryote_rRNA.tar.gz     | Large subunit ribosomal RNA sequences for eukaryotic sequences
LSU_prokaryote_rRNA.tar.gz    | Large subunit ribosomal RNA sequences for prokaryotic sequences
SSU_eukaryote_rRNA.tar.gz     | Small subunit ribosomal RNA sequences for eukaryotic sequences
ref_euk_rep_genomes*tar.gz    | Refseq Representative Eukaryotic genomes (1000+ organisms)
ref_prok_rep_genomes*tar.gz   | Refseq Representative Prokaryotic genomes (5700+ organisms)
ref_viruses_rep_genomes*tar.gz   | Refseq Representative Virus genomes (9000+ organisms)
ref_viroids_rep_genomes*tar.gz   | Refseq Representative Viroid genomes (46 organisms)
refseq_protein.*tar.gz        | NCBI protein reference sequences
refseq_rna.*tar.gz            | NCBI Transcript reference sequences
swissprot.tar.gz              | Swiss-Prot sequence database (last major update)
pataa.*tar.gz                 | Patent protein sequences
patnt.*tar.gz                 | Patent nucleotide sequences. Both patent databases
                                are directly from the USPTO, or from the EPO/JPO 
                                via EMBL/DDBJ
pdbaa.*tar.gz                 | Sequences for the protein structure from the 
                                Protein Data Bank
pdbnt.*tar.gz                 | Sequences for the nucleotide structure from the 
                                Protein Data Bank. They are NOT the protein coding
                                sequences for the corresponding pdbaa entries.
taxdb.tar.gz                  | Additional taxonomy information for the databases 
                                listed here  providing common and scientific names
FASTA/                        | Subdirectory for FASTA formatted sequences
v4/                           | BLAST databases in version 4 (v4).  These files are no
                                longer being updated.
cloud/                        | Subdirectory of databases for BLAST AMI; see
                                http://1.usa.gov/TJAnEt
+-----------------------------+------------------------------------------------+

Which would you recommend instead? Is viroids + viruses + prok enough, or do we need to include eukaryotes as well?

Thanks, Leon

kpwilliams commented 4 years ago

This is the closest: ref_prok_rep_genomes*tar.gz. I don't like that it's only "5700+" organisms; ~50000 would be better. Hopefully, if you're working with a well-sequenced genus (e.g. a pathogen), there will be ~100 genomes per genus. But the eukaryote and virus databases won't help.
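
For anyone landing here later, a minimal sketch of fetching that database might look like the following. It assumes BLAST+ is installed and that its `update_blastdb.pl` script is on the PATH. The helper names `build_download_command` and `download` are hypothetical and not part of TIGER. How the resulting database path is then passed to TIGER is not covered here; check the README for that.

```python
import shutil
import subprocess

# Database name taken from the NCBI listing quoted earlier in this thread.
DB_NAME = "ref_prok_rep_genomes"


def build_download_command(db_name=DB_NAME, script="update_blastdb.pl"):
    """Return the argv list for fetching a preformatted BLAST database.

    Hypothetical helper: it only assembles the command and runs nothing.
    """
    # --decompress unpacks the downloaded tar.gz volumes after download.
    return [script, "--decompress", db_name]


def download(db_name=DB_NAME, dest_dir="."):
    """Run update_blastdb.pl inside dest_dir (assumes BLAST+ is installed)."""
    cmd = build_download_command(db_name)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH; install BLAST+ first")
    # check=True raises CalledProcessError if the download fails.
    subprocess.run(cmd, cwd=dest_dir, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(build_download_command()))
```

Calling `download()` from the directory that should hold the database performs the actual fetch. The download is large, so the sketch only prints the command by default.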