Open AlohaPropolis opened 1 year ago
Hi, this works fine for me (see below). Most of the steps also seem to work for you. Could you check the permissions on the queries directory, and whether a file named P01349.fsa already exists there?
You do not need a cloud account to run ncbi/blast docker.
Tom
madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ docker run --rm ncbi/blast efetch -db protein -format fasta -id P01349 > queries/P01349.fsa
madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ ls -l queries/
total 4
-rw-rw-r-- 1 madden madden 173 May 8 14:17 P01349.fsa
madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ ls -ld queries/
drwxrwxr-x 2 madden madden 4096 May 8 14:17 queries/
I can use the image to create the relative directories and download the example database, but I cannot download the query file.

$ docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
bash: queries/P01349.fsa: Permission denied

If the problem is related to not having a cloud account, then my question is: why must I have a paid account in order to use the ncbi/blast docker image?

Thanks
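A note on the failing command: the `> queries/P01349.fsa` redirection is performed by the host shell, not inside the container, so the `Permission denied` message refers to the `queries` directory on the host. Below is a minimal sketch for checking that directory, assuming a POSIX shell. The function name `check_query_dir` is made up for illustration and is not part of the ncbi/blast image.

```shell
# Hypothetical helper: report whether the current host user can create
# files in the directory that the shell redirection writes to.
check_query_dir() {
    dir="${1:-queries}"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
    elif [ -w "$dir" ] && [ -x "$dir" ]; then
        echo "writable: $dir"
    else
        echo "not writable: $dir"
        # Show owner and mode; a root-owned directory is one possible cause.
        ls -ld "$dir"
    fi
}

check_query_dir queries
```

If the directory turns out to be owned by root (an assumption; this can happen when Docker itself creates a bind-mount source directory that did not exist yet), changing its owner with `sudo chown "$(id -u):$(id -g)" queries` is one possible fix.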
However, the following works the way I like!
docker ps
CONTAINER ID   IMAGE          COMMAND   CREATED         STATUS         PORTS     NAMES
6437c2fc9c4f   4afb1f585a96   "bash"    4 minutes ago   Up 4 minutes             friendly_swartz
docker exec -it friendly_swartz bash
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  lib
root@6437c2fc9c4f:/blast# update_blastdb.pl --showall pretty --source gcp
Connected to GCP
BLASTDB DESCRIPTION SIZE (GB) LAST_UPDATED
swissprot Non-redundant UniProtKB/SwissProt sequences 0.3573 2023-04-29
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects 364.0284 2023-04-27
refseq_protein NCBI Protein Reference Sequences 144.4754 2023-05-05
landmark Landmark database for SmartBLAST 0.3817 2023-04-25
pdbaa PDB protein database 0.1951 2023-04-29
nt Nucleotide collection (nt) 303.7546 2023-04-30
pdbnt PDB nucleotide database 0.0143 2023-04-23
patnt Nucleotide sequences derived from the Patent division of GenBank 15.7333 2023-04-28
refseq_rna NCBI Transcript Reference Sequences 46.6038 2023-05-01
ref_prok_rep_genomes Refseq prokaryote representative genomes (contains refseq assembly) 19.6809 2023-04-29
ref_viruses_rep_genomes Refseq viruses representative genomes 0.1320 2023-04-29
ref_viroids_rep_genomes Refseq viroids representative genomes 0.0001 2022-06-25
ref_euk_rep_genomes RefSeq Eukaryotic Representative Genome Database 350.4509 2023-04-13
split-cdd CDD split into 32 volumes 4.5709 2022-12-18
cdd CDD.v3.20 3.7088 2022-09-21
GCF_000001405.39_top_level Homo sapiens GRCh38.p13 [GCF_000001405.39] chromosomes plus unplaced and unlocalized scaffolds 1.1572 2021-06-02
GCF_000001635.27_top_level Mus musculus GRCm39 [GCF_000001635.27] chromosomes plus unplaced and unlocalized scaffolds 3.6543 2021-06-02
16S_ribosomal_RNA 16S ribosomal RNA (Bacteria and Archaea type strains) 0.0179 2023-04-15
18S_fungal_sequences 18S ribosomal RNA sequences (SSU) from Fungi type and reference material 0.0023 2023-05-04
28S_fungal_sequences 28S ribosomal RNA sequences (LSU) from Fungi type and reference material 0.0053 2023-05-04
ITS_RefSeq_Fungi Internal transcribed spacer region (ITS) from Fungi type and reference material 0.0067 2022-10-28
ITS_eukaryote_sequences ITS eukaryote BLAST 0.0331 2023-05-01
env_nt environmental samples 48.8039 2023-04-05
Betacoronavirus Betacoronavirus 54.0961 2023-05-06
pataa Protein sequences derived from the Patent division of GenBank 1.8011 2023-04-30
refseq_select_prot RefSeq Select proteins 34.3461 2023-04-30
refseq_select_rna RefSeq Select RNA sequences 0.0656 2023-04-30
env_nr Proteins from WGS metagenomic projects (env_nr). 3.9459 2023-04-30
LSU_eukaryote_rRNA Large subunit ribosomal nucleic acid for Eukaryotes 0.0053 2022-12-05
LSU_prokaryote_rRNA Large subunit ribosomal nucleic acid for Prokaryotes 0.0041 2022-12-05
SSU_eukaryote_rRNA Small subunit ribosomal nucleic acid for Eukaryotes 0.0063 2022-12-05
mito NCBI Genomic Mitochondrial Reference Sequences 0.1252 2023-04-20
tsa_nr Transcriptome Shotgun Assembly (TSA) sequences 5.1253 2023-04-30
tsa_nt Transcriptome Shotgun Assembly (TSA) sequences 6.3491 2023-04-27
nt_euk Eukaryota nt 197.6799 2023-04-26
nt_prok Prokaryota (bacteria and archaea) nt 51.1781 2023-05-01
nt_viruses Viruses nt 51.2685 2023-05-01
nt_others Artificial and other seqs nt 0.7473 2023-05-01
taxdb Taxonomy database 0.1670 2021-06-07
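Since the listed databases range from well under 1 GB to several hundred GB, it can help to filter the listing by size before downloading anything. The sketch below assumes each data row is whitespace-separated, with the size in GB as the second-to-last field and the date as the last field, which is how the rows in this listing look. The function name `small_dbs` is hypothetical, and the sample rows are copied from the listing.

```shell
# Hypothetical helper: print name and size of databases below a size limit (GB).
# Assumes the size is the second-to-last whitespace-separated field of each row.
small_dbs() {
    limit="${1:-1}"
    awk -v limit="$limit" '
        # Skip header and status lines: the size field must be a plain number.
        NF >= 3 && $(NF-1) ~ /^[0-9]+(\.[0-9]+)?$/ && $(NF-1) + 0 < limit + 0 {
            print $1, $(NF-1)
        }'
}

small_dbs 1 <<'EOF'
BLASTDB DESCRIPTION SIZE (GB) LAST_UPDATED
swissprot Non-redundant UniProtKB/SwissProt sequences 0.3573 2023-04-29
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects 364.0284 2023-04-27
pdbaa PDB protein database 0.1951 2023-04-29
taxdb Taxonomy database 0.1670 2021-06-07
EOF
```

Inside the container, the same function could read the live listing instead: `update_blastdb.pl --showall pretty --source gcp | small_dbs 1`.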
root@6437c2fc9c4f:/blast# efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  fasta  lib  queries  results
root@6437c2fc9c4f:/blast# cd queries
root@6437c2fc9c4f:/blast/queries# ls
P01349.fsa
root@6437c2fc9c4f:/blast/queries# cd ..
root@6437c2fc9c4f:/blast# efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  fasta  lib  queries  results
root@6437c2fc9c4f:/blast# makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

Building a new DB, current time: 05/07/2023 05:51:28
New DB name:   /blast/nurse-shark-proteins
New DB title:  Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.000834227 seconds.

root@6437c2fc9c4f:/blast# blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
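As a quick sanity check on the new database, the per-sequence lengths printed by blastdbcmd can be added up and compared with the totals that blastp reports. A minimal sketch, assuming the `%a %l %T` output format used above (accession, length, taxid); the function name `summarize_db` is hypothetical, and the rows are copied from the output above.

```shell
# Hypothetical helper: count sequences and sum the length column (field 2).
summarize_db() {
    awk '{ n++; total += $2 } END { print n " sequences; " total " total letters" }'
}

summarize_db <<'EOF'
Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
EOF
```

For these rows the result is 7 sequences and 922 total letters, which agrees with the "Database:" line in the blastp report. Inside the container the same check could be run directly as `blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T" | summarize_db`.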
root@6437c2fc9c4f:/blast# blastdbcmd -list /blast/blastdb -remove_redundant_dbs ##### this does not work
root@6437c2fc9c4f:/blast# blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out
root@6437c2fc9c4f:/blast# cd results
root@6437c2fc9c4f:/blast/results# ls
blastp.out
root@6437c2fc9c4f:/blast/results# cat blastp.out
BLASTP 2.14.0+
Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.
Database: Nurse shark proteins
           7 sequences; 922 total letters

Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName: Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

Length=44
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName...  14.2    0.96

 Score = 14.2 bits (25),  Expect = 0.96, Method: Compositional matrix adjust.
 Identities = 3/9 (33%), Positives = 6/9 (67%), Gaps = 0/9 (0%)

Query  2    LCGRGFIRA  10
            +C R ++R
Sbjct  123  VCTREYVRE  131

Lambda      K        H        a         alpha
   0.334    0.143    0.520    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6
Effective search space used: 22680
  Database: Nurse shark proteins
    Posted date:  May 7, 2023  5:51 AM
  Number of letters in database: 922
  Number of sequences in database:  7

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40
root@6437c2fc9c4f:/blast/results#
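The "Effective search space used: 22680" line in the report can be cross-checked against the other numbers it prints. The sketch below assumes the usual form (m - l) * (n - N * l), where m is the query length (44), n the number of letters in the database (922), N the number of sequences (7), and l a length adjustment that the report does not print. It simply searches for an l that reproduces the reported value; the function name `find_length_adjustment` is hypothetical.

```shell
# Hypothetical helper: find a length adjustment l (0 <= l < m) such that
# (m - l) * (n - N * l) equals the reported effective search space.
find_length_adjustment() {
    m="$1"; n="$2"; N="$3"; target="$4"
    l=0
    while [ "$l" -lt "$m" ]; do
        if [ $(( (m - l) * (n - N * l) )) -eq "$target" ]; then
            echo "$l"
            return 0
        fi
        l=$(( l + 1 ))
    done
    return 1   # no adjustment in range reproduces the target
}

# query length, database letters, database sequences, reported search space
find_length_adjustment 44 922 7 22680
```

With the numbers from this report the search finds l = 16, since (44 - 16) * (922 - 7 * 16) = 28 * 810 = 22680.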