ncbi / blast_plus_docs


Using the docker image locally without a cloud account. #29

Open AlohaPropolis opened 1 year ago

AlohaPropolis commented 1 year ago

I can use the image to create the relative directories and download the example database, but I cannot download the query db.

$ docker run --rm ncbi/blast efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
bash: queries/P01349.fsa: Permission denied

If the problem is related to not having a cloud account, then my question is: why must I have a paid account in order to use the ncbi/blast docker image? Thanks

However the following works the way I like!

docker ps
CONTAINER ID   IMAGE          COMMAND   CREATED         STATUS         PORTS     NAMES
6437c2fc9c4f   4afb1f585a96   "bash"    4 minutes ago   Up 4 minutes             friendly_swartz

docker exec -it friendly_swartz bash
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  lib
root@6437c2fc9c4f:/blast# update_blastdb.pl --showall pretty --source gcp

Connected to GCP
BLASTDB                     DESCRIPTION  SIZE (GB)  LAST_UPDATED
swissprot                   Non-redundant UniProtKB/SwissProt sequences  0.3573  2023-04-29
nr                          All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects  364.0284  2023-04-27
refseq_protein              NCBI Protein Reference Sequences  144.4754  2023-05-05
landmark                    Landmark database for SmartBLAST  0.3817  2023-04-25
pdbaa                       PDB protein database  0.1951  2023-04-29
nt                          Nucleotide collection (nt)  303.7546  2023-04-30
pdbnt                       PDB nucleotide database  0.0143  2023-04-23
patnt                       Nucleotide sequences derived from the Patent division of GenBank  15.7333  2023-04-28
refseq_rna                  NCBI Transcript Reference Sequences  46.6038  2023-05-01
ref_prok_rep_genomes        Refseq prokaryote representative genomes (contains refseq assembly)  19.6809  2023-04-29
ref_viruses_rep_genomes     Refseq viruses representative genomes  0.1320  2023-04-29
ref_viroids_rep_genomes     Refseq viroids representative genomes  0.0001  2022-06-25
ref_euk_rep_genomes         RefSeq Eukaryotic Representative Genome Database  350.4509  2023-04-13
split-cdd                   CDD split into 32 volumes  4.5709  2022-12-18
cdd                         CDD.v3.20  3.7088  2022-09-21
GCF_000001405.39_top_level  Homo sapiens GRCh38.p13 [GCF_000001405.39] chromosomes plus unplaced and unlocalized scaffolds  1.1572  2021-06-02
GCF_000001635.27_top_level  Mus musculus GRCm39 [GCF_000001635.27] chromosomes plus unplaced and unlocalized scaffolds  3.6543  2021-06-02
16S_ribosomal_RNA           16S ribosomal RNA (Bacteria and Archaea type strains)  0.0179  2023-04-15
18S_fungal_sequences        18S ribosomal RNA sequences (SSU) from Fungi type and reference material  0.0023  2023-05-04
28S_fungal_sequences        28S ribosomal RNA sequences (LSU) from Fungi type and reference material  0.0053  2023-05-04
ITS_RefSeq_Fungi            Internal transcribed spacer region (ITS) from Fungi type and reference material  0.0067  2022-10-28
ITS_eukaryote_sequences     ITS eukaryote BLAST  0.0331  2023-05-01
env_nt                      environmental samples  48.8039  2023-04-05
Betacoronavirus             Betacoronavirus  54.0961  2023-05-06
pataa                       Protein sequences derived from the Patent division of GenBank  1.8011  2023-04-30
refseq_select_prot          RefSeq Select proteins  34.3461  2023-04-30
refseq_select_rna           RefSeq Select RNA sequences  0.0656  2023-04-30
env_nr                      Proteins from WGS metagenomic projects (env_nr).  3.9459  2023-04-30
LSU_eukaryote_rRNA          Large subunit ribosomal nucleic acid for Eukaryotes  0.0053  2022-12-05
LSU_prokaryote_rRNA         Large subunit ribosomal nucleic acid for Prokaryotes  0.0041  2022-12-05
SSU_eukaryote_rRNA          Small subunit ribosomal nucleic acid for Eukaryotes  0.0063  2022-12-05
mito                        NCBI Genomic Mitochondrial Reference Sequences  0.1252  2023-04-20
tsa_nr                      Transcriptome Shotgun Assembly (TSA) sequences  5.1253  2023-04-30
tsa_nt                      Transcriptome Shotgun Assembly (TSA) sequences  6.3491  2023-04-27
nt_euk                      Eukaryota nt  197.6799  2023-04-26
nt_prok                     Prokaryota (bacteria and archaea) nt  51.1781  2023-05-01
nt_viruses                  Viruses nt  51.2685  2023-05-01
nt_others                   Artificial and other seqs nt  0.7473  2023-05-01
taxdb                       Taxonomy database  0.1670  2021-06-07

root@6437c2fc9c4f:/blast# efetch -db protein -format fasta \
    -id P01349 > queries/P01349.fsa
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  fasta  lib  queries  results
root@6437c2fc9c4f:/blast# cd queries
root@6437c2fc9c4f:/blast/queries# ls
P01349.fsa
root@6437c2fc9c4f:/blast/queries# cd ..

root@6437c2fc9c4f:/blast# efetch -db protein -format fasta \
    -id Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950 \
    > fasta/nurse-shark-proteins.fsa
root@6437c2fc9c4f:/blast# ls
bin  blastdb  blastdb_custom  fasta  lib  queries  results

root@6437c2fc9c4f:/blast# makeblastdb -in /blast/fasta/nurse-shark-proteins.fsa -dbtype prot \
    -parse_seqids -out nurse-shark-proteins -title "Nurse shark proteins" \
    -taxid 7801 -blastdb_version 5

Building a new DB, current time: 05/07/2023 05:51:28
New DB name:   /blast/nurse-shark-proteins
New DB title:  Nurse shark proteins
Sequence type: Protein
Keep MBits: T
Maximum file size: 3000000000B
Adding sequences from FASTA; added 7 sequences in 0.000834227 seconds.

root@6437c2fc9c4f:/blast# blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a %l %T"
Q90523.1 106 7801
P80049.1 132 7801
P83981.1 53 7801
P83977.1 95 7801
P83984.1 190 7801
P83985.1 195 7801
P27950.1 151 7801
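Note that the listing above has seven entries, while the efetch call requested nine accessions. A small sketch for spotting which requested accessions are absent from such a listing, assuming the input has one accession.version per line (as blastdbcmd prints with -outfmt "%a"). The name missing_accessions is a hypothetical helper, not a BLAST+ tool.

```shell
# Hypothetical helper: print requested accessions that do not appear in a listing.
#   $1    comma-separated accessions, as passed to efetch -id
#   stdin one accession.version per line (e.g. from blastdbcmd -outfmt "%a")
missing_accessions() {
    # Read the whole listing once and strip the trailing ".N" version suffix.
    listing=$(sed 's/\.[0-9]*$//')
    echo "$1" | tr ',' '\n' | while read -r acc; do
        # -q quiet, -x whole-line match, -F fixed string (no regex)
        echo "$listing" | grep -qxF "$acc" || echo "$acc"
    done
}

# Example usage inside the container:
#   blastdbcmd -entry all -db nurse-shark-proteins -outfmt "%a" |
#       missing_accessions Q90523,P80049,P83981,P83982,P83983,P83977,P83984,P83985,P27950
```

Fed the seven accessions listed above, this prints P83982 and P83983, the two that were requested but are not in the database.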

root@6437c2fc9c4f:/blast# blastdbcmd -list /blast/blastdb -remove_redundant_dbs ##### this does not work

root@6437c2fc9c4f:/blast# blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins \
    -out /blast/results/blastp.out
root@6437c2fc9c4f:/blast# cd results
root@6437c2fc9c4f:/blast/results# ls
blastp.out
root@6437c2fc9c4f:/blast/results# cat blastp.out
BLASTP 2.14.0+

Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.

Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.

Database: Nurse shark proteins
           7 sequences; 922 total letters

Query= sp|P01349.2|RELX_CARTA RecName: Full=Relaxin; Contains: RecName: Full=Relaxin B chain; Contains: RecName: Full=Relaxin A chain

Length=44
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName...  14.2    0.96

P80049.1 RecName: Full=Fatty acid-binding protein, liver; AltName: Full=Liver-type fatty acid-binding protein; Short=L-FABP
Length=132

 Score = 14.2 bits (25),  Expect = 0.96, Method: Compositional matrix adjust.
 Identities = 3/9 (33%), Positives = 6/9 (67%), Gaps = 0/9 (0%)

Query  2    LCGRGFIRA  10
            +C R ++R
Sbjct  123  VCTREYVRE  131

Lambda      K        H        a         alpha
   0.334    0.143    0.520    0.792     4.96

Gapped
Lambda      K        H        a         alpha    sigma
   0.267   0.0410    0.140     1.90     42.6     43.6

Effective search space used: 22680

  Database: Nurse shark proteins
    Posted date:  May 7, 2023  5:51 AM
  Number of letters in database: 922
  Number of sequences in database:  7

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Neighboring words threshold: 11
Window for multiple hits: 40

root@6437c2fc9c4f:/blast/results#
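The report above is blastp's default pairwise format. For scripting, BLAST+ can also write tab-separated hits with -outfmt 6, whose default columns end in the e-value and bit score. A minimal sketch of filtering that tabular output by e-value follows. The name filter_by_evalue is a hypothetical helper, and the cutoff in the usage note is an arbitrary illustration.

```shell
# Hypothetical helper: keep -outfmt 6 hits at or below an e-value cutoff.
# Default -outfmt 6 columns:
#   qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
#   $1    maximum e-value to keep
#   stdin tab-separated hits, one per line
filter_by_evalue() {
    awk -F '\t' -v max="$1" '
        # Adding 0 forces a numeric comparison, so values like 1e-5 compare as numbers.
        NF >= 12 && ($11 + 0) <= (max + 0) { print $2 "\t" $11 "\t" $12 }
    '
}

# Example usage inside the container:
#   blastp -query /blast/queries/P01349.fsa -db nurse-shark-proteins -outfmt 6 |
#       filter_by_evalue 1
```

Each line it prints holds the subject ID, e-value and bit score of one retained hit.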

tom6931 commented 1 year ago

Hi, this works fine for me (see below). It also seems like a lot of the steps work for you. Could you check the permissions on the queries directory and/or if there is already a file P01349.fsa in the directory?

You do not need a cloud account to run ncbi/blast docker.

Tom

madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ docker run --rm ncbi/blast efetch -db protein -format fasta -id P01349 > queries/P01349.fsa
madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ ls -l queries/
total 4
-rw-rw-r-- 1 madden madden 173 May  8 14:17 P01349.fsa
madden@eb-1524-blastx:~/DOCKER_COMPLAINT$ ls -ld queries/
drwxrwxr-x 2 madden madden 4096 May  8 14:17 queries/
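For anyone hitting the same error: in `docker run ... > queries/P01349.fsa` the redirection is performed by the shell on the host, not inside the container, so a "Permission denied" from bash refers to the host-side queries directory or to an existing file in it. A minimal sketch of the suggested check, assuming a POSIX shell. The name check_redirect_target is a hypothetical helper, not something shipped with the image.

```shell
# Hypothetical helper (not part of the ncbi/blast image): check whether the
# host shell could create or overwrite the file named as a redirect target.
check_redirect_target() {
    target="$1"
    dir=$(dirname "$target")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    # An existing file must itself be writable to be overwritten.
    if [ -e "$target" ] && [ ! -w "$target" ]; then
        echo "existing file is not writable: $target"
        return 2
    fi
    # A new file can only be created in a writable directory.
    if [ ! -e "$target" ] && [ ! -w "$dir" ]; then
        echo "directory is not writable: $dir"
        return 3
    fi
    echo "ok: $target can be written"
}

# Example usage on the host, from the directory where docker is invoked:
#   check_redirect_target queries/P01349.fsa &&
#       docker run --rm ncbi/blast efetch -db protein -format fasta -id P01349 > queries/P01349.fsa
```

If the directory or file turns out to be owned by another user (for example root), `ls -ld queries` and `ls -l queries` on the host will show it, as in the listing above.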