Closed cmajones closed 7 years ago
Can you type: head nr_viral.faa … Do you have GI numbers within the FASTA descriptions in this file?
Also, can you type: head gi_to_des.tab
Cheers, Pete
From: cmajones [mailto:notifications@github.com] Sent: 11 April 2017 13:51 To: peterthorpe5/public_scripts Cc: Subscribed Subject: [peterthorpe5/public_scripts] gi_to_des.tab file creation issues (#3)
Hi Peter,
Thank you for your script - I believe it will work for my needed use.
I am running a diamond blastx search of MGS files against a custom database of viral proteins from RefSeq (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/). I concatenated the viral.X.protein.faa files together and made a custom diamond database for my purposes.
I want to add taxonomy to my diamond output for use in MEGAN because it isn't properly processing my .daa output.
Back to your script - when I try to make the gi_to_des.tab file I get an error saying
Error: AssertionError: Error, gi_to_des.tab file is not formatted as expected. It wants Gi_number description. See help on how to make this file, or use the shell script.
Can you explain how to make the gi_to_des.tab file? I followed your instructions in wiki:
makeblastdb -in viral.1.2.nr.protein.faa -out nr_viral
blastdbcmd -entry 'all' -db viral.1.2.nr.protein.faa > nr.faa
python /home/casey/diamond_blast_to_taxid/prepare_gi_to_description_databse.py -i nr_viral.faa -o gi_to_des.tab
Do you have an idea what I may be doing wrong?
Thank you, Casey
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/peterthorpe5/public_scripts/issues/3, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJhCqHerTrXSBQHYRG1FN4LJidwTJrePks5ru3crgaJpZM4M6DVR.
The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796
Here is the header of the .faa file I used to make the BLAST db:
casey@kronos:~/blast/db$ head refseq_viral.faa
ref|YP_008320337.1| terminase small subunit [Paenibacillus phage phiIBB_Pl23]
MKGGEPEMAVPTSKLIREYLGETYEESDEQLIQLYIETHQFYRRLQKEIKNSELMYEYTNKAGATNLVKNPLSIELTKTV
QTLNNLLKSLGLTPAQRKKVVSEDDDDFDDF
ref|YP_008320338.1| terminase large subunit [Paenibacillus phage phiIBB_Pl23]
MTMTSTTSNLPGILSQPSSELLTNWYAEQVVQGHILASHKVMLAGKRHLDDLKRQGSKDFPYVFDEEKGHRPIVFIERFC
KPSKGKFKQMIMQPWQHFILGNLYGWVHKETGLRRFTEGLIFIARKNGKSGLASGISIYGCTKDGERGADVYVLANSMKQ
VRKTIFDECKKMIKASPQLKKKMKALRDVIEYKQTNSIIEPQASDSEKLDGLNTHLAVFDEIHEYKNYDLINIIKNSTDT
REQPLLLYITTAGYQLDGPLVDYYELGADVLEGVVSDERTFYYMAELDSEEEIDNPDMWGKANPNLGVTYDLEKLKNAWE
KRKNIPAERSDMIVKRFNIFVKADEMSFIDFNTLRKNNKHLDIDSLNGKTAIGSFDLSESEDFTSACLEFPLDTGEIFVL
SHSWIPRKKVLANNEKIPYMQFVEDGSLTVCEAEYVEYEMIYDWFVNHSKTFSIEKIAYDRAKAFRLVKALESYGFQTEI
And the resulting gi_to_des.tab file looks like:
casey@kronos:~/blast/db$ head gi_to_des.tab
YP_008320337.1 terminase small subunit [Paenibacillus phage phiIBB_Pl23]
YP_008320338.1 terminase large subunit [Paenibacillus phage phiIBB_Pl23]
YP_008320339.1 portal protein [Paenibacillus phage phiIBB_Pl23]
YP_008320340.1 Clp protease-like protein [Paenibacillus phage phiIBB_Pl23]
YP_008320341.1 major capsid protein [Paenibacillus phage phiIBB_Pl23]
YP_008320342.1 hypothetical protein IBBPl23_06 [Paenibacillus phage phiIBB_Pl23]
YP_008320343.1 head-tail connector protein [Paenibacillus phage phiIBB_Pl23]
YP_008320344.1 head-tail joining protein [Paenibacillus phage phiIBB_Pl23]
YP_008320345.1 hypothetical protein IBBPl23_09 [Paenibacillus phage phiIBB_Pl23]
YP_008320346.1 hypothetical protein IBBPl23_10 [Paenibacillus phage phiIBB_Pl23]
Thanks for your quick reply! Casey
The names don’t start with GI numbers. These are YP_ accessions.
I will code something up tomorrow that will convert these to GI numbers, or directly get the tax info …
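To illustrate the mismatch described above: the old workflow expected legacy headers that begin with gi|number|, whereas these RefSeq headers carry only an accession. The following is a minimal Python sketch, not the code from the repository. The function names header_style and extract_identifier are hypothetical, and it assumes the legacy header layout gi|<number>|<db>|<accession>|.

```python
import re

# Assumed legacy NCBI header layout: "gi|12345|ref|YP_000001.1| description".
GI_PATTERN = re.compile(r"^>?gi\|(\d+)\|")


def header_style(header_line):
    """Return 'gi' for a legacy gi|number| header, otherwise 'accession'."""
    return "gi" if GI_PATTERN.match(header_line.strip()) else "accession"


def extract_identifier(header_line):
    """Return the GI number if present, otherwise the bare accession."""
    text = header_line.strip().lstrip(">")
    gi_match = GI_PATTERN.match(text)
    if gi_match:
        return gi_match.group(1)
    first_token = text.split()[0]
    # "ref|YP_008320337.1|" -> "YP_008320337.1"; bare accessions pass through.
    parts = [part for part in first_token.split("|") if part]
    return parts[-1]
```

Run on the first header posted above, header_style reports an accession-style name, which would be consistent with the assertion failing on a file that is expected to hold GI numbers.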
Thanks so much! Is this error potentially occurring because I'm making a custom database using RefSeq viral sequences instead of the NCBI nr db?
Thanks again.
Yes, if you use the nr database it would work. The custom database is the problem. You could BLAST against nr, add the taxid using the script we are discussing, then filter for viral hits only using: https://github.com/peterthorpe5/public_scripts/blob/master/blast_output/top_BLAST_hit_filter_out_tax_id.py
But that only gets you the top hit, after filtering out a certain taxonomy.
I can code the other approach up. It may take a bit of time.
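For reference, picking one top hit per read from tabular output can be sketched as below. This is an illustrative sketch only, not the linked top_BLAST_hit_filter_out_tax_id.py script, and it does no taxonomy filtering. It assumes the default 12-column, tab-separated tabular format with the bit score in column 12; the name top_hits is hypothetical.

```python
def top_hits(tab_lines):
    """Keep the highest-bitscore hit for each query (column 1)."""
    best = {}
    for line in tab_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, bitscore = fields[0], float(fields[11])
        # Replace the stored hit only when this one scores strictly higher.
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, fields)
    return {query: fields for query, (_, fields) in best.items()}
```

Because ties keep the first hit seen, the result depends on input order when two hits share a bit score.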
Can you show me what your tabular diamond blast output looks like please? Again “head” will do. Thanks.
Here it is:
NB551033:17:H27G7BGX2:1:11101:3237:1089 WP_055171390.1 58.0 50 21 0 151 2 269 318 1.2e-07 63.9
NB551033:17:H27G7BGX2:1:11101:3237:1089 WP_055256333.1 58.0 50 21 0 151 2 250 299 1.2e-07 63.9
NB551033:17:H27G7BGX2:1:11101:3237:1089 SCH20855.1 58.0 50 21 0 151 2 511 560 1.6e-07 63.5
NB551033:17:H27G7BGX2:1:11101:25987:1065 WP_055256332.1 58.0 50 20 1 3 149 34 83 6.5e-09 68.2
NB551033:17:H27G7BGX2:1:11101:25987:1065 WP_055171393.1 58.0 50 20 1 3 149 29 78 6.5e-09 68.2
NB551033:17:H27G7BGX2:1:11101:25987:1065 SCH20807.1 56.0 50 21 1 3 149 34 83 3.2e-08 65.9
NB551033:17:H27G7BGX2:1:11101:4896:1058 SCH20754.1 70.0 40 12 0 3 122 314 353 2.3e-06 59.7
NB551033:17:H27G7BGX2:1:11101:10573:1121 SCH20855.1 83.7 49 8 0 3 149 16 64 4.9e-17 95.1
NB551033:17:H27G7BGX2:1:11101:10573:1121 CDL65712.1 63.3 49 18 0 3 149 15 63 2.4e-11 76.3
NB551033:17:H27G7BGX2:1:11101:21716:1122 WP_055171390.1 78.0 50 11 0 1 150 349 398 1.9e-16 93.2
Thanks, Casey
Hi Casey,
Those BLAST results don’t match the names you used in the FASTA file (pasted from what you sent me):
ref|YP_008320337.1| terminase small subunit [Paenibacillus phage phiIBB_Pl23]
MKGGEPEMAVPTSKLIREYLGETYEESDEQLIQLYIETHQFYRRLQKEIKNSELMYEYTNKAGATNLVKNPLSIELTKTV
QTLNNLLKSLGLTPAQRKKVVSEDDDDFDDF
ref|YP_008320338.1| terminase large subunit [Paenibacillus phage phiIBB_Pl23]
MTMTSTTSNLPGILSQPSSELLTNWYAEQVVQGHILASHKVMLAGKRHLDDLKRQGSKDFPYVFDEEKGHRPIVFIERFC
Did you BLAST against this FASTA file? I don’t see where, for example, WP_055171390.1 comes from.
Pete
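A quick way to confirm this kind of mismatch is to compare the subject IDs in the tabular output with the IDs in the FASTA headers. The sketch below is a rough illustration under stated assumptions: the output is tab-separated with the subject ID in column 2, FASTA headers start with ">", and the subject ID equals the first whitespace-delimited token of the header. The names fasta_ids and missing_subjects are hypothetical.

```python
def fasta_ids(fasta_lines):
    """Collect the first token of every '>' header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def missing_subjects(tab_lines, known_ids):
    """Return subject IDs (column 2) that are absent from known_ids."""
    missing = set()
    for line in tab_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[1] not in known_ids:
            missing.add(fields[1])
    return missing
```

Passing open file handles for the FASTA file and the tabular output works, since both functions only iterate over lines. A non-empty result suggests the search was run against a different database than the FASTA file being checked.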
My apologies - that was my output from the diamond blastx I did against the nr db from your suggestion.
See below for the tabular diamond blast output against the custom viral database:
NB551033:17:H27G7BGX2:1:11101:7377:1191 ref|YP_009160410.1| 46.9 49 26 0 147 1 39 87 4.5e-08 56.2
NB551033:17:H27G7BGX2:1:11101:7377:1191 ref|YP_009160395.1| 39.6 48 29 0 144 1 65 112 2.8e-05 47.0
NB551033:17:H27G7BGX2:1:11101:7377:1191 ref|YP_009160339.1| 52.6 38 18 0 114 1 27 64 1.8e-04 44.3
NB551033:17:H27G7BGX2:1:11101:18559:1328 ref|YP_009160395.1| 52.3 44 18 2 146 24 84 127 3.0e-04 43.5
NB551033:17:H27G7BGX2:1:11101:18559:1328 ref|YP_009160339.1| 64.3 28 10 0 143 60 37 64 8.9e-04 42.0
NB551033:17:H27G7BGX2:1:11101:4835:1375 ref|YP_009218538.1| 50.0 48 20 1 3 146 74 117 1.9e-06 50.8
NB551033:17:H27G7BGX2:1:11101:4835:1375 ref|YP_009160416.1| 39.1 46 24 1 12 149 46 87 8.9e-04 42.0
NB551033:17:H27G7BGX2:1:11101:14119:1571 ref|YP_009160395.1| 52.3 44 18 2 144 22 84 127 3.1e-04 43.5
NB551033:17:H27G7BGX2:1:11101:14119:1571 ref|YP_009160339.1| 64.3 28 10 0 141 58 37 64 9.0e-04 42.0
NB551033:17:H27G7BGX2:1:11101:21938:1768 ref|YP_009160410.1| 54.0 50 19 1 2 139 50 99 1.0e-07 55.1
I’m working on it and am currently in the testing stage. I am slightly worried that your hits against NR look very different from mine. Mine are GI numbers. Can you send me the “head” of nr.faa? The file will be massive, but don’t worry.
Pete
Hi Pete - see below:
WP_003131952.1 30S ribosomal protein S18 [Lactococcus lactis]NP_268346.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis Il1403]Q9CDN0.1 RecName: Full=30S ribosomal protein S18Q02VU1.1 RecName: Full=30S ribosomal protein S18A2RNZ2.1 RecName: Full=30S ribosomal protein S18AAK06287.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis Il1403]ABJ73931.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris SK11]CAL99037.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris MG1363]ADA65983.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis KF147]ADJ61439.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris NZ9000]ADZ64834.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis CV56]EHE92602.1 hypothetical protein LLCRE1631_01913 [Lactococcus lactis subsp. lactis CNCM I-1631]AEU41715.1 SSU ribosomal protein S18p [Lactococcus lactis subsp. cremoris A76]BAL52156.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis IO-1]AFW92578.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris UC509.9]CDG05746.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis A12]EQC53187.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis bv. diacetylactis str. TIFN4]EQC53393.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis bv. diacetylactis str. TIFN2]EQC54683.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris TIFN6]EQC56744.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris TIFN5]EQC82878.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris TIFN7]EQC91162.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris TIFN1]EQC94448.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris TIFN3]AGV74185.1 ribosomal protein S18 RpsR [Lactococcus lactis subsp. cremoris KW2]AGY45032.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis KLDS 4.0325]ESK79551.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis bv. 
diacetylactis str. LD61]KEY61992.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris GE214]AII13743.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis NCDO 2118]KGF77556.1 SSU ribosomal protein S18p SSU ribosomal protein S18p, zinc-independent [Lactococcus lactis]AIS04718.1 SSU ribosomal protein S18P [Lactococcus lactis]KGH32949.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris]KHE77803.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis 1AA59]KKW69436.1 ribosomal protein bS18, rpsR [Lactococcus lactis subsp. cremoris]KKW70341.1 ribosomal protein bS18, rpsR [Lactococcus lactis subsp. cremoris]KLK95226.1 ribosomal protein bS18, rpsR [Lactococcus lactis subsp. lactis]KRO21588.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]KST41693.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis bv. diacetylactis]KST76534.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST79241.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST81638.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST85642.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST88531.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST92921.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST97154.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST98471.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KST99285.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. 
lactis]KSU03686.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU05991.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU09388.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU13881.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU20925.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU23615.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU25349.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU27070.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU28321.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KSU32404.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis]KZK07251.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]KZK08880.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. lactis bv. diacetylactis]KZK09361.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]KZK33282.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]KZK44117.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]KZK46962.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]KZK52810.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. 
cremoris]KZK53814.1 SSU ribosomal protein S18p SSU ribosomal protein S18p zinc-independent [Lactococcus lactis subsp. cremoris]OAJ97698.1 30S ribosomal protein S18 [Lactococcus lactis]OAZ16676.1 30S ribosomal protein S18 [Lactococcus lactis RTB018]SBW31684.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]OEU38668.1 30S ribosomal protein S18 [Lactococcus lactis subsp. cremoris IBB477]OJH46247.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis bv. diacetylactis]ONK31551.1 30S ribosomal protein S18 [Lactococcus lactis subsp. lactis]ARD92294.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]ARD97280.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARD99957.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE04690.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE06709.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]ARE09571.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE12078.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE14468.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE16888.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE19344.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]ARE21948.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. lactis]ARE24261.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. cremoris]ARE27001.1 SSU ribosomal protein S18P [Lactococcus lactis subsp. 
cremoris] MAQQRRGGFKRRKKVDFIAANKIEVVDYKDTELLKRFISERGKILPRRVTGTSAKNQRKVVNAIKRARVMALLPFVAEDQ N XP_642131.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4]P54670.1 RecName: Full=Calfumirin-1; Short=CAF-1BAA06266.1 calfumirin-1 [Dictyostelium discoideum AX2]EAL68086.1 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK VQKLLNPDQ XP_642837.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4]EAL68957.1 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] MKTKSSNNIKKIYYISSILVGIYLCWQIIIQIIFLMDNSIAILEAIGMVVFISVYSLAVAINGWILVGRMKKSSKKAQYE DFYKKMILKSKILLSTIIIVIIVVVVQDIVINFILPQNPQPYVYMIISNFIVGIADSFQMIMVIFVMGELSFKNYFKFKR
In contrast, this is what the header of my custom viral DB looks like:
ref|YP_008320337.1| terminase small subunit [Paenibacillus phage phiIBB_Pl23]
MKGGEPEMAVPTSKLIREYLGETYEESDEQLIQLYIETHQFYRRLQKEIKNSELMYEYTNKAGATNLVKNPLSIELTKTV
QTLNNLLKSLGLTPAQRKKVVSEDDDDFDDF
ref|YP_008320338.1| terminase large subunit [Paenibacillus phage phiIBB_Pl23]
MTMTSTTSNLPGILSQPSSELLTNWYAEQVVQGHILASHKVMLAGKRHLDDLKRQGSKDFPYVFDEEKGHRPIVFIERFC
KPSKGKFKQMIMQPWQHFILGNLYGWVHKETGLRRFTEGLIFIARKNGKSGLASGISIYGCTKDGERGADVYVLANSMKQ
VRKTIFDECKKMIKASPQLKKKMKALRDVIEYKQTNSIIEPQASDSEKLDGLNTHLAVFDEIHEYKNYDLINIIKNSTDT
REQPLLLYITTAGYQLDGPLVDYYELGADVLEGVVSDERTFYYMAELDSEEEIDNPDMWGKANPNLGVTYDLEKLKNAWE
KRKNIPAERSDMIVKRFNIFVKADEMSFIDFNTLRKNNKHLDIDSLNGKTAIGSFDLSESEDFTSACLEFPLDTGEIFVL
SHSWIPRKKVLANNEKIPYMQFVEDGSLTVCEAEYVEYEMIYDWFVNHSKTFSIEKIAYDRAKAFRLVKALESYGFQTEI
OK. I’m completely updating it so it will work with both new data and legacy data. It will also work with your custom database. This will take me another day or so. I need to test before release.
How much RAM do you have access to on your server? These new files are massive! If you have more than 40 GB then fine.
Pete
Thanks so much Peter! We have a large server (~250 GB RAM) so shouldn't be an issue.
I’m now in the testing phase. I have coded up the changes. This should now work with old and new data.
When you sent me the “head” of your FASTA files, your sequences didn’t start with >. They should look like:
>ref|YP_008320337.1| terminase small subunit [Paenibacillus phage phiIBB_Pl23]
MKGGEPEMAVPTSKLIREYLGETYEESDEQLIQLYIETHQFYRRLQKEIKNSELMYEYTNKAGATNLVKNPLSIELTKTV
QTLNNLLKSLGLTPAQRKKVVSEDDDDFDDF
Pete
Hi Pete,
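The .faa files do start with ">". I believe it was the formatting that changed it (screenshot: https://cloud.githubusercontent.com/assets/26903976/25006898/039f43fe-2035-11e7-8160-57c66559c7f9.png).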
When you push the changes I will give it a try as well and let you know. Thanks so much!
Ok good.
Ok, so this now works with the small test data I have set up. I am testing it on big data now. If you want to, download it and try it. The current working version is the latest on GitHub. Delete the old one.
You will need to make a new accession to taxid database.
Download the following:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
md5sum -c prot.accession2taxid.gz.md5
gunzip prot.accession2taxid.gz
Then run:
python prepare_accession_to_description_db.py (-d default is 4)
By default this will bring back 4 descriptions associated with each accession code (if there are fewer than 4, it will bring back all of them). If you want more, pass e.g. -d 10. But the file will get big and hard to read.
Let me know how you get on. GOOD or BAD!
Pete
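For anyone curious what the downloaded table contains, here is a minimal sketch of an accession-to-taxid lookup. It is not the code in prepare_accession_to_description_db.py. It assumes prot.accession2taxid is tab-separated with a header row and the columns accession, accession.version, taxid and gi; the function name load_taxids and the idea of passing a set of wanted accessions are illustrative choices.

```python
def load_taxids(accession2taxid_lines, wanted):
    """Map accession.version -> taxid for the accessions in `wanted`."""
    taxids = {}
    for line in accession2taxid_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header row and any malformed line.
        if len(fields) < 3 or fields[0] == "accession":
            continue
        accession_version, taxid = fields[1], fields[2]
        # Storing only the wanted accessions keeps memory use small
        # even though the input file is very large.
        if accession_version in wanted:
            taxids[accession_version] = int(taxid)
    return taxids
```

One way to use it is to collect the subject accessions from the tabular output into a set, then call load_taxids with an open handle on the unzipped prot.accession2taxid file. Streaming the file line by line and storing only the accessions you need avoids holding the whole table in RAM.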
Hi Pete,
I'm unsure that this has been updated on GitHub - could you link me to current working version?
Best, Casey
Hi Casey,
If it fails, could you please tell me how? It should work, the latest code is up … Don’t run Download_all_files_I_need.sh (https://github.com/peterthorpe5/public_scripts/blob/master/Diamond_BLAST_add_taxonomic_info/Download_all_files_I_need.sh). I haven’t got this running properly yet.
The diamond people have updated the tool to return tax info: https://github.com/bbuchfink/diamond/blob/master/diamond_manual.pdf
cheers,
Pete