peterthorpe5 / public_scripts

collection of bioinformatic scripts
29 stars 23 forks source link

fail to run Diamond_blast_to_taxid.py #2

Closed jie11owen closed 7 years ago

jie11owen commented 7 years ago

Dear Peter,

I feel so lucky to find such a scirpt you wrote to solve the taxid problem from diamond output. However, when I run Diamond_blast_to_taxid.py, I got a problem like this: Traceback (most recent call last): File "Diamond_blast_to_taxid.py", line 599, in parse_diamond_tab(diamond_tab_output, path_files, gi_taxid_prot, categories, names, gi_to_des, outfile) File "Diamond_blast_to_taxid.py", line 170, in parse_diamond_tab taxon_to_kingdom = assign_cat_to_dic(categories) File "Diamond_blast_to_taxid.py", line 100, in assign_cat_to_dic kingdom_tax_id[int(tax_id)]= kingdom_dic[kingdom] KeyError: 'O'

Do you know what kind of mistake I might make . Thanks for your time and help.

peterthorpe5 commented 7 years ago

Hi Jie11owen.

Can you please return the top info from each file used by typing: head NAME_PF_FILE.

Also, can you tell me the exact command you used to run the script. I take it you set up the database before hand? Pete

jie11owen commented 7 years ago

Hi Peter,

Thanks for your immediate reply. Here is my command line to run the script "python Diamond_blast_to_taxid.py -i nr0.out -t gi_taxid_prot.dmp -c categories.dmp -n names.dmp -d gi_to_des.tab -o /home/wj/LM_Analysis/DIAMOND/output.tab". The following includes the information of each file. If you need more information, please tell me.

wj@line-HP-Z820-Workstation:~/LM_Analysis/DIAMOND$ head nr0.out seq1370_len536_cov47 66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7 seq1370_len536_cov47 90421313 40.6 64 38 0 195 4 1005 1068 3.60E-06 47 seq3147_len123_cov25 66819573 55 40 14 1 3 122 439 474 2.10E-07 48.9 seq3825_len128_cov19 66810796 75 24 6 0 56 127 40 63 3.60E-05 41.6 seq3825_len128_cov19 67517630 75 24 6 0 56 127 40 63 1.00E-04 40 seq3825_len128_cov19 70995816 75 24 6 0 56 127 40 63 1.00E-04 40 seq4393_len820_cov44 66825411 48.8 41 21 0 143 21 15 55 8.60E-07 49.7 seq5963_len168_cov78 85068355 92.5 53 4 0 160 2 166 218 4.50E-24 104.8 seq5963_len168_cov78 4506209 77.4 53 12 0 160 2 209 261 4.40E-19 88.2 seq5963_len168_cov78 66818341 73.6 53 14 0 160 2 204 256 5.70E-19 87.8

wj@line-HP-Z820-Workstation:~/LM_Analysis/DIAMOND$ head gi_taxid_prot.dmp 6 9913 8 9913 10 9913 12 9913 14 9913 32 9913 35 9913 42 9913 44 9913 46 9913

wj@line-HP-Z820-Workstation:~/LM_Analysis/DIAMOND$ head categories.dmp B 7 7 B 9 9 B 11 11 B 14 14 B 17 17 B 19 19 B 21 21 B 23 23 B 24 24 B 25 25

wj@line-HP-Z820-Workstation:~/LM_Analysis/DIAMOND$ head names.dmp 1 | all |  | synonym | 1 | root |  | scientific name | 2 | Bacteria | Bacteria  | scientific name | 2 | Monera | Monera  | in-part | 2 | Procaryotae | Procaryotae  | in-part | 2 | Prokaryota | Prokaryota  | in-part | 2 | Prokaryotae | Prokaryotae  | in-part | 2 | bacteria | bacteria  | blast name | 2 | eubacteria |  | genbank common name | 2 | not Bacteria Haeckel 1894 |  | synonym |

wj@line-HP-Z820-Workstation:~/LM_Analysis/DIAMOND$ head gi_to_des.tab 489223532 30S ribosomal protein S18 [Lactococcus lactis] 66816243 hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] 66818355 hypothetical protein DDB_G0276911 [Dictyostelium discoideum AX4] 446106212 MULTISPECIES: antibiotic transporter [Bacillus] 494110381 MULTISPECIES: argininosuccinate lyase [Bifidobacterium] 446057344 MULTISPECIES: 30S ribosomal protein S18 [Bacteria] 489344060 MULTISPECIES: leucyl/phenylalanyl-tRNA--protein transferase [Pseudomonas] 489505010 MULTISPECIES: SecB-like chaperone [Mycobacterium tuberculosis complex] 446301966 MULTISPECIES: O-acetyltransferase OatA [Staphylococcus] 446254182 MULTISPECIES: ribonucleoside-diphosphate reductase 1 subunit beta [Proteobacteria]

Cheers Jie

peterthorpe5 commented 7 years ago

Im re-running the "making database" steps myself, as I think NCBI have changed the format of one of these files. Also, it looks like your run failed when trying to get the data for an entry with tax_id = 0 ... I will look into this. As you know making the DB take a couple of hours.

peterthorpe5 commented 7 years ago

OK. I know what is going wrong here. Can you type head on the nr.fasta file that you used to make the database with? The fasta file we have on our servers here may be a little bit old now and it still uses gi numbers (if they have altered this). For example my blast output is:

TRINITY_DN10008_c0_g2_i1|m.50805 gi|685832877|emb|CEF67798.1| 46.9 49 26 0 22 70 110 158 1.0e-06 62.0

Whereas your file (look at column 2): seq1370_len536_cov47 66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7

So in the second column I get the gi number. I want to see how your nr.fasta looks please.

The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796

jie11owen commented 7 years ago

I got the same output format like yours (i.e. gi|685832877|emb|CEF67798.1| ) at the begining, but when I run the script, it reported the error. So I changed the blast output using excel and then got the format like this (seq1370_len536_cov47 66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7). Should I changed it to “seq1370_len536_cov47 gi|66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7” ?

The head of my nr.fasta is: >gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK VQKLLNPDQ

peterthorpe5 commented 7 years ago

Never ever use Excel to change data with is going to go into scripts. It changes thing without you knowing! I have found the problem. In the new catergories.dmp file from NCBI, there is a new category which was not available when I coded it called “O”. I am fixing this now.

From: jie11owen [mailto:notifications@github.com] Sent: 10 April 2017 12:52 To: peterthorpe5/public_scripts Cc: Peter Thorpe; Comment Subject: Re: [peterthorpe5/public_scripts] fail to run Diamond_blast_to_taxid.py (#2)

I got the same output format like yours (i.e. gi|685832877|emb|CEF67798.1| ) at the begining, but when I run the script, it reported the error. So I changed the blast output using excel and then got the format like this (seq1370_len536_cov47 66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7). Should I changed it to “seq1370_len536_cov47 gi|66813510 56.2 48 21 0 144 1 882 929 1.70E-08 54.7” ?

The head of my nr.fasta is:

gi|66816243|ref|XP_642131.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] >gi|1705556|sp|P54670.1|CAF1_DICDI RecName: Full=Calfumirin-1; Short=CAF-1 >gi|793761|dbj|BAA06266.1| calfumirin-1 [Dictyostelium discoideum AX2] >gi|60470106|gb|EAL68086.1| hypothetical protein DDB_G0277827 [Dictyostelium discoideum AX4] MASTQNIVEEVQKMLDTYDTNKDGEITKAEAVEYFKGKKAFNPERSAIYLFQVYDKDNDGKITIKELAGDIDFDKALKEY KEKQAKSKQQEAEVEEDIEAFILRHNKDDNTDITKDELIQGFKETGAKDPEKSANFILTEMDTNKDGTITVKELRVYYQK VQKLLNPDQ

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/peterthorpe5/public_scripts/issues/2#issuecomment-292927356, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJhCqG1Q8Lg7RH_1hEoCh61g6ss3kUfnks5ruhfggaJpZM4M4aWv.

The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796

jie11owen commented 7 years ago

Thanks for your suggestions, I will not use excel any more. I am looking forward to your newly revised script.

peterthorpe5 commented 7 years ago

This is now updated and should work. Please use the file with the gi numbers in the second colomn.

Enjoy! Pete

jie11owen commented 7 years ago

It works now. Thank you so much for your kind help.

peterthorpe5 commented 7 years ago

From: jie11owen [mailto:notifications@github.com] Sent: 10 April 2017 14:49 To: peterthorpe5/public_scripts Cc: Peter Thorpe; Comment Subject: Re: [peterthorpe5/public_scripts] fail to run Diamond_blast_to_taxid.py (#2)

It works now. Thank you so much for your kind help.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/peterthorpe5/public_scripts/issues/2#issuecomment-292954633, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AJhCqOzdYPDILzK9PAQSL74624MKkihSks5rujNfgaJpZM4M4aWv.

The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796