Closed ndreey closed 1 year ago
Actually, I will look into NCBITaxa, taxize, and other taxonomy packages and see if I can run the taxids through them to separate fungi, orchid and not_euk.
Correction
I am using `taxonomic_profile_1.txt` as a taxonomy database to map each taxid to a superkingdom. There is no need to get info on the strain taxids, as they are an artifact of a simulated run. `taxonomic_profile_1.txt` is generated by CAMISIM.
I removed the thaliana row and added these lines to `taxonomic_profile_1.txt`:
2320716 species 2759|2320716 Eukaryota|Platanthera zijinensis 0.0161
156515 species 2759|156515 Eukaryota|Tulasnella calospora 0.0161
1287689 species 2759|1287689 Eukaryota|Rhizoctonia solani 0.0161
305860 species 2759|305860 Eukaryota|Ceratobasidium sp. AG-I 0.0161
The script now generates 897 rows, which matches my total number of genomes.
What I need to do now is write an `if` statement that puts P. zijinensis in its own group, so I get (orchid, euk, not_euk).
Might just change the group names to (orchid, fungi, not_euk).
I will close this issue when get_data groups them into three groups.
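A minimal sketch of the three-way grouping described above, written in Python for illustration only (it is not the `get_data` code from the R script). It assumes the profile is tab-separated with the columns taxid, rank, taxpath, taxpathsn, percentage, that the first `taxpathsn` element is the superkingdom, and that every eukaryote other than the orchid is a fungus once the thaliana row is removed. `assign_group` and the bacterial row are hypothetical.

```python
import csv
import io

ORCHID_TAXID = "2320716"  # Platanthera zijinensis, taken from the rows above


def assign_group(taxid, taxpathsn):
    """Return 'orchid', 'fungi', or 'not_euk' for one profile row."""
    if taxid == ORCHID_TAXID:
        return "orchid"
    # Assumption: the first TAXPATHSN element is the superkingdom.
    superkingdom = taxpathsn.split("|")[0]
    if superkingdom == "Eukaryota":
        return "fungi"
    return "not_euk"


# Two rows from the thread plus one made-up bacterial row for illustration.
profile = (
    "2320716\tspecies\t2759|2320716\tEukaryota|Platanthera zijinensis\t0.0161\n"
    "156515\tspecies\t2759|156515\tEukaryota|Tulasnella calospora\t0.0161\n"
    "1406\tspecies\t2|1406\tBacteria|example bacterium\t0.0161\n"
)

groups = {
    row[0]: assign_group(row[0], row[3])
    for row in csv.reader(io.StringIO(profile), delimiter="\t")
}
print(groups)
```

Checking the orchid taxid before the superkingdom matters, because the orchid is itself a eukaryote and would otherwise land in the fungi group.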
`subset_data.R` creates this data frame as of March 15:
genome_id taxid rank size
1 Platanthera_zijinensis_chr 2320716 host 4186550321
2 Phoma_radicina_MPI-SP2-AT-0466 565429 fungi 38215437
3 Aspergillus_fumigatus_MPI-SW4-AT-0569 746128 fungi 47521149
4 Sarocladium_strictum_MPI-IT2-AT-0306 5046 fungi 31483859
5 Pyrenochaeta_lycopersici_MPI-FR1-AT-0381 285811 fungi 54218686
6 Peyronellaea_curtisii_MPI-SP2-AT-0415 749631 fungi 36822598
7 Talaromyces_verruculosus_MPI-SP2-AT-0411 198730 fungi 32407473
8 Penicillium_canescens_MPI-SW4-AT-0573 5083 fungi 32967678
9 Cylindrocarpon_pauciseptatum_MPI-SP2-AT-0468 465806 fungi 60797881
10 Verticillium_dahliae_MPI-FR1-AT-0353 27337 fungi 36758448
11 Alternaria_tenuis_MPI-SDFR-AT-0071 5599 fungi 34443718
12 Fusarium_oxysporum_MPI-CAGE-AT-0013 5507 fungi 56548958
13 Cladosporium_rectoides_MPI-GEGE-AT-0032 887101 fungi 33942924
14 Gibellulopsis_nigrescens_MPI-SP2-AT-0410 796325 fungi 35340245
15 Metacordyceps_chlamydosporia_MPI-IT2-AT-0323 280754 fungi 43916484
16 Ochroconis_tshawytschae_MPI-SP2-AT-0416 262132 fungi 36199058
17 Phomopsis_columnaris_MPI-SP2-AT-0504 193000 fungi 88466628
18 Umbelopsis_autotrophica_MPI-SW4-AT-0611 979767 fungi 27968122
19 Embellisia_chlamydospora_MPI-FR1-AT-0336 247032 fungi 37793906
20 Phialocephala_fortinii_MPI-SW4-AT-0551 62722 fungi 76760107
21 Phaeosphaeria_eustoma_MPI-FR1-AT-0339 85909 fungi 51461602
22 Truncatella_angustata_MPI-SW4-AT-0541 152316 fungi 58029377
23 Macrophomina_phaseolina_MPI-FR1-AT-0330 35725 fungi 58552437
24 Rhizopycnis_vagum_MPI-FR1-AT-0346 1589764 fungi 53344606
25 Paraphoma_chrysanthemicola_MPI-SDFR-AT-0093 798071 fungi 42351520
26 Neonectria_radicicola_MPI-CAGE-CH-0236 64609 fungi 75485037
27 Cryptosporiopsis_ericae_MPI-SW4-AT-0549 1663492 fungi 64936402
28 Leptodontidium_orchidicola_MPI-SW4-AT-0643 1732013 fungi 79839788
29 Myrothecium_cinctum_MPI-SP2-AT-0408 1860054 fungi 44683684
30 Ilyonectria_macrodidyma_MPI-GEGE-AT-0033 307937 fungi 74816530
31 Hypocrea_atroviridis_MPI-SP2-AT-0434 63577 fungi 39107378
32 Plectosphaerella_cucumerina_MPI-FR1-AT-0340 40658 fungi 39476090
33 Tulasnella_calospora_Tulcal1 156515 OMF 70372952
34 Ceratobasidium_sp_CerAGI 305860 OMF 58444101
35 Rhizoctonia_solani_Rhisola1 1287689 OMF 39822884
36 Otu721.0 627192 bacteria 4199332
37 Otu967 1345695 bacteria 5107814
38 Otu876 452863 bacteria 4395537
39 Otu398.0 1283299 bacteria 5695238
40 Otu1003.0 590998 bacteria 4266344
41 Otu1093 591158 bacteria 9175669
42 Otu1015 324057 bacteria 7184930
43 Otu633.0 1367847 bacteria 3613807
44 Otu31.0 658612 bacteria 5790538
45 Otu902 1217712 bacteria 4047559
46 Otu480.0 485918 bacteria 9127347
47 Otu884 1406 bacteria 5762608
48 Otu977 1121377 bacteria 4452642
49 Otu983 652017 bacteria 3040130
50 Otu997 694427 bacteria 3685504
51 Otu1016 1131731 bacteria 4223247
52 Otu856 1005941 bacteria 3426806
53 Otu890.0 1312959 bacteria 3894834
54 Otu964 1927 bacteria 8197540
55 Otu831 1121917 bacteria 3059517
56 Otu560.0 979226 bacteria 4772825
57 Otu981 323097 bacteria 4406967
58 Otu934.0 1285583 bacteria 3113488
59 Otu1072.0 929712 bacteria 5521807
60 Otu813 456442 archaea 2542943
61 Otu905 315750 bacteria 3704465
62 Otu1059 290397 bacteria 5013479
63 Otu1044.0 404380 bacteria 4615150
64 Otu872 1439940 bacteria 4865289
65 Otu935 443255 bacteria 6760392
66 Otu931 1121362 bacteria 3135752
67 Otu258.0 1124780 bacteria 4805697
68 Otu981.0 204669 bacteria 5650368
69 Otu618.0 236814 bacteria 4364663
70 Otu675.0 1298862 bacteria 5641932
71 RNODE_165_length_1669_cov_2.37887 45202 plasmid 16470
72 RNODE_178_length_3618_cov_3.32119 45202 plasmid 35960
73 RNODE_201_length_2438_cov_4.66846 32644 unknown 24160
74 RNODE_241_length_10461_cov_3.82000 1214906 virus 104390
75 RNODE_244_length_5333_cov_28.75937 45202 plasmid 53110
76 RNODE_250_length_5534_cov_8.34289 32644 unknown 55120
77 RNODE_262_length_4000_cov_14.35495 45202 plasmid 39780
78 RNODE_278_length_7999_cov_5.67456 45202 plasmid 79770
79 RNODE_288_length_3164_cov_59.95449 45202 plasmid 31420
80 RNODE_293_length_11038_cov_5.01480 45202 plasmid 110160
81 RNODE_297_length_1810_cov_21.16499 45202 plasmid 17880
82 RNODE_327_length_1948_cov_35.31308 32644 unknown 19260
83 RNODE_329_length_4403_cov_3.14905 32644 unknown 43810
84 RNODE_405_length_1632_cov_8.96149 45202 plasmid 16100
85 RNODE_41_length_1982_cov_209.61480 45202 plasmid 19600
86 RNODE_447_length_1716_cov_15.11213 45202 plasmid 16720
87 RNODE_454_length_3059_cov_7.56632 32644 unknown 29710
88 RNODE_45_length_2213_cov_3.74441 32644 unknown 21910
89 RNODE_466_length_2414_cov_22.47806 45202 plasmid 23480
90 RNODE_482_length_12608_cov_5.40546 45202 plasmid 125640
91 RNODE_4_length_1367_cov_2.97100 32644 unknown 13450
92 RNODE_519_length_4527_cov_27.40475 45202 plasmid 44830
93 RNODE_53_length_5801_cov_10.05503 32644 unknown 57790
94 RNODE_567_length_2043_cov_21.53474 32644 unknown 19550
95 RNODE_568_length_4215_cov_4.24868 45202 plasmid 41710
96 RNODE_595_length_1522_cov_10.57355 32644 unknown 14780
97 RNODE_59_length_7149_cov_6.99495 45202 plasmid 71270
98 RNODE_6_length_1955_cov_9.45784 45202 plasmid 19330
99 RNODE_72_length_2497_cov_7.54950 32644 unknown 24750
100 RNODE_86_length_2125_cov_11.11507 32644 unknown 21030
`source_genomes/` contains: 894 genomes
Mock Data contains 100 genomes and is a subset of the 897 genomes.
TOTAL: 100
I will close #26 after confirmation with supervisor.
Updated the script to keep the BioProject references instead of giving them the "Otu" names (I believe those are made up). These are the references the bacteria/archaea genomes have:
36 PRJNA67115 627192 bacteria 4199332
37 PRJNA217481 1345695 bacteria 5107814
38 PRJNA20011 452863 bacteria 4395537
39 PRJNA186462 1283299 bacteria 5695238
40 PRJNA33691 590998 bacteria 4266344
41 PRJNA33599 591158 bacteria 9175669
42 PRJNA20399 324057 bacteria 7184930
43 PRJNA212980 1367847 bacteria 3613807
44 PRJNA261945 658612 bacteria 5790538
45 PRJNA183309 1217712 bacteria 4047559
46 PRJNA27951 485918 bacteria 9127347
47 PRJNA261104 1406 bacteria 5762608
48 PRJNA183018 1121377 bacteria 4452642
49 PRJNA238302 652017 bacteria 3040130
50 PRJNA42009 694427 bacteria 3685504
51 PRJNA80827 1131731 bacteria 4223247
52 PRJNA171367 1005941 bacteria 3426806
53 PRJNA232079 1312959 bacteria 3894834
54 PRJNA242829 1927 bacteria 8197540
55 PRJNA165395 1121917 bacteria 3059517
56 PRJNA81617 979226 bacteria 4772825
57 PRJNA13473 323097 bacteria 4406967
58 PRJNA186910 1285583 bacteria 3113488
59 PRJNA63851 929712 bacteria 5521807
60 PRJNA18505 456442 archaea 2542943
61 PRJNA20391 315750 bacteria 3704465
62 PRJNA12634 290397 bacteria 5013479
63 PRJNA17707 404380 bacteria 4615150
64 PRJNA232351 1439940 bacteria 4865289
65 PRJNA42475 443255 bacteria 6760392
66 PRJNA168616 1121362 bacteria 3135752
67 PRJNA182711 1124780 bacteria 4805697
68 PRJNA15771 204669 bacteria 5650368
69 PRJNA256039 236814 bacteria 4364663
70 PRJNA190819 1298862 bacteria 5641932
New update on the `mock_df` structure (March 16), #34.
`rank = taxonomy` and `group = <unique group name>`:
genome_id rank taxid group size
1 Platanthera_zijinensis_chr orchid 2320716 host 4186550321
2 Phoma_radicina_MPI-SP2-AT-0466 fungi 565429 rfungi 38215437
3 Aspergillus_fumigatus_MPI-SW4-AT-0569 fungi 746128 rfungi 47521149
4 Sarocladium_strictum_MPI-IT2-AT-0306 fungi 5046 rfungi 31483859
5 Pyrenochaeta_lycopersici_MPI-FR1-AT-0381 fungi 285811 rfungi 54218686
6 Peyronellaea_curtisii_MPI-SP2-AT-0415 fungi 749631 rfungi 36822598
7 Talaromyces_verruculosus_MPI-SP2-AT-0411 fungi 198730 rfungi 32407473
8 Penicillium_canescens_MPI-SW4-AT-0573 fungi 5083 rfungi 32967678
9 Cylindrocarpon_pauciseptatum_MPI-SP2-AT-0468 fungi 465806 rfungi 60797881
10 Verticillium_dahliae_MPI-FR1-AT-0353 fungi 27337 rfungi 36758448
11 Alternaria_tenuis_MPI-SDFR-AT-0071 fungi 5599 rfungi 34443718
12 Fusarium_oxysporum_MPI-CAGE-AT-0013 fungi 5507 rfungi 56548958
13 Cladosporium_rectoides_MPI-GEGE-AT-0032 fungi 887101 rfungi 33942924
14 Gibellulopsis_nigrescens_MPI-SP2-AT-0410 fungi 796325 rfungi 35340245
15 Metacordyceps_chlamydosporia_MPI-IT2-AT-0323 fungi 280754 rfungi 43916484
16 Ochroconis_tshawytschae_MPI-SP2-AT-0416 fungi 262132 rfungi 36199058
17 Phomopsis_columnaris_MPI-SP2-AT-0504 fungi 193000 rfungi 88466628
18 Umbelopsis_autotrophica_MPI-SW4-AT-0611 fungi 979767 rfungi 27968122
19 Embellisia_chlamydospora_MPI-FR1-AT-0336 fungi 247032 rfungi 37793906
20 Phialocephala_fortinii_MPI-SW4-AT-0551 fungi 62722 rfungi 76760107
21 Phaeosphaeria_eustoma_MPI-FR1-AT-0339 fungi 85909 rfungi 51461602
22 Truncatella_angustata_MPI-SW4-AT-0541 fungi 152316 rfungi 58029377
23 Macrophomina_phaseolina_MPI-FR1-AT-0330 fungi 35725 rfungi 58552437
24 Rhizopycnis_vagum_MPI-FR1-AT-0346 fungi 1589764 rfungi 53344606
25 Paraphoma_chrysanthemicola_MPI-SDFR-AT-0093 fungi 798071 rfungi 42351520
26 Neonectria_radicicola_MPI-CAGE-CH-0236 fungi 64609 rfungi 75485037
27 Cryptosporiopsis_ericae_MPI-SW4-AT-0549 fungi 1663492 rfungi 64936402
28 Leptodontidium_orchidicola_MPI-SW4-AT-0643 fungi 1732013 rfungi 79839788
29 Myrothecium_cinctum_MPI-SP2-AT-0408 fungi 1860054 rfungi 44683684
30 Ilyonectria_macrodidyma_MPI-GEGE-AT-0033 fungi 307937 rfungi 74816530
31 Hypocrea_atroviridis_MPI-SP2-AT-0434 fungi 63577 rfungi 39107378
32 Plectosphaerella_cucumerina_MPI-FR1-AT-0340 fungi 40658 rfungi 39476090
33 Tulasnella_calospora_Tulcal1 fungi 156515 OMF 70372952
34 Ceratobasidium_sp_CerAGI fungi 305860 OMF 58444101
35 Rhizoctonia_solani_Rhisola1 fungi 1287689 OMF 39822884
36 PRJNA67115 bacteria 627192 ba_ar 4199332
37 PRJNA217481 bacteria 1345695 ba_ar 5107814
38 PRJNA20011 bacteria 452863 ba_ar 4395537
39 PRJNA186462 bacteria 1283299 ba_ar 5695238
40 PRJNA33691 bacteria 590998 ba_ar 4266344
41 PRJNA33599 bacteria 591158 ba_ar 9175669
42 PRJNA20399 bacteria 324057 ba_ar 7184930
43 PRJNA212980 bacteria 1367847 ba_ar 3613807
44 PRJNA261945 bacteria 658612 ba_ar 5790538
45 PRJNA183309 bacteria 1217712 ba_ar 4047559
46 PRJNA27951 bacteria 485918 ba_ar 9127347
47 PRJNA261104 bacteria 1406 ba_ar 5762608
48 PRJNA183018 bacteria 1121377 ba_ar 4452642
49 PRJNA238302 bacteria 652017 ba_ar 3040130
50 PRJNA42009 bacteria 694427 ba_ar 3685504
51 PRJNA80827 bacteria 1131731 ba_ar 4223247
52 PRJNA171367 bacteria 1005941 ba_ar 3426806
53 PRJNA232079 bacteria 1312959 ba_ar 3894834
54 PRJNA242829 bacteria 1927 ba_ar 8197540
55 PRJNA165395 bacteria 1121917 ba_ar 3059517
56 PRJNA81617 bacteria 979226 ba_ar 4772825
57 PRJNA13473 bacteria 323097 ba_ar 4406967
58 PRJNA186910 bacteria 1285583 ba_ar 3113488
59 PRJNA63851 bacteria 929712 ba_ar 5521807
60 PRJNA18505 archaea 456442 ba_ar 2542943
61 PRJNA20391 bacteria 315750 ba_ar 3704465
62 PRJNA12634 bacteria 290397 ba_ar 5013479
63 PRJNA17707 bacteria 404380 ba_ar 4615150
64 PRJNA232351 bacteria 1439940 ba_ar 4865289
65 PRJNA42475 bacteria 443255 ba_ar 6760392
66 PRJNA168616 bacteria 1121362 ba_ar 3135752
67 PRJNA182711 bacteria 1124780 ba_ar 4805697
68 PRJNA15771 bacteria 204669 ba_ar 5650368
69 PRJNA256039 bacteria 236814 ba_ar 4364663
70 PRJNA190819 bacteria 1298862 ba_ar 5641932
71 RNODE_165_length_1669_cov_2.37887 plasmid 45202 pl_vi_unk 16470
72 RNODE_178_length_3618_cov_3.32119 plasmid 45202 pl_vi_unk 35960
73 RNODE_201_length_2438_cov_4.66846 unknown 32644 pl_vi_unk 24160
74 RNODE_241_length_10461_cov_3.82000 virus 1214906 pl_vi_unk 104390
75 RNODE_244_length_5333_cov_28.75937 plasmid 45202 pl_vi_unk 53110
76 RNODE_250_length_5534_cov_8.34289 unknown 32644 pl_vi_unk 55120
77 RNODE_262_length_4000_cov_14.35495 plasmid 45202 pl_vi_unk 39780
78 RNODE_278_length_7999_cov_5.67456 plasmid 45202 pl_vi_unk 79770
79 RNODE_288_length_3164_cov_59.95449 plasmid 45202 pl_vi_unk 31420
80 RNODE_293_length_11038_cov_5.01480 plasmid 45202 pl_vi_unk 110160
81 RNODE_297_length_1810_cov_21.16499 plasmid 45202 pl_vi_unk 17880
82 RNODE_327_length_1948_cov_35.31308 unknown 32644 pl_vi_unk 19260
83 RNODE_329_length_4403_cov_3.14905 unknown 32644 pl_vi_unk 43810
84 RNODE_405_length_1632_cov_8.96149 plasmid 45202 pl_vi_unk 16100
85 RNODE_41_length_1982_cov_209.61480 plasmid 45202 pl_vi_unk 19600
86 RNODE_447_length_1716_cov_15.11213 plasmid 45202 pl_vi_unk 16720
87 RNODE_454_length_3059_cov_7.56632 unknown 32644 pl_vi_unk 29710
88 RNODE_45_length_2213_cov_3.74441 unknown 32644 pl_vi_unk 21910
89 RNODE_466_length_2414_cov_22.47806 plasmid 45202 pl_vi_unk 23480
90 RNODE_482_length_12608_cov_5.40546 plasmid 45202 pl_vi_unk 125640
91 RNODE_4_length_1367_cov_2.97100 unknown 32644 pl_vi_unk 13450
92 RNODE_519_length_4527_cov_27.40475 plasmid 45202 pl_vi_unk 44830
93 RNODE_53_length_5801_cov_10.05503 unknown 32644 pl_vi_unk 57790
94 RNODE_567_length_2043_cov_21.53474 unknown 32644 pl_vi_unk 19550
95 RNODE_568_length_4215_cov_4.24868 plasmid 45202 pl_vi_unk 41710
96 RNODE_595_length_1522_cov_10.57355 unknown 32644 pl_vi_unk 14780
97 RNODE_59_length_7149_cov_6.99495 plasmid 45202 pl_vi_unk 71270
98 RNODE_6_length_1955_cov_9.45784 plasmid 45202 pl_vi_unk 19330
99 RNODE_72_length_2497_cov_7.54950 unknown 32644 pl_vi_unk 24750
100 RNODE_86_length_2125_cov_11.11507 unknown 32644 pl_vi_unk 21030
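The rank-to-group relationship in the March 16 table can be summarised as a lookup. The Python sketch below is only an illustration inferred from the rows shown, not the logic in `subset_data.R`. `RANK_TO_GROUP`, `OMF_TAXIDS`, and `group_for` are hypothetical names, and treating the three OMF taxids as a fixed set is an assumption.

```python
# Taxids of the three genomes labelled OMF in the table above.
OMF_TAXIDS = {156515, 305860, 1287689}

# Default group for each rank, as it appears in the table.
RANK_TO_GROUP = {
    "orchid": "host",
    "fungi": "rfungi",
    "bacteria": "ba_ar",
    "archaea": "ba_ar",
    "plasmid": "pl_vi_unk",
    "virus": "pl_vi_unk",
    "unknown": "pl_vi_unk",
}


def group_for(rank, taxid):
    """Return the group name for one (rank, taxid) pair."""
    # Fungi are split: the three OMF taxids get their own group.
    if rank == "fungi" and taxid in OMF_TAXIDS:
        return "OMF"
    return RANK_TO_GROUP[rank]


# A few (genome_id, rank, taxid) rows copied from the table.
rows = [
    ("Platanthera_zijinensis_chr", "orchid", 2320716),
    ("Phoma_radicina_MPI-SP2-AT-0466", "fungi", 565429),
    ("Tulasnella_calospora_Tulcal1", "fungi", 156515),
    ("PRJNA18505", "archaea", 456442),
    ("RNODE_241_length_10461_cov_3.82000", "virus", 1214906),
]

for genome_id, rank, taxid in rows:
    print(genome_id, rank, taxid, group_for(rank, taxid))
```

An unexpected rank raises a `KeyError` rather than being silently assigned to a group, which makes new categories visible as soon as they appear.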
If `taxid = 3702` and the taxid for the row is a strain (`3702.x`), `matching_row` becomes 0. I hot-fixed this by just adding it to the `not_euk` group. Correct this by making it strain-friendly.
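One possible way to make the lookup strain-friendly is to strip the strain suffix before matching, so `3702.1` is looked up as `3702`. The Python sketch below illustrates the idea only. `base_taxid`, `find_superkingdom`, and the small `profile` dictionary are hypothetical, and it assumes strain taxids always have the form `<species taxid>.<n>`.

```python
def base_taxid(taxid):
    """Drop a strain suffix: '3702.1' -> '3702', '3702' -> '3702'."""
    return str(taxid).partition(".")[0]


def find_superkingdom(taxid, profile):
    """Look up the superkingdom by species-level taxid.

    Returns None when the taxid is missing, so the caller can decide
    what to do instead of silently defaulting to a group.
    """
    return profile.get(base_taxid(taxid))


# Hypothetical taxid -> superkingdom map built from the profile file.
profile = {
    "3702": "Eukaryota",
    "2320716": "Eukaryota",
}

print(find_superkingdom("3702.1", profile))
print(find_superkingdom("2320716", profile))
print(find_superkingdom("9999999.2", profile))
```

Returning `None` for an unmatched taxid keeps the "no matching row" case explicit, rather than dropping a eukaryote strain into `not_euk`.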