ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

CAMISIM: Fix Rscript get_data.R #26

Closed ndreey closed 1 year ago

ndreey commented 1 year ago

If taxid = 3702 and the taxid for the row is a strain (3702.x), matching_row comes back empty (length 0). As a hotfix I just add the taxid to the not_euk group. Correct this by making the match strain-friendly.

  # Get taxonomic group ID, V3 = TAXPATH, V1 = TAXID
  # Find the row where V1 column matches taxid exactly
  matching_row <- grep(paste0("^", taxid, "$"), taxonomic_profile$V1)

  if (length(matching_row) == 0) {
    # add taxid to fix_list and move to next iteration
    fix_list <- c(fix_list, taxid)
    list_tax_group <- c(list_tax_group, 2)
    list_group <- c(list_group, "not_euk")
    next
  }
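
A minimal sketch of what a strain-friendly lookup could look like, assuming strain taxids always take the form `<species taxid>.<n>` as in the 3702.x example above. The helper name `find_taxid_row` and the toy `ids` vector are hypothetical and only stand in for `taxonomic_profile$V1`; this is not the fix that ended up in get_data.R.

```r
# Hypothetical helper: return the row index for a taxid. If there is no exact
# match, fall back to comparing the taxids with any strain suffix (".x") removed.
find_taxid_row <- function(taxid, profile_ids) {
  taxid <- as.character(taxid)
  profile_ids <- as.character(profile_ids)

  # Exact match first, same intent as the grep("^taxid$") call
  hit <- which(profile_ids == taxid)

  if (length(hit) == 0) {
    # Strip everything from the first "." onwards, e.g. "3702.1" -> "3702"
    base_id <- sub("\\..*$", "", taxid)
    hit <- which(sub("\\..*$", "", profile_ids) == base_id)
  }
  hit
}

# Toy stand-in for taxonomic_profile$V1 (assumed values)
ids <- c("2759", "3702.1", "5507")

find_taxid_row(3702, ids)    # matches the strain row 3702.1
find_taxid_row("5507", ids)  # exact match
find_taxid_row("9999", ids)  # no match, length 0
```

Comparing with `==` instead of building a regular expression also avoids surprises if a taxid ever contains a character that is special in a regex, such as the "." in a strain id.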
ndreey commented 1 year ago

Actually, I will look into NCBITaxa, taxize and other taxonomy R packages to see if I can run the taxids through them to separate fungi, orchid and not_euk.

ndreey commented 1 year ago

Correction

I am using taxonomic_profile_1.txt as a taxonomy database to map each taxid to a superkingdom. There is no need to get info on the strain taxids, as they are an artifact of a simulated run. taxonomic_profile_1.txt is generated during the CAMISIM run.

I removed the thaliana row and added these lines to taxonomic_profile_1.txt:

2320716 species 2759|2320716    Eukaryota|Platanthera zijinensis    0.0161      
156515  species 2759|156515 Eukaryota|Tulasnella calospora  0.0161      
1287689 species 2759|1287689    Eukaryota|Rhizoctonia solani    0.0161      
305860  species 2759|305860 Eukaryota|Ceratobasidium sp. AG-I   0.0161      

The script now generates 897 rows, which matches my total number of genomes.

What I need to do now is add an if statement that puts P. zijinensis in its own group so I get (orchid, euk, not_euk).

Might just change the group names to (orchid, fungi, not_euk).
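
The three-way split could be sketched roughly like this, assuming the first element of TAXPATH is the superkingdom taxid (2759 for Eukaryota, as in the rows above) and that every eukaryote other than P. zijinensis in the dataset is a fungus. The helper name `assign_group` and the bacterial TAXPATH `2|1406` in the toy data are assumptions for illustration, not taken from get_data.R.

```r
# Hypothetical helper: assign orchid / fungi / not_euk from a taxid and TAXPATH.
assign_group <- function(taxid, taxpath, orchid_taxid = "2320716") {
  # TAXPATH looks like "2759|2320716"; the first element is the superkingdom
  superkingdom <- strsplit(taxpath, "|", fixed = TRUE)[[1]][1]

  if (as.character(taxid) == orchid_taxid) {
    "orchid"
  } else if (superkingdom == "2759") {
    # Assumption: all remaining eukaryotes in this dataset are fungi
    "fungi"
  } else {
    "not_euk"
  }
}

# Toy stand-in for taxonomic_profile (V1 = TAXID, V3 = TAXPATH)
profile <- data.frame(
  V1 = c("2320716", "156515", "1406"),
  V3 = c("2759|2320716", "2759|156515", "2|1406"),  # last TAXPATH is made up
  stringsAsFactors = FALSE
)

groups <- mapply(assign_group, profile$V1, profile$V3, USE.NAMES = FALSE)
groups
```

Checking the orchid taxid before the superkingdom matters here, since P. zijinensis is also a eukaryote and would otherwise land in the fungi group.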

ndreey commented 1 year ago

I will close this issue when get_data groups them into three groups.

ndreey commented 1 year ago

subset_data.R creates this data frame as of March 15.

                                       genome_id   taxid     rank       size
1                     Platanthera_zijinensis_chr 2320716     host 4186550321
2                 Phoma_radicina_MPI-SP2-AT-0466  565429    fungi   38215437
3          Aspergillus_fumigatus_MPI-SW4-AT-0569  746128    fungi   47521149
4           Sarocladium_strictum_MPI-IT2-AT-0306    5046    fungi   31483859
5       Pyrenochaeta_lycopersici_MPI-FR1-AT-0381  285811    fungi   54218686
6          Peyronellaea_curtisii_MPI-SP2-AT-0415  749631    fungi   36822598
7       Talaromyces_verruculosus_MPI-SP2-AT-0411  198730    fungi   32407473
8          Penicillium_canescens_MPI-SW4-AT-0573    5083    fungi   32967678
9   Cylindrocarpon_pauciseptatum_MPI-SP2-AT-0468  465806    fungi   60797881
10          Verticillium_dahliae_MPI-FR1-AT-0353   27337    fungi   36758448
11            Alternaria_tenuis_MPI-SDFR-AT-0071    5599    fungi   34443718
12           Fusarium_oxysporum_MPI-CAGE-AT-0013    5507    fungi   56548958
13       Cladosporium_rectoides_MPI-GEGE-AT-0032  887101    fungi   33942924
14      Gibellulopsis_nigrescens_MPI-SP2-AT-0410  796325    fungi   35340245
15  Metacordyceps_chlamydosporia_MPI-IT2-AT-0323  280754    fungi   43916484
16       Ochroconis_tshawytschae_MPI-SP2-AT-0416  262132    fungi   36199058
17          Phomopsis_columnaris_MPI-SP2-AT-0504  193000    fungi   88466628
18       Umbelopsis_autotrophica_MPI-SW4-AT-0611  979767    fungi   27968122
19      Embellisia_chlamydospora_MPI-FR1-AT-0336  247032    fungi   37793906
20        Phialocephala_fortinii_MPI-SW4-AT-0551   62722    fungi   76760107
21         Phaeosphaeria_eustoma_MPI-FR1-AT-0339   85909    fungi   51461602
22         Truncatella_angustata_MPI-SW4-AT-0541  152316    fungi   58029377
23       Macrophomina_phaseolina_MPI-FR1-AT-0330   35725    fungi   58552437
24             Rhizopycnis_vagum_MPI-FR1-AT-0346 1589764    fungi   53344606
25   Paraphoma_chrysanthemicola_MPI-SDFR-AT-0093  798071    fungi   42351520
26        Neonectria_radicicola_MPI-CAGE-CH-0236   64609    fungi   75485037
27       Cryptosporiopsis_ericae_MPI-SW4-AT-0549 1663492    fungi   64936402
28    Leptodontidium_orchidicola_MPI-SW4-AT-0643 1732013    fungi   79839788
29           Myrothecium_cinctum_MPI-SP2-AT-0408 1860054    fungi   44683684
30      Ilyonectria_macrodidyma_MPI-GEGE-AT-0033  307937    fungi   74816530
31          Hypocrea_atroviridis_MPI-SP2-AT-0434   63577    fungi   39107378
32   Plectosphaerella_cucumerina_MPI-FR1-AT-0340   40658    fungi   39476090
33                  Tulasnella_calospora_Tulcal1  156515      OMF   70372952
34                      Ceratobasidium_sp_CerAGI  305860      OMF   58444101
35                   Rhizoctonia_solani_Rhisola1 1287689      OMF   39822884
36                                      Otu721.0  627192 bacteria    4199332
37                                        Otu967 1345695 bacteria    5107814
38                                        Otu876  452863 bacteria    4395537
39                                      Otu398.0 1283299 bacteria    5695238
40                                     Otu1003.0  590998 bacteria    4266344
41                                       Otu1093  591158 bacteria    9175669
42                                       Otu1015  324057 bacteria    7184930
43                                      Otu633.0 1367847 bacteria    3613807
44                                       Otu31.0  658612 bacteria    5790538
45                                        Otu902 1217712 bacteria    4047559
46                                      Otu480.0  485918 bacteria    9127347
47                                        Otu884    1406 bacteria    5762608
48                                        Otu977 1121377 bacteria    4452642
49                                        Otu983  652017 bacteria    3040130
50                                        Otu997  694427 bacteria    3685504
51                                       Otu1016 1131731 bacteria    4223247
52                                        Otu856 1005941 bacteria    3426806
53                                      Otu890.0 1312959 bacteria    3894834
54                                        Otu964    1927 bacteria    8197540
55                                        Otu831 1121917 bacteria    3059517
56                                      Otu560.0  979226 bacteria    4772825
57                                        Otu981  323097 bacteria    4406967
58                                      Otu934.0 1285583 bacteria    3113488
59                                     Otu1072.0  929712 bacteria    5521807
60                                        Otu813  456442  archaea    2542943
61                                        Otu905  315750 bacteria    3704465
62                                       Otu1059  290397 bacteria    5013479
63                                     Otu1044.0  404380 bacteria    4615150
64                                        Otu872 1439940 bacteria    4865289
65                                        Otu935  443255 bacteria    6760392
66                                        Otu931 1121362 bacteria    3135752
67                                      Otu258.0 1124780 bacteria    4805697
68                                      Otu981.0  204669 bacteria    5650368
69                                      Otu618.0  236814 bacteria    4364663
70                                      Otu675.0 1298862 bacteria    5641932
71             RNODE_165_length_1669_cov_2.37887   45202  plasmid      16470
72             RNODE_178_length_3618_cov_3.32119   45202  plasmid      35960
73             RNODE_201_length_2438_cov_4.66846   32644  unknown      24160
74            RNODE_241_length_10461_cov_3.82000 1214906    virus     104390
75            RNODE_244_length_5333_cov_28.75937   45202  plasmid      53110
76             RNODE_250_length_5534_cov_8.34289   32644  unknown      55120
77            RNODE_262_length_4000_cov_14.35495   45202  plasmid      39780
78             RNODE_278_length_7999_cov_5.67456   45202  plasmid      79770
79            RNODE_288_length_3164_cov_59.95449   45202  plasmid      31420
80            RNODE_293_length_11038_cov_5.01480   45202  plasmid     110160
81            RNODE_297_length_1810_cov_21.16499   45202  plasmid      17880
82            RNODE_327_length_1948_cov_35.31308   32644  unknown      19260
83             RNODE_329_length_4403_cov_3.14905   32644  unknown      43810
84             RNODE_405_length_1632_cov_8.96149   45202  plasmid      16100
85            RNODE_41_length_1982_cov_209.61480   45202  plasmid      19600
86            RNODE_447_length_1716_cov_15.11213   45202  plasmid      16720
87             RNODE_454_length_3059_cov_7.56632   32644  unknown      29710
88              RNODE_45_length_2213_cov_3.74441   32644  unknown      21910
89            RNODE_466_length_2414_cov_22.47806   45202  plasmid      23480
90            RNODE_482_length_12608_cov_5.40546   45202  plasmid     125640
91               RNODE_4_length_1367_cov_2.97100   32644  unknown      13450
92            RNODE_519_length_4527_cov_27.40475   45202  plasmid      44830
93             RNODE_53_length_5801_cov_10.05503   32644  unknown      57790
94            RNODE_567_length_2043_cov_21.53474   32644  unknown      19550
95             RNODE_568_length_4215_cov_4.24868   45202  plasmid      41710
96            RNODE_595_length_1522_cov_10.57355   32644  unknown      14780
97              RNODE_59_length_7149_cov_6.99495   45202  plasmid      71270
98               RNODE_6_length_1955_cov_9.45784   45202  plasmid      19330
99              RNODE_72_length_2497_cov_7.54950   32644  unknown      24750
100            RNODE_86_length_2125_cov_11.11507   32644  unknown      21030
ndreey commented 1 year ago

source_genomes/ contains: 894 genomes

Mock Data contains 100 genomes and is a subset of the 897 genomes.

TOTAL: 100

I will close #26 after confirmation with supervisor.


ndreey commented 1 year ago

32

Updated the script to keep the BioProject references instead of giving them the "Otu" names (I believe those are made up). These are the references the bacteria/archaea have.

36                                    PRJNA67115  627192 bacteria    4199332
37                                   PRJNA217481 1345695 bacteria    5107814
38                                    PRJNA20011  452863 bacteria    4395537
39                                   PRJNA186462 1283299 bacteria    5695238
40                                    PRJNA33691  590998 bacteria    4266344
41                                    PRJNA33599  591158 bacteria    9175669
42                                    PRJNA20399  324057 bacteria    7184930
43                                   PRJNA212980 1367847 bacteria    3613807
44                                   PRJNA261945  658612 bacteria    5790538
45                                   PRJNA183309 1217712 bacteria    4047559
46                                    PRJNA27951  485918 bacteria    9127347
47                                   PRJNA261104    1406 bacteria    5762608
48                                   PRJNA183018 1121377 bacteria    4452642
49                                   PRJNA238302  652017 bacteria    3040130
50                                    PRJNA42009  694427 bacteria    3685504
51                                    PRJNA80827 1131731 bacteria    4223247
52                                   PRJNA171367 1005941 bacteria    3426806
53                                   PRJNA232079 1312959 bacteria    3894834
54                                   PRJNA242829    1927 bacteria    8197540
55                                   PRJNA165395 1121917 bacteria    3059517
56                                    PRJNA81617  979226 bacteria    4772825
57                                    PRJNA13473  323097 bacteria    4406967
58                                   PRJNA186910 1285583 bacteria    3113488
59                                    PRJNA63851  929712 bacteria    5521807
60                                    PRJNA18505  456442  archaea    2542943
61                                    PRJNA20391  315750 bacteria    3704465
62                                    PRJNA12634  290397 bacteria    5013479
63                                    PRJNA17707  404380 bacteria    4615150
64                                   PRJNA232351 1439940 bacteria    4865289
65                                    PRJNA42475  443255 bacteria    6760392
66                                   PRJNA168616 1121362 bacteria    3135752
67                                   PRJNA182711 1124780 bacteria    4805697
68                                    PRJNA15771  204669 bacteria    5650368
69                                   PRJNA256039  236814 bacteria    4364663
70                                   PRJNA190819 1298862 bacteria    5641932
ndreey commented 1 year ago

New update on the mock_df structure (March 16) #34.

                                       genome_id     rank   taxid     group       size
1                     Platanthera_zijinensis_chr   orchid 2320716      host 4186550321
2                 Phoma_radicina_MPI-SP2-AT-0466    fungi  565429    rfungi   38215437
3          Aspergillus_fumigatus_MPI-SW4-AT-0569    fungi  746128    rfungi   47521149
4           Sarocladium_strictum_MPI-IT2-AT-0306    fungi    5046    rfungi   31483859
5       Pyrenochaeta_lycopersici_MPI-FR1-AT-0381    fungi  285811    rfungi   54218686
6          Peyronellaea_curtisii_MPI-SP2-AT-0415    fungi  749631    rfungi   36822598
7       Talaromyces_verruculosus_MPI-SP2-AT-0411    fungi  198730    rfungi   32407473
8          Penicillium_canescens_MPI-SW4-AT-0573    fungi    5083    rfungi   32967678
9   Cylindrocarpon_pauciseptatum_MPI-SP2-AT-0468    fungi  465806    rfungi   60797881
10          Verticillium_dahliae_MPI-FR1-AT-0353    fungi   27337    rfungi   36758448
11            Alternaria_tenuis_MPI-SDFR-AT-0071    fungi    5599    rfungi   34443718
12           Fusarium_oxysporum_MPI-CAGE-AT-0013    fungi    5507    rfungi   56548958
13       Cladosporium_rectoides_MPI-GEGE-AT-0032    fungi  887101    rfungi   33942924
14      Gibellulopsis_nigrescens_MPI-SP2-AT-0410    fungi  796325    rfungi   35340245
15  Metacordyceps_chlamydosporia_MPI-IT2-AT-0323    fungi  280754    rfungi   43916484
16       Ochroconis_tshawytschae_MPI-SP2-AT-0416    fungi  262132    rfungi   36199058
17          Phomopsis_columnaris_MPI-SP2-AT-0504    fungi  193000    rfungi   88466628
18       Umbelopsis_autotrophica_MPI-SW4-AT-0611    fungi  979767    rfungi   27968122
19      Embellisia_chlamydospora_MPI-FR1-AT-0336    fungi  247032    rfungi   37793906
20        Phialocephala_fortinii_MPI-SW4-AT-0551    fungi   62722    rfungi   76760107
21         Phaeosphaeria_eustoma_MPI-FR1-AT-0339    fungi   85909    rfungi   51461602
22         Truncatella_angustata_MPI-SW4-AT-0541    fungi  152316    rfungi   58029377
23       Macrophomina_phaseolina_MPI-FR1-AT-0330    fungi   35725    rfungi   58552437
24             Rhizopycnis_vagum_MPI-FR1-AT-0346    fungi 1589764    rfungi   53344606
25   Paraphoma_chrysanthemicola_MPI-SDFR-AT-0093    fungi  798071    rfungi   42351520
26        Neonectria_radicicola_MPI-CAGE-CH-0236    fungi   64609    rfungi   75485037
27       Cryptosporiopsis_ericae_MPI-SW4-AT-0549    fungi 1663492    rfungi   64936402
28    Leptodontidium_orchidicola_MPI-SW4-AT-0643    fungi 1732013    rfungi   79839788
29           Myrothecium_cinctum_MPI-SP2-AT-0408    fungi 1860054    rfungi   44683684
30      Ilyonectria_macrodidyma_MPI-GEGE-AT-0033    fungi  307937    rfungi   74816530
31          Hypocrea_atroviridis_MPI-SP2-AT-0434    fungi   63577    rfungi   39107378
32   Plectosphaerella_cucumerina_MPI-FR1-AT-0340    fungi   40658    rfungi   39476090
33                  Tulasnella_calospora_Tulcal1    fungi  156515       OMF   70372952
34                      Ceratobasidium_sp_CerAGI    fungi  305860       OMF   58444101
35                   Rhizoctonia_solani_Rhisola1    fungi 1287689       OMF   39822884
36                                    PRJNA67115 bacteria  627192     ba_ar    4199332
37                                   PRJNA217481 bacteria 1345695     ba_ar    5107814
38                                    PRJNA20011 bacteria  452863     ba_ar    4395537
39                                   PRJNA186462 bacteria 1283299     ba_ar    5695238
40                                    PRJNA33691 bacteria  590998     ba_ar    4266344
41                                    PRJNA33599 bacteria  591158     ba_ar    9175669
42                                    PRJNA20399 bacteria  324057     ba_ar    7184930
43                                   PRJNA212980 bacteria 1367847     ba_ar    3613807
44                                   PRJNA261945 bacteria  658612     ba_ar    5790538
45                                   PRJNA183309 bacteria 1217712     ba_ar    4047559
46                                    PRJNA27951 bacteria  485918     ba_ar    9127347
47                                   PRJNA261104 bacteria    1406     ba_ar    5762608
48                                   PRJNA183018 bacteria 1121377     ba_ar    4452642
49                                   PRJNA238302 bacteria  652017     ba_ar    3040130
50                                    PRJNA42009 bacteria  694427     ba_ar    3685504
51                                    PRJNA80827 bacteria 1131731     ba_ar    4223247
52                                   PRJNA171367 bacteria 1005941     ba_ar    3426806
53                                   PRJNA232079 bacteria 1312959     ba_ar    3894834
54                                   PRJNA242829 bacteria    1927     ba_ar    8197540
55                                   PRJNA165395 bacteria 1121917     ba_ar    3059517
56                                    PRJNA81617 bacteria  979226     ba_ar    4772825
57                                    PRJNA13473 bacteria  323097     ba_ar    4406967
58                                   PRJNA186910 bacteria 1285583     ba_ar    3113488
59                                    PRJNA63851 bacteria  929712     ba_ar    5521807
60                                    PRJNA18505  archaea  456442     ba_ar    2542943
61                                    PRJNA20391 bacteria  315750     ba_ar    3704465
62                                    PRJNA12634 bacteria  290397     ba_ar    5013479
63                                    PRJNA17707 bacteria  404380     ba_ar    4615150
64                                   PRJNA232351 bacteria 1439940     ba_ar    4865289
65                                    PRJNA42475 bacteria  443255     ba_ar    6760392
66                                   PRJNA168616 bacteria 1121362     ba_ar    3135752
67                                   PRJNA182711 bacteria 1124780     ba_ar    4805697
68                                    PRJNA15771 bacteria  204669     ba_ar    5650368
69                                   PRJNA256039 bacteria  236814     ba_ar    4364663
70                                   PRJNA190819 bacteria 1298862     ba_ar    5641932
71             RNODE_165_length_1669_cov_2.37887  plasmid   45202 pl_vi_unk      16470
72             RNODE_178_length_3618_cov_3.32119  plasmid   45202 pl_vi_unk      35960
73             RNODE_201_length_2438_cov_4.66846  unknown   32644 pl_vi_unk      24160
74            RNODE_241_length_10461_cov_3.82000    virus 1214906 pl_vi_unk     104390
75            RNODE_244_length_5333_cov_28.75937  plasmid   45202 pl_vi_unk      53110
76             RNODE_250_length_5534_cov_8.34289  unknown   32644 pl_vi_unk      55120
77            RNODE_262_length_4000_cov_14.35495  plasmid   45202 pl_vi_unk      39780
78             RNODE_278_length_7999_cov_5.67456  plasmid   45202 pl_vi_unk      79770
79            RNODE_288_length_3164_cov_59.95449  plasmid   45202 pl_vi_unk      31420
80            RNODE_293_length_11038_cov_5.01480  plasmid   45202 pl_vi_unk     110160
81            RNODE_297_length_1810_cov_21.16499  plasmid   45202 pl_vi_unk      17880
82            RNODE_327_length_1948_cov_35.31308  unknown   32644 pl_vi_unk      19260
83             RNODE_329_length_4403_cov_3.14905  unknown   32644 pl_vi_unk      43810
84             RNODE_405_length_1632_cov_8.96149  plasmid   45202 pl_vi_unk      16100
85            RNODE_41_length_1982_cov_209.61480  plasmid   45202 pl_vi_unk      19600
86            RNODE_447_length_1716_cov_15.11213  plasmid   45202 pl_vi_unk      16720
87             RNODE_454_length_3059_cov_7.56632  unknown   32644 pl_vi_unk      29710
88              RNODE_45_length_2213_cov_3.74441  unknown   32644 pl_vi_unk      21910
89            RNODE_466_length_2414_cov_22.47806  plasmid   45202 pl_vi_unk      23480
90            RNODE_482_length_12608_cov_5.40546  plasmid   45202 pl_vi_unk     125640
91               RNODE_4_length_1367_cov_2.97100  unknown   32644 pl_vi_unk      13450
92            RNODE_519_length_4527_cov_27.40475  plasmid   45202 pl_vi_unk      44830
93             RNODE_53_length_5801_cov_10.05503  unknown   32644 pl_vi_unk      57790
94            RNODE_567_length_2043_cov_21.53474  unknown   32644 pl_vi_unk      19550
95             RNODE_568_length_4215_cov_4.24868  plasmid   45202 pl_vi_unk      41710
96            RNODE_595_length_1522_cov_10.57355  unknown   32644 pl_vi_unk      14780
97              RNODE_59_length_7149_cov_6.99495  plasmid   45202 pl_vi_unk      71270
98               RNODE_6_length_1955_cov_9.45784  plasmid   45202 pl_vi_unk      19330
99              RNODE_72_length_2497_cov_7.54950  unknown   32644 pl_vi_unk      24750
100            RNODE_86_length_2125_cov_11.11507  unknown   32644 pl_vi_unk      21030
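
For reference, the rank-to-group mapping visible in the table above could be reproduced along these lines. This is only a sketch: the function name `assign_mock_group` is hypothetical, the OMF taxids are read off rows 33 to 35, and the real subset_data.R may derive the groups differently.

```r
# Hypothetical helper: derive the group column from rank and taxid,
# following the pattern in the mock_df table.
assign_mock_group <- function(rank, taxid,
                              omf_taxids = c(156515, 305860, 1287689)) {
  ifelse(rank == "orchid", "host",
  ifelse(rank == "fungi" & taxid %in% omf_taxids, "OMF",
  ifelse(rank == "fungi", "rfungi",
  ifelse(rank %in% c("bacteria", "archaea"), "ba_ar",
         "pl_vi_unk"))))  # plasmid, virus and unknown all end up here
}

# A few rows copied from the table above
mock_df <- data.frame(
  genome_id = c("Platanthera_zijinensis_chr",
                "Phoma_radicina_MPI-SP2-AT-0466",
                "Tulasnella_calospora_Tulcal1",
                "PRJNA18505",
                "RNODE_241_length_10461_cov_3.82000"),
  rank  = c("orchid", "fungi", "fungi", "archaea", "virus"),
  taxid = c(2320716, 565429, 156515, 456442, 1214906),
  stringsAsFactors = FALSE
)

mock_df$group <- assign_mock_group(mock_df$rank, mock_df$taxid)
table(mock_df$group)
```

Running `table()` on the full data frame is a quick way to confirm that the group sizes add up to the expected 100 genomes.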