liberjul / CONSTAXv2

MIT License

combined_ to constax_taxonomy.txt alignment #11

Closed Gian77 closed 1 year ago

Gian77 commented 1 year ago

Hi @liberjul

Would you mind checking these results I am getting with the UNITE eukaryotic database?

At the Phylum level of constax_taxonomy.txt I have these labels:

> levels(as.factor(constax_ITS$Phylum))
 [1] ""                              "Agaricomycetes_1"              "Annelida_1"                    "Anthophyta_1"                  "Aphelidiomycota_1"            
 [6] "Arthropoda_1"                  "Ascomycota_1"                  "Basidiobolomycota_1"           "Basidiomycota_1"               "Blastocladiomycota_1"         
[11] "Bryophyta_1"                   "Calcarisporiellomycota_1"      "Cercozoa_1"                    "Choanoflagellata_1"            "Chrysosporium lobatum "       
[16] "Chytridiomycota_1"             "Cladosporium cladosporioides " "Cnidaria_1"                    "Collembola_1"                  "Entomophthoromycota_1"        
[21] "Entorrhizomycota_1"            "Fusarium equiseti "            "Gastrotricha_1"                "Glomerales_1"                  "Glomeromycetes_1"             
[26] "Glomeromycota_1"               "Hydnaceae_1"                   "Kickxellomycota_1"             "Lecanoromycetes_1"             "Monoblepharomycota_1"         
[31] "Mortierellomycota_1"           "Mucoromycota_1"                "Neocallimastigomycota_1"       "Neonectria candida "           "Olpidiomycota_1"              
[36] "Phlyctidaceae_1"               "Platyhelminthes_1"             "Rotifera_1"                    "Rozellomycota_1"               "Solicoccozyma aeria "         
[41] "Truncatella angustata "        "Turbellaria_1"                 "Zoopagomycota_1"    
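
One way to pull the odd entries out of a label list like the one above is to flag anything that looks like a binomial (two words) rather than a single `Name_1`-style token. The following is only a minimal sketch in Python, not part of CONSTAX: `looks_like_species` is a hypothetical helper, and the list holds just a handful of the labels from the output above.

```python
# Hypothetical helper, not part of CONSTAX: flag Phylum labels that look
# like species binomials ("Genus epithet") instead of "Name_1" tokens.

def looks_like_species(label: str) -> bool:
    """Return True if the label has two or more whitespace-separated words."""
    return len(label.split()) >= 2

# A few of the Phylum labels copied from the levels() output above,
# including their trailing spaces.
phylum_labels = [
    "",
    "Ascomycota_1",
    "Basidiomycota_1",
    "Chrysosporium lobatum ",
    "Fusarium equiseti ",
    "Solicoccozyma aeria ",
    "Zoopagomycota_1",
]

# Strip the trailing whitespace only for display.
suspicious = [lab.strip() for lab in phylum_labels if looks_like_species(lab)]
print(suspicious)
```

This check only catches binomials. Labels such as Agaricomycetes_1 (a class) or Glomerales_1 (an order) also show up at the Phylum level in the output above and would need a different test, for example a comparison against the consensus columns.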

For example, FOTU_104 has Solicoccozyma aeria at the Phylum level.

> constax_ITS %>% 
+ rownames_to_column("OTU_ID") %>% 
+   filter(OTU_ID == "FOTU_104")
    OTU_ID Kingdom               Phylum Class Order Family Genus Species Isolate Isolate_percent_id Isolate_query_cover High_level_taxonomy HL_hit_percent_id
1 FOTU_104 Fungi_1 Solicoccozyma aeria                                                            0                   0               Fungi               100
  HL_hit_query_cover
1                100

But according to combined_taxonomy.txt, this FOTU should be marked as not classified at the Phylum level. Am I right?

> combined_ITS_taxonomy %>% 
+   rownames_to_column("OTU_ID") %>% 
+   filter(OTU_ID == "FOTU_104")
    OTU_ID Kingdom_RDP Kingdom_BLAST Kingdom_SINTAX Kingdom_Consensus      Phylum_RDP   Phylum_BLAST Phylum_SINTAX Phylum_Consensus         Class_RDP    Class_BLAST
1 FOTU_104     Fungi_1       Fungi_1        Fungi_1           Fungi_1 Basidiomycota_1 Incertae_sedis                                Tremellomycetes_1 Incertae_sedis
  Class_SINTAX Class_Consensus        Order_RDP    Order_BLAST Order_SINTAX Order_Consensus        Family_RDP   Family_BLAST Family_SINTAX Family_Consensus
1                              Filobasidiales_1 Incertae_sedis                              Piskurozymaceae_1 Incertae_sedis                               
        Genus_RDP    Genus_BLAST Genus_SINTAX Genus_Consensus          Species_RDP Species_BLAST Species_SINTAX    Species_Consensus
1 Solicoccozyma_1 Incertae_sedis                              Solicoccozyma aeria                            NA Solicoccozyma aeria 
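
The manual comparison above can be sketched as a small script that checks, rank by rank, whether the value in constax_taxonomy.txt matches the corresponding `*_Consensus` column in combined_taxonomy.txt. This is a hedged illustration rather than anything shipped with CONSTAX: the tab-separated layout, the `OTU_ID` header name, and the function names `read_table` and `check_alignment` are assumptions, and the inline strings stand in for the real files, keeping only the FOTU_104 values shown above.

```python
import csv
import io

# Inline stand-ins for the two files (assumed tab-separated, with an
# assumed "OTU_ID" header). Values are the FOTU_104 entries shown above.
constax_txt = (
    "OTU_ID\tKingdom\tPhylum\n"
    "FOTU_104\tFungi_1\tSolicoccozyma aeria \n"
)
combined_txt = (
    "OTU_ID\tKingdom_Consensus\tPhylum_Consensus\n"
    "FOTU_104\tFungi_1\t\n"
)

def read_table(text):
    """Parse a tab-separated table into {OTU_ID: row_dict}."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["OTU_ID"]: row for row in reader}

def check_alignment(constax, combined, ranks):
    """List (otu, rank, constax_value, consensus_value) where the two differ."""
    problems = []
    for otu, row in constax.items():
        for rank in ranks:
            final = row[rank].strip()
            consensus = combined[otu][f"{rank}_Consensus"].strip()
            if final != consensus:
                problems.append((otu, rank, final, consensus))
    return problems

result = check_alignment(
    read_table(constax_txt), read_table(combined_txt), ["Kingdom", "Phylum"]
)
for otu, rank, final, consensus in result:
    print(f"{otu} {rank}: constax={final!r} consensus={consensus!r}")
```

With the real files, the same loop could be extended to the remaining ranks (Class through Species) to list every OTU whose final assignment does not line up with its consensus column.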

Do you mind double-checking what's going on?

I am attaching a zipped test.fasta with a few FOTUs that had this weird classification. Thanks,

G.

test.zip

liberjul commented 1 year ago

That is interesting, and maybe it has to do with the lack of phylum classifications for BLAST and SINTAX. I'll investigate.

liberjul commented 1 year ago

@Gian77 can you upload your output directory? I assume that the issue will be in CombineTaxonomy.py.

Gian77 commented 1 year ago

@liberjul

Cool! Going to send it by email ;)

liberjul commented 1 year ago

Fixed with an update to the vote() function in CombineTaxonomy.py. https://github.com/liberjul/CONSTAXv2/commit/df0a66ccae3c6a36e33b309bfaf7d451b68e80fc