top_taxa not properly grouping samples according to the grouping parameter

chymoncada commented 3 months ago

Hi, first of all I want to say thanks for developing such a cool tool for analyzing microbiome data! It is super helpful.

I am currently having issues with the top_taxa function. I have a phyloseq object (rel_2123_core) for which I want to calculate the most abundant families in each grouping. I have 30 groups (Date_Fraction in my phyloseq sample data) in my ps_obj, and I was hoping to get the top 20 families for each group, by using:

top_families_datefraction <- top_taxa(rel_2123_core, n_taxa = 20, tax_level = "Family", include_na_taxa = T, grouping = "Date_Fraction", by_proportion = FALSE)

I set by_proportion = FALSE since these are already relative abundances. and I get:

$ps_obj
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 21 taxa and 159 samples ]
sample_data() Sample Data:       [ 159 samples by 7 sample variables ]
tax_table()   Taxonomy Table:    [ 21 taxa by 7 taxonomic ranks ]

$top_taxa
   Date_Fraction tax_rank   taxid  abundance  Kingdom            Phylum               Class
1           <NA>        1   ASV_1 0.18584254 Bacteria      Bacteroidota         Bacteroidia
2           <NA>       11   ASV_3 0.01831254 Bacteria    Proteobacteria Alphaproteobacteria
3           <NA>        2   ASV_6 0.04227999 Bacteria    Proteobacteria Gammaproteobacteria
4           <NA>        5  ASV_17 0.02620376 Bacteria    Proteobacteria Gammaproteobacteria
5           <NA>        4  ASV_21 0.03374239 Bacteria    Proteobacteria Gammaproteobacteria
6           <NA>       14  ASV_22 0.01520183 Bacteria  Desulfobacterota       Desulfobulbia
7           <NA>        9  ASV_23 0.01953960 Bacteria  Actinobacteriota      Acidimicrobiia
8           <NA>       10  ASV_30 0.01929713 Bacteria    Proteobacteria Gammaproteobacteria
9           <NA>       12  ASV_39 0.01735695 Bacteria    Proteobacteria Gammaproteobacteria
10          <NA>        7  ASV_47 0.02162400 Bacteria    Proteobacteria Gammaproteobacteria
11          <NA>       17  ASV_57 0.01266258 Bacteria      Bacteroidota         Bacteroidia
12          <NA>        3  ASV_61 0.03454504 Bacteria    Proteobacteria Alphaproteobacteria
13          <NA>       16  ASV_70 0.01300039 Bacteria  Actinobacteriota      Acidimicrobiia
14          <NA>       19  ASV_73 0.01152922 Bacteria Verrucomicrobiota     Kiritimatiellae
15          <NA>       15  ASV_87 0.01376076 Bacteria      Bacteroidota         Bacteroidia
16          <NA>       20  ASV_88 0.01130768 Bacteria  Desulfobacterota       Desulfobulbia
17          <NA>        8  ASV_96 0.02120873 Bacteria      Bacteroidota         Bacteroidia
18          <NA>       18 ASV_110 0.01161420 Bacteria Verrucomicrobiota    Verrucomicrobiae
19          <NA>       13 ASV_121 0.01600708 Bacteria    Proteobacteria Gammaproteobacteria
20          <NA>        6 ASV_198 0.02368030 Bacteria   Planctomycetota      Planctomycetes
                                Order             Family Genus Species
1                    Flavobacteriales  Flavobacteriaceae  <NA>    <NA>
2                         SAR11 clade            Clade I  <NA>    <NA>
3                  Steroidobacterales        Woeseiaceae  <NA>    <NA>
4                       Thiotrichales     Thiotrichaceae  <NA>    <NA>
5  Gammaproteobacteria Incertae Sedis     Unknown Family  <NA>    <NA>
6                     Desulfobulbales   Desulfocapsaceae  <NA>    <NA>
7                     Actinomarinales               <NA>  <NA>    <NA>
8                     Pseudomonadales     Nitrincolaceae  <NA>    <NA>
9                    Enterobacterales      Colwelliaceae  <NA>    <NA>
10                   Enterobacterales  Psychromonadaceae  <NA>    <NA>
11                   Flavobacteriales  Crocinitomicaceae  <NA>    <NA>
12                    Rhodobacterales   Rhodobacteraceae  <NA>    <NA>
13                     Microtrichales    Microtrichaceae  <NA>    <NA>
14                  Kiritimatiellales Kiritimatiellaceae  <NA>    <NA>
15                       Cytophagales  Cyclobacteriaceae  <NA>    <NA>
16                    Desulfobulbales   Desulfobulbaceae  <NA>    <NA>
17                    Chitinophagales     Saprospiraceae  <NA>    <NA>
18                 Verrucomicrobiales     Rubritaleaceae  <NA>    <NA>
19                    Pseudomonadales         Halieaceae  <NA>    <NA>
20                       Pirellulales      Pirellulaceae  <NA>    <NA>

Which is not what I expected. I expected it to name 20 ASVs for each group. I know there are some overlaps between my groups but I know there are also unique families, so there should be more taxa if it does consider the grouping factor. I also checked this witn ntaxa=2, and also only got 2 families. The NAs under the Date_Fraction column of the result also tells me something is not right. I also tried the ps_obj with absolute abundances, set by_proportion = TRUE, and got the same result.

Am I using the top_taxa function wrong? If so, how do I correctly use the function so that it gets the top taxa for the specified group? Thanks a lot for your help!

gmteunisse commented 3 months ago

That does seem wrong. Does the example in the documentation work correctly for you?

Looking at your output, it seems that the Date_Fraction column is being converted to NA. What type is this column in your phyloseq object?

Could you provide a minimal reproducible example with your issue? This makes it a lot easier for me to inspect what is going on.

chymoncada commented 3 months ago

Yes, the GlobalPatterns example works for me. Grouping that by SampleType shows me the top n taxa for each sample type.

The Date_Fraction column is a factor, so I think that's fine?

I've emailed you a subset of my phyloseq object since I cant seem to upload .rds files here. That one has 6 different groups for the Date_Fraction parameter. Thanks!

gmteunisse commented 3 months ago

Thanks for checking that. I did not receive an email (and to be honest, I would rather not be emailed data), could you try an alternative (e.g. dput)?

In the meantime, I've checked with GlobalPatterns if there are any issues with your hypothetical setup. As far as I can tell, the function should be working as intended:

require("fantaxtic")
#> Loading required package: fantaxtic
require("phyloseq")
#> Loading required package: phyloseq
data(GlobalPatterns)
sample_data(GlobalPatterns)$SampleType <- as.factor(sample_data(GlobalPatterns)$SampleType)
top_asv <- top_taxa(GlobalPatterns, n_taxa = 2, grouping = "SampleType", tax_level="Class", by_proportion=F, include_na_taxa = T)
top_asv$top_taxa
#>            SampleType tax_rank  taxid  abundance  Kingdom          Phylum
#> 1          Freshwater        1 329744  546306.50 Bacteria  Actinobacteria
#> 2  Freshwater (creek)        1 549656 1090473.00 Bacteria   Cyanobacteria
#> 3          Freshwater        2 279599  522626.50 Bacteria   Cyanobacteria
#> 4  Freshwater (creek)        2 360229  112393.67 Bacteria  Proteobacteria
#> 5              Tongue        1 360229  381557.50 Bacteria  Proteobacteria
#> 6  Sediment (estuary)        1  94166   98640.00 Bacteria  Proteobacteria
#> 7                Skin        2  94166   81813.33 Bacteria  Proteobacteria
#> 8              Tongue        2  94166  283849.00 Bacteria  Proteobacteria
#> 9               Ocean        2 534609  216527.67 Bacteria  Proteobacteria
#> 10               Soil        1 534609  127965.33 Bacteria  Proteobacteria
#> 11 Sediment (estuary)        2 319044   92608.67 Bacteria  Proteobacteria
#> 12              Ocean        1 339261  398276.67 Bacteria   Bacteroidetes
#> 13              Feces        2 331820  607768.00 Bacteria   Bacteroidetes
#> 14               Mock        2 331820  271074.67 Bacteria   Bacteroidetes
#> 15               Soil        2 256977  111549.00 Bacteria Verrucomicrobia
#> 16              Feces        1 189047  683201.75 Bacteria      Firmicutes
#> 17               Mock        1 189047  413917.00 Bacteria      Firmicutes
#> 18               Skin        1  98605  117793.67 Bacteria      Firmicutes
#>                  Class Order Family Genus Species
#> 1       Actinobacteria  <NA>   <NA>  <NA>    <NA>
#> 2          Chloroplast  <NA>   <NA>  <NA>    <NA>
#> 3     Nostocophycideae  <NA>   <NA>  <NA>    <NA>
#> 4   Betaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 5   Betaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 6  Gammaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 7  Gammaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 8  Gammaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 9  Alphaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 10 Alphaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 11 Deltaproteobacteria  <NA>   <NA>  <NA>    <NA>
#> 12       Flavobacteria  <NA>   <NA>  <NA>    <NA>
#> 13         Bacteroidia  <NA>   <NA>  <NA>    <NA>
#> 14         Bacteroidia  <NA>   <NA>  <NA>    <NA>
#> 15      Spartobacteria  <NA>   <NA>  <NA>    <NA>
#> 16          Clostridia  <NA>   <NA>  <NA>    <NA>
#> 17          Clostridia  <NA>   <NA>  <NA>    <NA>
#> 18             Bacilli  <NA>   <NA>  <NA>    <NA>

That most likely means there is some incompatibility between your phyloseq object and fantaxtic. Could you show me your sample_data like below?

require("phyloseq")
#> Loading required package: phyloseq
require("tidyverse")
#> Loading required package: tidyverse
data(GlobalPatterns)
sample_data(GlobalPatterns) %>% data.frame(.) %>% tibble(.)
#> # A tibble: 26 × 7
#>    X.SampleID Primer  Final_Barcode Barcode_truncat… Barcode_full_le… SampleType
#>    <fct>      <fct>   <fct>         <fct>            <fct>            <fct>     
#>  1 CL3        ILBC_01 AACGCA        TGCGTT           CTAGCGTGCGT      Soil      
#>  2 CC1        ILBC_02 AACTCG        CGAGTT           CATCGACGAGT      Soil      
#>  3 SV1        ILBC_03 AACTGT        ACAGTT           GTACGCACAGT      Soil      
#>  4 M31Fcsw    ILBC_04 AAGAGA        TCTCTT           TCGACATCTCT      Feces     
#>  5 M11Fcsw    ILBC_05 AAGCTG        CAGCTT           CGACTGCAGCT      Feces     
#>  6 M31Plmr    ILBC_07 AATCGT        ACGATT           CGAGTCACGAT      Skin      
#>  7 M11Plmr    ILBC_08 ACACAC        GTGTGT           GCCATAGTGTG      Skin      
#>  8 F21Plmr    ILBC_09 ACACAT        ATGTGT           GTAGACATGTG      Skin      
#>  9 M31Tong    ILBC_10 ACACGA        TCGTGT           TGTGGCTCGTG      Tongue    
#> 10 M11Tong    ILBC_11 ACACGG        CCGTGT           TAGACACCGTG      Tongue    
#> # … with 16 more rows, and 1 more variable: Description <fct>

^{Created on 2024-07-08 by the reprex package (v2.0.1)}

chymoncada commented 3 months ago

Here is how to recreate small subset of my phyloseq object (10 samples, 5 ASVs)

> dput(ps_subset)
new("phyloseq", otu_table = new("otu_table", .Data = structure(c(0.0294286699620646, 
0.000143694677549144, 0.0129612599149328, 0.000258650419588458, 
0.0108633176227153, 0.0345451362548871, 0.000233413082803291, 
0.0143549045924024, 0.000116706541401646, 0.012691836377429, 
0.0117188644420356, 0.000439457416576334, 0.0129493452084493, 
0.000380863094366156, 0.00339847068819031, 0.000967702915205032, 
0.0518124269182694, 0.000846740050804403, 0.0480625781218499, 
0.000524172412402726, 0.0082013108922631, 0.000988109746055795, 
0.0110009551727545, 0.000955172754520602, 0.00260202233128026, 
0.0338230269900922, 0.000310587942976054, 0.0139764574339224, 
0.000341646737273659, 0.0123924589247445, 0.0414228879007723, 
0.00014041656915516, 0.0172712380060847, 0.00016381933068102, 
0.0139012403463609, 0.0121970663265306, 0.000637755102040816, 
0.0152264030612245, 0.00123565051020408, 0.00418526785714286, 
0.000353186707336651, 0.0500882966768342, 0.000417402472306951, 
0.0768341627869642, 0.000288970942366351, 0.00403329891584925, 
0.00622741352607124, 0.000677594217862674, 0.000903458957150232, 
0.00245224574083634), dim = c(5L, 10L), dimnames = list(c("ASV_1", 
"ASV_3", "ASV_4", "ASV_5", "ASV_6"), c("08-bulk", "08-FA", "08-LA", 
"08-OSW", "08-PW", "09-bulk", "09-FA", "09-LA", "09-OSW", "09-PW"
))), taxa_are_rows = TRUE), tax_table = new("taxonomyTable", 
    .Data = structure(c("Bacteria", "Bacteria", "Bacteria", "Bacteria", 
    "Bacteria", "Bacteroidota", "Proteobacteria", "Bacteroidota", 
    "Bacteroidota", "Proteobacteria", "Bacteroidia", "Alphaproteobacteria", 
    "Bacteroidia", "Bacteroidia", "Gammaproteobacteria", "Flavobacteriales", 
    "SAR11 clade", "Flavobacteriales", "Flavobacteriales", "Steroidobacterales", 
    "Flavobacteriaceae", "Clade I", "Flavobacteriaceae", "Flavobacteriaceae", 
    "Woeseiaceae", "Maribacter", "Clade Ia", "Aquibacter", "Polaribacter", 
    "Woeseia", NA, NA, NA, NA, NA), dim = c(5L, 7L), dimnames = list(
        c("ASV_1", "ASV_3", "ASV_4", "ASV_5", "ASV_6"), c("Kingdom", 
        "Phylum", "Class", "Order", "Family", "Genus", "Species"
        )))), sam_data = new("sample_data", .Data = list(c("08-bulk", 
"08-FA", "08-LA", "08-OSW", "08-PW", "09-bulk", "09-FA", "09-LA", 
"09-OSW", "09-PW"), c("Sediment_08", "Sediment_08", "Sediment_08", 
"Seawater_08", "Sediment_08", "Sediment_09", "Sediment_09", "Sediment_09", 
"Seawater_09", "Sediment_09"), c("Bulk", "FA", "LA", "OSW", "PW", 
"Bulk", "FA", "LA", "OSW", "PW"), structure(c(1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L), levels = "2204", class = "factor"), 
    structure(c(5L, 4L, 3L, 1L, 2L, 5L, 4L, 3L, 1L, 2L), levels = c("2204_B_OSW", 
    "2204_C_PW", "2204_D_LA", "2204_E_FA", "2204_F_Bulk"), class = "factor"), 
    c("Zymo", "Zymo", "Zymo", "Zymo", "Zymo", "Zymo", "Zymo", 
    "Zymo", "Zymo", "Zymo"), c("No", "No", "No", "No", "No", 
    "No", "No", "No", "No", "No")), names = c("Sample_name", 
"Sample_group", "Fraction", "Sampling_date", "Date_Fraction", 
"Extraction", "For_RNA"), row.names = c("08-bulk", "08-FA", "08-LA", 
"08-OSW", "08-PW", "09-bulk", "09-FA", "09-LA", "09-OSW", "09-PW"
), .S3Class = "data.frame"), phy_tree = NULL, refseq = NULL)

My sample_data looks like:


> sample_data(rel_2123_core) %>% data.frame(.) %>% tibble(.)
# A tibble: 159 × 7
   Sample_name  Sample_group Fraction Sampling_date Date_Fraction Extraction For_RNA
   <chr>        <chr>        <chr>    <fct>         <fct>         <chr>      <chr>  
 1 01-bulk4-Zy  Sediment_1   Bulk     2112          2112_F_Bulk   Zymo       No     
 2 01-FA1-Zy    Sediment_1   FA       2112          2112_E_FA     Zymo       No     
 3 01-LA1       Sediment_1   LA       2112          2112_D_LA     Powerwater No     
 4 01-OSW1      Seawater_1   OSW      2112          2112_B_OSW    Powerwater No     
 5 01-PW1       Sediment_1   PW       2112          2112_C_PW     Powerwater No     
 6 01-SSW1      Seawater_1   SSW      2112          2112_A_SSW    Powerwater No     
 7 02-bulk21-Zy Sediment_2   Bulk     2112          2112_F_Bulk   Zymo       No     
 8 02-FA11-Zy   Sediment_2   FA       2112          2112_E_FA     Zymo       No     
 9 02-LA6       Sediment_2   LA       2112          2112_D_LA     Powerwater No     
10 02-OSW6      Seawater_2   OSW      2112          2112_B_OSW    Powerwater No     
# ℹ 149 more rows
# ℹ Use `print(n = ...)` to see more rows

Thanks again for looking into this!

gmteunisse commented 3 months ago

I can't reproduce your problem. Would you mind updating restarting your R version, reinstalling fantaxtic to the latest version, and check if the issue persists?

chymoncada commented 3 months ago

Hi, that resolved the issue! Thanks a lot for your time!

gmteunisse / fantaxtic

top_taxa not properly grouping samples according to the grouping parameter #36