joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
584 stars 187 forks source link

probable bug plot_frequency #1489

Closed wsteenhu closed 3 years ago

wsteenhu commented 3 years ago

Recently I have been exploring the plot_frequency()-function in a bit more depth and I found two issues:

  1. Sometimes the function yields an error when the neg-parameter is set (this does not occur with neg = NULL (default)).

Example (based on tutorial):

ps <- readRDS(system.file("extdata", "MUClite.rds", package="decontam"))

sample_data(ps)$is.neg <- sample_data(ps)$Sample_or_Control == "Control Sample"

plot_frequency(ps, taxa_names(ps)[c(1, 3)], conc = "quant_reading", neg = "is.neg")
# Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
# NA/NaN/Inf in 'y'

I found out this has to do with a few lines of code in the plot_frequency()-function, namely:

...
    for (tax in names(mod_melts)) {
                newdata <- data.frame(logc = logc, taxa = tax, DNA_conc = exp(logc))
                freq <- mod_melts[[tax]]$taxon_abundance
                conc <- mod_melts[[tax]]$DNA_conc
                df <- data.frame(logc = log(conc), logf = log(freq))
                df <- df[!neg | is.na(neg), ]
                df <- df[freq > 0, ]
...

In essence, the freq-object is similar in length to the number of rows in mod_melts (reflecting all samples/controls). This is also the basis for df. However, the following two lines puzzle me, as first the negative controls are excluded from df (reducing the number of rows), after which the - original length - freq-object is used for further selection (freq > 0). These dimensions do not match and will introduce NA-rows, resulting in the error above. I figured this does not happen when samples in the original phyloseq are ordered in such a way that all samples are followed by all controls.

To test this idea I reordered the original phyloseq:

ps_reorder <- ps
otu_table(ps_reorder) <- otu_table(ps_reorder)[order(sample_data(ps_reorder)$is.neg),] # controls last

plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading", neg = "is.neg")
# no error

Schermafbeelding 2021-07-28 om 12 29 42

  1. Following up on 1., I found that setting the neg-parameter results in model behaviour that deviates from the idea of is.Contaminant. As far as I understood from the paper, the frequency-method uses all density information regardless whether a sample is a biological sample/control sample. However, when setting the neg-parameter, control samples are excluded from the model (e.g. the red line only represents a fit between DNA concentration and frequency in biological samples). This is not the case when not setting the neg-parameter.

Example (continuing from above):

p1 <- plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading", neg = "is.neg")
# Does not include blanks for models; does not align with is.Contaminant()-function

p2 <- plot_frequency(ps_reorder, taxa_names(ps_reorder)[c(1, 2)], conc = "quant_reading")
# Does include blanks for models

cowplot::plot_grid(p1, p2, ncol = 1, align = "v", axis = "lr")

Schermafbeelding 2021-07-28 om 12 25 59

Compare the red line between upper (neg-parameter set) and lower (neg-parameter not set) plots. It is slightly different, showing that only in the case of p2 the blanks were included in the modelling (as I believe should be the case, given the idea of is.Contaminant()).

Curious to hear your thoughts on this and happy to help adjust if needed.

wsteenhu commented 3 years ago

Apologies, should have posted on the decontam-Github.