Prevalence subsets have different reads than original phyloseq

alopgar commented 5 years ago

Hi! I tried PIME package using this pipeline:

pime.oob.error(phyloseq, "phenotype")
phy_BrDi <- pime.split.by.variable(phyloseq, "phenotype")
prev_BrDi <- pime.prevalence(phy_BrDi)
set.seed(30)
prev_BrDi_best <- pime.best.prevalence(prev_BrDi, "phenotype")
phyloseq_filtered <- prev_BrDi$`65`

But I faced with some questions:

If I understood correctly, a PIME prevalence interval of 65% means that the subset includes all OTUs with non-zero counts in a minimum of 65% of samples (i.e., remove OTUs with more than 35% of zeros).
If this is true, the prevalence 65% subset retains more OTUs (in my case, 1203 of 1246) than my manually implemented filtering (which removes OTUs with more than 35% zero-counts) (1099 of 1246 OTUs).

## My manual prevalence filter:
non_zero_counts <- apply(otu_table(phyloseq)@.Data, 1, function(c)sum(c!=0))
min_preval <- 0.65
bad_rows <- which(non_zero_counts <= (ncol(otu_table(phyloseq)@.Data) * min_preval))
data_filtered <- otu_table(phyloseq)@.Data[-(bad_rows),]

I checked some of the OTUs retained by the 65% subset and I found that they have a prevalence lower than 65% (for example, one OTU has 19 of 47 non-zero counts, which makes a 40%, and another has 7 of 47 non-zero counts, making a 15%).
Also, PIME pipeline replaced some of these OTUs original counts by zeros:

Sample	original	filtered	Sample	original	filtered
X02_1	0	0	X60_1	11	11
X03_1	0	0	X61_1	36	0
X04_1	0	0	X62_1	0	0
X05_1	0	0	X63_1	63	63
X06_1	5	5	X64_1	56	56
X07_1	18	18	X65_1	0	0
X08_1	0	0	X66_1	17	0
X09_1	0	0	X67_1	61	0
X10_1	15	0	X68_1	5	0
X11_1	17	17	X69_1	7	7
X12_1	0	0	X70_1	18	18
X13_1	0	0	X71_1	3	3
X14_1	3	3	X72_1	0	0
X15_1	2	0	X73_1	7	7
X16_1	6	6	X74_1	13	0
X17_1	10	10	X75_1	7	0
X18_1	19	0	X76_1	0	0
X19_1	0	0	X77_1	0	0
X20_1	0	0	X78_1	70	70
X54_1	87	87	X79_1	0	0
X56_1	31	0	X81_1	2	2
X57_1	4	4	X83_1	0	0
X58_1	11	11	X85_1	20	20
X59_1	0	0

You will notice that in the original dataset, these OTUs had more non-zero values, although its prevalence still remains lower than 65%.

Any clues to what is happening here? Thanks a lot!

Adrián

microEcology commented 5 years ago

Hi Adrián,

I'm sorry for not answering you sooner. This is expected with PIME. When PIME filters by prevalence, it first split the data set based on the "phenotype" group, used in pime.split.by.variable(), and then it performs the prevalence filtering for each group independently. With this you can have more OTUs kept with pime, since the number of samples are smaller (split by phenotype ) then when using prevalence overall. For example, if you have a total of 200 samples, an OTU have to be present in at least 130 samples to be kept, for 65% prevalence. If you split by "phenotype" and have one group with 50 samples, an OTU must be present in 32 samples, for the same 65% prevalence. Also, when splitting by groups it's expected to have a more homogeneous sample group, which increases the chance of keeping more OTUs for the same 65% prevalence.

Hope this helps, Priscila

On Thu, Oct 3, 2019 at 7:23 AM alopgar notifications@github.com wrote:

Hi! I tried PIME package using this pipeline:

pime.oob.error(phyloseq, "phenotype")

phy_BrDi <- pime.split.by.variable(phyloseq, "phenotype")

prev_BrDi <- pime.prevalence(phy_BrDi)

set.seed(30)

prev_BrDi_best <- pime.best.prevalence(prev_BrDi, "phenotype")

phyloseq_filtered <- prev_BrDi$65

But I faced with some questions:

If I understood correctly, a PIME prevalence interval of 65% means that the subset includes all OTUs with non-zero counts in a minimum of 65% of samples (i.e., remove OTUs with more than 35% of zeros).

If this is true, the prevalence 65% subset retains more OTUs (in my case, 1203 of 1246) than my manually implemented filtering (which removes OTUs with more than 35% zero-counts) (1099 of 1246 OTUs).

My manual prevalence filter:

non_zero_counts <- apply(otu_table(phyloseq)@.Data, 1, function(c)sum(c!=0))

min_preval <- 0.65

bad_rows <- which(non_zero_counts <= (ncol(otu_table(phyloseq)@.Data) * min_preval))

data_filtered <- otu_table(phyloseq)@.Data[-(bad_rows),]

I checked some of the OTUs retained by the 65% subset and I found that they have a prevalence lower than 65% (for example, one OTU has 19 of 47 non-zero counts, which makes a 40%, and another has 7 of 47 non-zero counts, making a 15%).

Also, PIME pipeline replaced some of these OTUs original counts by zeros:

Sample original filtered Sample original filtered X02_1 0 0 X60_1 11 11 X03_1 0 0 X61_1 36 0 X04_1 0 0 X62_1 0 0 X05_1 0 0 X63_1 63 63 X06_1 5 5 X64_1 56 56 X07_1 18 18 X65_1 0 0 X08_1 0 0 X66_1 17 0 X09_1 0 0 X67_1 61 0 X10_1 15 0 X68_1 5 0 X11_1 17 17 X69_1 7 7 X12_1 0 0 X70_1 18 18 X13_1 0 0 X71_1 3 3 X14_1 3 3 X72_1 0 0 X15_1 2 0 X73_1 7 7 X16_1 6 6 X74_1 13 0 X17_1 10 10 X75_1 7 0 X18_1 19 0 X76_1 0 0 X19_1 0 0 X77_1 0 0 X20_1 0 0 X78_1 70 70 X54_1 87 87 X79_1 0 0 X56_1 31 0 X81_1 2 2 X57_1 4 4 X83_1 0 0 X58_1 11 11 X85_1 20 20 X59_1 0 0

You will notice that in the original dataset, these OTUs had more non-zero values, although its prevalence still remains lower than 65%.

Any clues to what is happening here? Thanks a lot!

Adrián

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/microEcology/pime/issues/2?email_source=notifications&email_token=AKSTCIXFXHZ5KCJNJOZC3LDQMXB3TA5CNFSM4I5BGB3KYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HPLTECQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AKSTCIRZMCHQ2A74RKDUTCLQMXB3TANCNFSM4I5BGB3A .

alopgar commented 4 years ago

Thank you for your answer. I still have some doubts about this.

According to your answer, I understand that the "phenotype" splitting allows keeping more OTUs because it checks the minimum prevalence separately, and keeps OTUs with high prevalence in AT LEAST one of the phenotype groups. Am I correctly assuming this? If not, I wonder how the final OTU selection works if dataset is firstly splitted and then joined to perform the Random Forest (i.e., how the Random Forest input table is composed).
It still remains my question about the forced replacement of some OTU counts to zero value. In the previous example I pasted a unique OTU values across my samples, its original pre-filtered number of counts and the same after PIME performance. ¿Is it a consequence of phenotype splitting also? If so, ¿how is it working exactly?

Thank you very much again. Adrián

vpylro commented 3 years ago

Zero OTU counts are expected after PIME filtering. This is the core computation performed by PIME through pime.prevalence script. The main objective of PIME is to filter OTUs by prevalence. Ordinary prevalence filtering is applied to the entire dataset but PIME applies the filtering according to previously defined groups using pime.split.by.variable script.

Suppose you defined 65% as your prevalence cutoff and you have an experimental design with two treatments. Prior to PIME filtering you have an OTU present in both treatments but after PIME filtering this OTU appears as zero in treatment A and 200 in treatment B. That means the prevalence in treatment A is lower than 65% but in treatment B the prevalence is equal or higher than 65%. PIME will remove reads from treatments where the prevalence is lower than the cutoff and keep sequences where the prevalence is higher or equal to the cutoff.

This way PIME allows the user to look at the dataset under a new perspective where not only presence/absence or abundance of sequences are taken into account. PIME considers whether or not an OTU is present in only a couple of subjects or is present in most of the population. This is important because it solves the problem of data sparsity. Most statistics tools do not handle sparsity very well. However, by using PIME the user must realize that the results are attached to the prevalence concept and this has implications in the interpretation of the final dataset. For instance, ordinary alpha and beta diversity are not recommended. Differential abundance analysis is also not recommended. The best way to find out the OTUs responsible for the differences between treatments is by using Randon Forests implemented in pime.best.prevalence.

jarrodscott commented 3 years ago

Thank you very much for the clear explanation @vpylro. I realized this after I posted my question (which is why I deleted the question :) but it is good to hear you recommendations--esp. about diversity estimates.

vpylro commented 3 years ago

Thank you very much for the clear explanation @vpylro. I realized this after I posted my question (which is why I deleted the question :) but it is good to hear you recommendations--esp. about diversity estimates.

You're very welcome!

jarrodscott commented 3 years ago

Hi @vpylro one quick question on this subject. You mentioned "The best way to find out the OTUs responsible for the differences between treatments is by using Randon Forests implemented in pime.best.prevalence." If I run the following:

ssu_best.prev <- pime.best.prevalence(ssu_prevalences, "TREAT")
ssu_best.prev$`OOB error`

Say the OOB error rate (%) goes to 0 at 65%, then I run

`ssu_best.prev$Importance$`Prevalence 65`

I get a data frame of SequenceID by samples. Is this what you are referring to when you say determine the OTUs responsible for the differences between treatments? Thanks again :) best Jarrod

vpylro commented 3 years ago

Yes, by running this line you get the table with OTU/ASV importance at 65% prevalence. Look at the values of mean decrease accuracy. High values of mean decrease accuracy indicates the importance of taxa to differentiate two or more microbial communities. Best

jarrodscott commented 3 years ago

hello @vpylro

apologies, but I have yet another related question. I was rereading the original PIME publication and was hoping you could clarify something.

On page 427 of the publication it states:

Taking these considerations into account, pime is not comparable to other methods designed to perform microbial differential abundance analysis. However, any other tool can be applied after obtaining the prevalence-filtered data set by pime. For example, the filtered data set can be analyzed using the DESeq2 algorithm (Love et al., 2014) to identify taxa that differ between the saliva microbiome from High and Low KIDMED index (Supporting Information 3).

Yet, earlier in this thread you mentioned that:

Differential abundance analysis is also not recommended.

Could you please clarify? The way I read the passage in the paper (and the code in the Supporting info) is that differential can be performed on a PIME filtered data set however you are saying this is not recommended.

In the same post you also stated:

However, by using PIME the user must realize that the results are attached to the prevalence concept and this has implications in the interpretation of the final dataset. For instance, ordinary alpha and beta diversity are not recommended.

In my mind this is a related issue. If one can perform differential abundance on a data set then that data set should also be appropriate for beta diversity analysis?

So is differential abundance analysis appropriate with a PIME filtered data set? And if so, why is it not ok to use the PIME filtered data for beta diversity estimates?

Thanks again :)

microEcology / pime

Prevalence subsets have different reads than original phyloseq #2

My manual prevalence filter: