merge_samples into groups with varying number of samples #356

StefPN commented 10 years ago

Hi,

For several plotting functions I would like to merge my samples into my three clinical groups. Each sample is adjusted to even sequencing depth, so this is not a problem for merge_samples. However, my three groups contain 175, 224 and 256 samples, respectively, so just forming the sum of all samples per group would bias my data. How can I adjust for this?

Another issue is that my data is not normally distributed, so I would actually prefer to calculate the median rather than the mean for each group and plot that. How can I do that instead? I am sure there is some kind of R function for that as well, but I am not very experienced in R.

Thank you for your help!

Kind regards, Stef

joey711 commented 10 years ago

Hi @StefPN

It might help if you explain more about what you are trying to do. Clearly there are many analyses where you don't want to merge the samples/replicates within a clinical class, because you need that information as part of your estimation of uncertainty. If you want to plot a central value for each class, typically you would still retain the replicates but, at least in ggplot2, add a summarizing layer.

"Each sample is adjusted to even sequencing depth"... This sounds like rarefying, which is something you should not do. Ever (http://dx.plos.org/10.1371/journal.pcbi.1003531). As you noticed in this one simple application of calculating central value(s) for each sample class, having thrown away data to achieve "even sequencing depth" did not actually make your task easier, because you also have different numbers of samples per class. Even if that were not the case, or you decided to throw out even more data (samples) so that you had "even sample depth per class" -- also not a good idea -- you would still be guaranteed to have a sub-optimal result in your analysis downstream.

Okay, enough about why you should never throw away data. Please respond with a more precise answer about what it is you are trying to plot and/or calculate.

Thanks again for your feedback. It is helpful as usual, and probably a question someone else will find useful as well.

Cheers

joey

StefPN commented 10 years ago

Hi Joey,

Thanks for getting back to me! I have a total of nearly 700 samples which belong to 3 different clinical groups or 5 different "statuses". I want to create a few figures describing some general trends to give the reader an overview. But when I create a heatmap or run plot_tree, there is just too much information to interpret or, in the case of the heatmap, to see which sample is which. I would like to show some kind of summary of the five status groups to make this clearer, most likely the median, as the OTUs are not normally distributed. This is complemented by statistical tests for significance.

For the statistical purpose I have adjusted the sequencing depth by using proportions of our normalized dataset, not rarefaction. That was the best strategy we examined.

Anyway, I wonder now how one could "merge" all samples of one group and plot the median of the groups instead of all individual samples. Can I do that within the phyloseq object, or do I need to calculate the medians from the original OTU table and create a new phyloseq object based on that OTU table?

Please let me know if anything is still unclear. Thank you for your help!

Kind regards, Stef
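The second option asked about here (computing medians from the OTU table directly) can be sketched in base R. This is only a minimal illustration on made-up data, not a phyloseq feature: the matrix `otu`, the grouping vector `status`, and all values are hypothetical. It assumes taxa are rows and samples are columns.

```r
# Hypothetical toy OTU table: 3 OTUs (rows) x 5 samples (columns).
otu <- matrix(
  c(10,  0,  5, 20,  1,
    30, 40,  5,  0,  9,
    60, 60, 90, 80, 90),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:3), paste0("S", 1:5))
)

# Hypothetical grouping variable: one status label per sample.
status <- c(S1 = "A", S2 = "A", S3 = "A", S4 = "B", S5 = "B")

# Convert counts to per-sample proportions (each column sums to 1).
prop <- sweep(otu, 2, colSums(otu), "/")

# For each group, take the median of every OTU across that group's samples.
# Group size does not matter here, because the median is computed within
# each group rather than summed over it.
group_median <- sapply(
  split(colnames(prop), status[colnames(prop)]),
  function(ids) apply(prop[, ids, drop = FALSE], 1, median)
)

print(group_median)
```

Note that per-OTU medians do not have to add up to 1 within a group, so the result is a summary table rather than a composition. If the matrix needs to go back into phyloseq, it could presumably be wrapped with `otu_table(group_median, taxa_are_rows = TRUE)` together with a matching one-row-per-group sample table, but that step is an assumption and is not shown here.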

joey711 commented 10 years ago

Did you look at the merge_samples function? It is designed for this purpose, for example.

Is this issue still outstanding for you? If you did solve your problem, do you mind posting here how you did it, and/or what you might have preferred?

Also, the new app Shiny-phyloseq might be useful for you, but you'll probably want to use an ordination or network plot, rather than a heatmap or tree, to summarize your data (or merge the samples before uploading for analysis).

Thanks again for the feedback, and your interest in phyloseq.

joey711 commented 10 years ago

I will close this for now. It does seem related to a currently open issue...

https://github.com/joey711/phyloseq/issues/386

nasden commented 5 years ago

Hi,

So I have two features_hdf5.biom files (the second one is from the re-run sequences). How do I merge my two features_hdf5.biom files and replace a few samples in the first file with the corresponding samples from the second file?

Nas

MelissaUribe commented 5 years ago

Hello! I think I'm having a similar issue, although I have different purposes, I just wanna make a composition plot...

I have 4 groups, 2 of those have 5 replicates and 2 have only 4 replicates.

- I have already standardized to median sequencing depth:

    total = median(sample_sums(physeqtpf))
    standf = function(x, t=total) round(t * (x / sum(x)))
    AyB = transform_sample_counts(physeqtpf, standf)

- And transformed to relative abundances:

    AyBrelative = transform_sample_counts(AyB, function(x) x / sum(x) )

Then I use merge_samples to group the replicates in their respective groups and plot:

    AyBCode <- merge_samples(AyBrelative, "Code", fun = median)
    plot_bar(AyBCode, fill = "Phylum") + geom_bar(aes(color=Phylum, fill=Phylum), stat="identity", position="stack")

The problem is that I get a plot that has 1-5 on the y-axis: the groups with 5 replicates have bars up to 5, and the ones with 4 have bars up to 4... I know this comes from the fact that merge_samples sums the reads in the samples... but I haven't been able to figure out how to fix it so that I get a plot up to 100% for all the bars, rather than bars of dissimilar heights...

Thanks for all your help in advance
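The bar heights described above can be reproduced with a small base-R sketch. The data, the names `rel` and `code`, and the group labels are all hypothetical. It only illustrates the arithmetic: summing relative abundances within a group gives a total equal to the number of replicates, and dividing by that total again brings every group back to 1.

```r
# Hypothetical relative abundances: 3 phyla (rows) x 9 replicates (columns).
# Every column sums to 1, as after the x / sum(x) transformation.
rel <- cbind(
  matrix(c(0.5, 0.3, 0.2), nrow = 3, ncol = 5),  # group with 5 replicates
  matrix(c(0.6, 0.3, 0.1), nrow = 3, ncol = 4)   # group with 4 replicates
)
rownames(rel) <- c("PhylumA", "PhylumB", "PhylumC")
colnames(rel) <- paste0("R", 1:9)
code <- rep(c("G5", "G4"), times = c(5, 4))

# A sum-based merge: add up the replicates of each group.
# rowsum() groups rows, so transpose to put samples in rows and back again.
merged_sum <- t(rowsum(t(rel), group = code))
print(colSums(merged_sum))  # totals equal the replicate counts (4 and 5)

# Re-normalize each merged group so its column sums to 1 again.
# With equal-depth proportions this is the per-group mean composition.
merged_rel <- sweep(merged_sum, 2, colSums(merged_sum), "/")
print(colSums(merged_rel))

# In phyloseq, the analogous step after merge_samples() would presumably be:
#   transform_sample_counts(AyBCode, function(x) x / sum(x))
```

Two caveats on this sketch. The re-normalized values are group means, not medians; as far as the merge_samples documentation describes it, the `fun` argument is applied to the sample variables while the abundances themselves are summed, so `fun = median` would not produce median abundances. A true per-group median would need to be computed from the abundance table directly.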

brenzink commented 4 years ago

I have exactly the same problem. Did you find an answer?
