mothur / mothur.github.io

wiki for the mothur software package
https://mothur.github.io
Creative Commons Attribution 4.0 International

Question on: summary.single #84

Closed by nnthule 6 months ago

nnthule commented 2 years ago

I have two questions, one about improving the coverage and one about further analyses on the alpha diversity indices.

1) The root of my problem is that most of my samples have 4,000 to 7,000 good reads after cleaning, but one sample has only 1,500. As a result, when I run summary.single with subsample = T, the coverage is only 0.7 for all samples (even the one with the lowest count). Would this significantly impair the quality of my further analyses? a) I tried increasing iters to 25000, but the coverage stayed the same. Is there a way to increase the coverage? Should I have used withreplacement = True?

2) I am wondering how to interpret the standard deviations (std) in the resulting file (ave-std.summary). In my dataset, each group has 5 biological replicates, so I'm going to average the Shannon indices of the 5 replicates for each group. Should I then calculate a new standard deviation for each group from the 5 replicates' Shannon values only, or should I propagate the std from ave-std.summary by linear transformation? The propagated std would be smaller than the newly calculated std, but I'm afraid it does not really capture the variation between the biological replicates. a) If I calculate a new std from the 5 replicates' values, that would mean I have only 5 data points for each group. To find out whether the alpha diversity indices of the groups are significantly different, should I then use the Wilcoxon rank sum test (since the sample size is small)? Or should I still use a t-test, since each replicate's Shannon index is already an average value?
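For readers following along, here is a minimal Python sketch of the first option described above: a per-group mean and standard deviation computed from the five replicates' Shannon values alone, followed by an exact two-sided rank sum test. The group names and all numbers are hypothetical and were made up for illustration, the helper names `group_summary` and `exact_rank_sum_p` are not part of mothur, and the rank sum helper assumes there are no tied values. It shows the mechanics only and is not a recommendation of one test over another.

```python
import statistics
from itertools import combinations

# Hypothetical per-replicate Shannon values (the "ave" column for each
# sample); these numbers are invented for illustration only.
shannon_by_group = {
    "control": [3.10, 3.25, 2.98, 3.40, 3.18],
    "treated": [3.62, 3.55, 3.71, 3.47, 3.80],
}


def group_summary(values):
    """Return (mean, sample standard deviation) across biological replicates."""
    return statistics.mean(values), statistics.stdev(values)


def exact_rank_sum_p(a, b):
    """Exact two-sided p-value for the rank sum of group a (assumes no ties).

    Enumerates every way of assigning len(a) of the pooled ranks to group a
    and counts assignments whose rank sum is at least as far from its
    expected value as the observed one.
    """
    pooled = sorted(a + b)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    observed = sum(rank[value] for value in a)
    expected = len(a) * (len(pooled) + 1) / 2
    extreme = 0
    total = 0
    for combo in combinations(range(1, len(pooled) + 1), len(a)):
        total += 1
        if abs(sum(combo) - expected) >= abs(observed - expected):
            extreme += 1
    return extreme / total


for name, values in shannon_by_group.items():
    mean, sd = group_summary(values)
    print(f"{name}: mean={mean:.3f} sd={sd:.3f} n={len(values)}")

p_value = exact_rank_sum_p(
    shannon_by_group["control"], shannon_by_group["treated"]
)
print(f"exact rank sum p-value: {p_value:.4f}")
```

With 5 replicates per group there are only 252 possible rank assignments, so the smallest two-sided p-value this test can return is 2/252.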

pschloss commented 2 years ago

Hi there -

  1. Short of generating more sequence data or dropping the poorly sequenced samples, there's no way to improve the coverage. The iters parameter is the number of subsamplings that are performed and then averaged to produce the results. Increasing that value will increase the precision of your estimates. withreplacement should probably always be set to FALSE.
  2. I always toss the std values. Those values are the standard deviation across the subsamplings.
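The two points above can be illustrated with a small simulation. This is a rough Python sketch, not mothur's implementation: it repeatedly subsamples a made-up sample to a fixed depth without replacement, computes Good's coverage (1 minus singleton OTUs divided by reads) and the Shannon index for each subsample, and reports the average and standard deviation across subsamplings. The community, its size, and the function names are all assumptions chosen for illustration.

```python
import math
import random
import statistics
from collections import Counter


def goods_coverage(otu_counts):
    """Good's coverage: 1 - (number of singleton OTUs / number of reads)."""
    total = sum(otu_counts)
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total


def shannon(otu_counts):
    """Shannon index using the natural logarithm."""
    total = sum(otu_counts)
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


def make_sample(n_reads=5000, n_otus=2000, seed=1):
    """Hypothetical sample: each read is an OTU label drawn from a skewed
    (1/rank) abundance distribution. Purely illustrative."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_otus)]
    return rng.choices(range(n_otus), weights=weights, k=n_reads)


def subsample_summary(reads, depth, iters, seed=0):
    """Average and std of coverage and Shannon across `iters` subsamplings."""
    rng = random.Random(seed)
    coverage_values, shannon_values = [], []
    for _ in range(iters):
        picked = rng.sample(reads, depth)  # sampling without replacement
        counts = list(Counter(picked).values())
        coverage_values.append(goods_coverage(counts))
        shannon_values.append(shannon(counts))
    return {
        "coverage_ave": statistics.mean(coverage_values),
        "coverage_std": statistics.stdev(coverage_values),
        "shannon_ave": statistics.mean(shannon_values),
        "shannon_std": statistics.stdev(shannon_values),
    }


reads = make_sample()
few = subsample_summary(reads, depth=1500, iters=20)
many = subsample_summary(reads, depth=1500, iters=500)
print("iters=20 :", few)
print("iters=500:", many)
```

In this sketch the average coverage is set by the subsampling depth, so running more iterations only makes the average more stable. The std columns describe variation between subsamplings of one sample, which is a different quantity from variation between biological replicates.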