vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

rrarefy keeping unrarefied results for samples with greater depth than the subsample #535

Closed: TJrogers86 closed this issue 1 year ago

TJrogers86 commented 1 year ago

Hello developers! I'm having a strange issue with rrarefy that I can't seem to figure out. At first, I thought it was a problem on my end. I have a count table that is 77x10493 (samples are row names and vContigs are column names). Using rowSums(), I find that the lowest read depth is 5892632 (sample ARS). So, when using avgdist() I set sample=5000000. However, I get this warning message: "The following sampling units were removed because they were below sampling depth: CZF, CIF, RCF". But their read depths are 2517543402, 3066348209, and 303772148, respectively. I know rrarefy is used in avgdist, so I tested rrarefy with the same cutoff. Since rrarefy returns the unrarefied counts when a sample's depth is lower than the selected cutoff, I looked at the rarefied output: all of my samples had been rarefied except for CZF, CIF, and RCF. Am I doing something wrong?

gavinsimpson commented 1 year ago

It's impossible to tell, as you haven't provided the code you ran or the specific errors raised. rrarefy has this warning: https://github.com/vegandevs/vegan/blob/e5f92511df951ac6c13e96436e4fa15c5e0c75d9/R/rrarefy.R#L19-L20

Did you get that warning when you called rrarefy()?

That said, this line in avgdist() is almost surely wrong: https://github.com/vegandevs/vegan/blob/e5f92511df951ac6c13e96436e4fa15c5e0c75d9/R/avgdist.R#L20

The check should be more like

inputcast <- inputcast[rowSums(inputcast) >= sample, ]
TJrogers86 commented 1 year ago

Thank you for the quick response, and sorry for not providing my code. Hopefully this isn't too long-winded: my OTU count table is called "abun_table2". Before I transpose it, it is 10493x77 (samples are column names and vContigs are row names). Here is an image of the count table:

[screenshot of the count table omitted]

Before I run it through avgdist or rrarefy, I modify it with the following code:

library(dplyr)
library(tidyr)
library(tibble)

for_rare = as.data.frame(abun_table2) %>%
    select(-DS, -ExtractionBlank) %>%
    t %>%
    data.frame %>%
    rownames_to_column(var = "Sites") %>%
    pivot_longer(-Sites) %>%
    group_by(Sites) %>%
    dplyr::mutate(total = sum(value)) %>%
    filter(total > 5000000) %>%
    arrange(total) %>%
    group_by(name) %>%
    dplyr::mutate(total = sum(value)) %>% # At this step we have 10,493 OVUs
    filter(total != 0) %>%                # At this step we have 10,462; we have lost 31 OVUs
    ungroup %>%
    select(-total) %>%
    pivot_wider(id_cols = Sites) %>%      # names_from/values_from default to name/value
    column_to_rownames(var = "Sites") %>%
    as.matrix

The output is in the form I mentioned above, now 77x10,462 (samples are row names and vContigs are column names). Then I run the following:

For rrarefy:

rarefied_data <- rrarefy(for_rare, 5000000) 

For avgdist:

distmatrix <- avgdist(for_rare, sample = 5000000,
                        dmethod = "jaccard")

To answer your question: yes, I did get the rrarefy warning. I will create my own version of avgdist with the correction you suggested and see what the results are. However, in this case I believe the issue is with the rrarefy step, either on my end (user error) or maybe my data frame is too large for rrarefy to function properly (7.4 MB)? Thanks again for your help!

jarioksa commented 1 year ago

Your "depths" seem to be larger than R integer maximum

> x <- c(2517543402, 3066348209, 303772148)
> x > .Machine$integer.max
[1]  TRUE  TRUE FALSE
> .Machine$integer.max
[1] 2147483647

This depends on the individual terms (values of cells) rather than the total, though. You can see whether you get the integer range error by calling the C code in rrarefy directly for a single row of your data:

> .Call(vegan:::do_rrarefy, x, 10)
[1]  0  0 10
Warning message:
NAs introduced by coercion to integer range 

If you get the integer range error, then your numbers are too large for the C code.
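
A quick way to check the whole matrix directly (a sketch; for_rare here stands in for the community matrix from earlier in the thread):

## Sketch: flag individual cells beyond the integer range, since the
## coercion to integer happens per value, not per row total.
any(for_rare > .Machine$integer.max)                    # TRUE means trouble
which(for_rare > .Machine$integer.max, arr.ind = TRUE)  # offending cells, if any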

TJrogers86 commented 1 year ago
.Call(vegan:::do_rrarefy, x, 10)

I am assuming here that the 'x' is my column? I'm sorry, I'm not familiar with calling the C code.

Never mind: I read your response more carefully this time, and I see where you created the 'x' variable. I understand now. I'll run it and see what happens.

jarioksa commented 1 year ago

@TJrogers86 if for_rare is your data, then something like:

.Call(vegan:::do_rrarefy, for_rare[3,], 10) # for row 3
TJrogers86 commented 1 year ago

@jarioksa Thanks again for the help, but I tried this on several rows, including the one with the largest rowSums, and it never threw the error.

TJrogers86 commented 1 year ago

Sorry, I meant it never threw the warning message.

TJrogers86 commented 1 year ago

Actually, I just ran it on the rowSums:

test <- data.frame(for_rare_dist_matrix)
test2 <- rowSums(test)
test3 <- sort(.Call(vegan:::do_rrarefy, test2, 10))

and this did throw the warning. I think you are correct, and I think I know how to fix my issue. Thanks again!

gavinsimpson commented 1 year ago

I suspect you shouldn't be rarefying your data like this; or, at least, expect to get some push-back from reviewers if you do rarefy data such as these.

With numbers as large as this, perhaps consider dividing them by some multiple of 10 and truncating/rounding the values to integers (if you still want to work with rarefied counts). Or look at other transformations, such as a variance stabilizing transformation or a regularized log transformation (see package DESeq2 in the Bioconductor suite for both of these), or dispweight() in vegan, as alternatives for handling the extra-Poisson dispersion that often accompanies variation in read depth / library size.
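
For the first option, a minimal sketch (the divisor 10 is arbitrary, and for_rare stands in for your inflated count matrix):

## Minimal sketch: shrink the inflated counts back into integer range
## before rarefying. The divisor and object names are assumptions.
shrunk   <- round(for_rare / 10)
stopifnot(all(shrunk <= .Machine$integer.max))
rarefied <- rrarefy(shrunk, min(rowSums(shrunk)))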

TJrogers86 commented 1 year ago

@gavinsimpson Thanks for the advice and info! These are counts that have been adjusted for contig length. To keep low counts from becoming NA, I had to multiply by a big number (10^7) and round to the closest integer; that is what made my counts as high as they were. Adjusting 10^7 down to 10^6 and rounding brought the counts to a level that can be used without issue. If I bring it down to 10^5, I'll have NAs. So, I'm hoping this is acceptable. I plan to use the rarefied data frame and distance matrix for alpha and beta diversity measures, respectively, and will use relative abundances otherwise. Any advice/suggestions are welcome. Thanks for y'all's help thus far!

jarioksa commented 1 year ago

I think random rarefaction (rrarefy) is rarely sensible – although considered a must in some fields of science – and it is only meaningful with observed counts. The idea of a random rarefaction sample is that you either include a species or you drop it, but you cannot have a "part" of a species. If you think you need to multiply your observed counts, you should first rarefy and then multiply the rarefied counts. Multiplying before rarefaction clones your data and allows you to have a part of a taxon (OTU) that you only observed once, and the higher the multiplier, the closer the species proportions in the rarefied sample become to the observed proportions.
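
As a sketch of that order of operations (counts and adj are hypothetical placeholders for the raw integer counts and the per-contig length-adjustment factors):

## Sketch: rarefy the observed integer counts first, then apply the
## adjustment. `counts` and `adj` are hypothetical placeholders.
rare     <- rrarefy(counts, min(rowSums(counts)))  # rarefy observed counts
adjusted <- sweep(rare, 2, adj, "*")               # multiply afterwards, per column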

I attach a graph of my quick demonstration using the first sampling unit of the BCI data in vegan, with 448 trees and 93 species, 31 of which have only one tree (individual). The graph shows the probability of inclusion of a species in a rarefied sample (function drarefy) of size 50. Then I multiplied both the observed counts and the sample size by different multipliers and recalculated the inclusion probabilities. For multiplier 10 there were 4480 trees, but still only the same 93 species, the rarest 31 species now each with 10 trees, and the sample size was 500. Had you collected real data of 4480 trees, there would have been more than 93 species, and the number of species occurring only once would have been about the same as now (and some of the species found only once among 448 trees would perhaps be found only once among 4480 trees – but most of the new singletons would be new species).

[attached figure: clonedrarefaction]
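
The demonstration can be roughly reconstructed like this (BCI ships with vegan; the multiplier 10 is one of the values shown in the graph):

## Rough reconstruction: inclusion probabilities for a rarefied sample of
## size 50 from the first BCI sampling unit, before and after cloning the
## data with a multiplier of 10.
library(vegan)
data(BCI)
x   <- BCI[1, ]                  # 448 trees, 93 species, 31 singletons
p1  <- drarefy(x, 50)            # inclusion probabilities, subsample size 50
p10 <- drarefy(x * 10, 50 * 10)  # counts and subsample size both multiplied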

I think that the rarefaction functions (rarefy, rrarefy, drarefy) are wrongly used in most cases where the lowest non-zero integer is larger than 1. I think I should add a test for this and issue a warning of possible problems. Not an error, because, of course, there can be cases with no singleton species. There are at least two user cases that falsely remove singletons: multiplication of observed counts, and removing the rarest species from the data. Both of these cases are unsuitable for rarefaction and need a warning. If I do this, I guess several BioC and genomics packages will get warnings.

gavinsimpson commented 1 year ago

Further to @jarioksa's answer, the contig length (which I don't understand the implications of too well as I'm not well-versed in this high-throughput world) would be something that we would normally handle via an offset term in a statistical model.

The problem we have in the multivariate world of ordination is that we can't apply an offset (newer methods are starting to become available, so-called model-based ordination, which do allow offsets, but these new methods are simply not ready for use with data such as yours with a large number of variables), so people look for alternatives.

I personally don't understand the logic behind avgdist(). Wouldn't it be better to run rrarefy() lots of times and take the mean count of each species over the set of randomly rarefied samples? Asymptotically I would expect this to converge as you increase the number of random draws. Then you could compute the dissimilarity or your ordination using those "expected" rarefied counts. At the very least it would be a heck of a lot quicker to do what avgdist() does in the way I suggest, as you don't need to compute the dissimilarity for each random rarefied sample draw.
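
As a sketch of that idea (nrep is arbitrary; for_rare and the Jaccard choice mirror the example earlier in the thread):

## Sketch: average many rrarefy() draws, then compute the dissimilarity
## once, instead of once per draw as avgdist() does.
nrep  <- 100
draws <- replicate(nrep, rrarefy(for_rare, 5000000), simplify = FALSE)
avg   <- Reduce(`+`, draws) / nrep
d     <- vegdist(avg, method = "jaccard")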

That said, I also wouldn't do that in general, or in your case. As Jari said, I would try to work with the raw counts as much as possible and then use an offset (based on the size factors and other considerations) when fitting classical statistical models to your data (e.g. GLMs or GLMMs), or a transformation of the raw count data that tries to account for the extra dispersion that arises due to these methodological considerations with high-throughput data.
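
For the offset idea, a minimal sketch (y, depth, and group are hypothetical placeholders: counts for one taxon, the library sizes, and a predictor of interest):

## Minimal sketch of the offset approach for a single taxon; quasipoisson
## absorbs some extra-Poisson dispersion. All objects are placeholders.
fit <- glm(y ~ group + offset(log(depth)), family = quasipoisson)
summary(fit)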

gavinsimpson commented 1 year ago

As for some suggestions on how to proceed, I would suggest taking a look at some of the work of Susan Holmes and colleagues:

Sankaran, K., Holmes, S.P., 2019. Latent variable modeling for the microbiome. Biostatistics 20, 599–614. https://doi.org/10.1093/biostatistics/kxy018

Jeganathan, P., Holmes, S.P., 2021. A Statistical Perspective on the Challenges in Molecular Microbial Biology. J. Agric. Biol. Environ. Stat. 26, 131–160. https://doi.org/10.1007/s13253-021-00447-1

Sankaran, K., Holmes, S.P., 2019. Multitable Methods for Microbiome Data Integration. Front. Genet. 10, 627. https://doi.org/10.3389/fgene.2019.00627

At least these would give you a better understanding of the issues (though note Susan is firmly in the "thou shalt not rarefy" camp) and give you some ideas about how statisticians are attempting to tackle the problems.

jarioksa commented 1 year ago

@gavinsimpson Species averages of randomly rarefied samples converge to observed (non-rarefied) species frequencies adjusted to sub-sample size.

gavinsimpson commented 1 year ago

@jarioksa isn't that what people want, to get counts adjusted to the sub-sample size?

gavinsimpson commented 1 year ago

Or did you mean you could just avoid the random sampling and simply use data in proportion to the observed frequencies?

jarioksa commented 1 year ago

@gavinsimpson not really: they want equalized sample sizes, and with averages they get non-rarefied numbers. For a subsample of 50 in BCI, the species averages converge to decostand(BCI, "total") * 50. I really do not understand the widespread use of random rarefaction in genomics, and therefore I may not be able to speak for the approach, but I think they are concerned about having too many rare OTUs with a large number of sequences, and want to equalize the detection probability of rare cases independently of the number of sequences (depth). The average of a large number of randomly rarefied samples is about the same as using relativized data (hiding the sample size, depth), but it keeps more rare cases (OTUs) in originally large samples (and naturally, rrarefy adds some random error, unlike relativized data).
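
That convergence is easy to check with BCI (a sketch; the 1000 draws are arbitrary):

## Sketch: the mean of many rrarefy() draws approaches
## decostand(BCI, "total") * 50 for a subsample of 50.
library(vegan)
data(BCI)
draws <- replicate(1000, rrarefy(BCI, 50), simplify = FALSE)
avg   <- Reduce(`+`, draws) / 1000
max(abs(avg - as.matrix(decostand(BCI, "total")) * 50))  # shrinks as draws grow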

jarioksa commented 1 year ago

FWIW, adding a warning on >1 minimum count in rarefy, drarefy, rrarefy and rareslope caused no regressions in reverse dependence tests of 172 CRAN packages depending on vegan. However, I didn't test BioConductor packages (too painful, in particular because they do not support Apple Silicon processors).
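
A sketch of what such a check might look like (this is not vegan's actual implementation):

## Sketch only: warn when the smallest non-zero count exceeds 1, which
## suggests multiplied counts or removed singletons.
if (min(x[x > 0]) > 1)
    warning("smallest non-zero count is > 1: data may be unsuitable for rarefaction")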

gavinsimpson commented 1 year ago

@jarioksa I see what you mean.

Did you look at that test in avgdist() which seemed suss to me? https://github.com/vegandevs/vegan/blob/e5f92511df951ac6c13e96436e4fa15c5e0c75d9/R/avgdist.R#L20

If it is suss (I was just reading code that I've never run), fixing it before the release would be good, and I can do that.

jarioksa commented 1 year ago

@gavinsimpson The main author of avgdist is @Microbiology (Geoffrey Hannigan); better to ask for his comment. On first look your fix seemed OK to me, but better to ask Hannigan.

gavinsimpson commented 1 year ago

@jarioksa OK - there are a couple of other things in that function that I want to fix so I'll do a PR with all the changes and then ask @Microbiology to comment.

TJrogers86 commented 1 year ago

Hey guys, thanks a lot for the help and suggestions. I have read some of the dislike of rarefying and was going to avoid it, but I came across Dr. Pat Schloss' YouTube channel and his videos demonstrating why he advocates for rarefying when it comes to diversity and richness analyses, and it was convincing enough to try it (I believe he was at one point the advisor of the aforementioned @Microbiology, Dr. Geoffrey Hannigan). If you're interested, I've posted links below to said videos for the rationale and demonstrations. If you have time to watch them, I'd like your thoughts on them. In any case, thanks again for the reading suggestions and for this great discussion!

To rarefy or not to rarefy: https://www.youtube.com/watch?v=ht3AX5uZTTQ

Rarefy vs Normalize: https://www.youtube.com/watch?v=t5qXPIS-ECU&t=531s

Minimizing the impact of sampling effort on Bray-Curtis: https://www.youtube.com/watch?v=6TjzjClQsOg

Microbiology commented 1 year ago

Thanks all. I just went over to the PR and it looks good from my end.

gavinsimpson commented 1 year ago

@TJrogers86 I would say that in many of these discussions about rarefying counts, the discussants are talking past one another. Pat's videos on this topic (IIRC; I haven't watched all of his YouTube channel) focus on the impact of rarefying on computing dissimilarities.

Many people use rrarefy() for way more than that though - like using rrarefy() followed by a GLM or GLMM, which would be extremely bad.

That said, just because doing the avgdist() thing works in Pat's example(s) doesn't mean it will work with your data or in all other data sets. Part of what concerns more statistically minded individuals in these discussions is knowing where a method breaks, not that it works in one or more specific cases. There are plenty of examples where things break, but, for obvious reasons, we haven't looked at all possible data sets or all possible use-cases as yet...

Personally, as someone who has only relatively recently started to be dragged into working with data like this, I'm not keen on rarefying counts via random draws using rrarefy(), in part because I prefer to use RDA on transformed data rather than distance-based methods (although dbrda() is proving useful too), and you can't (easily) compute a whole pile of RDA ordination models, one per random rrarefy() draw, and then "average" them.