Closed dinhe878 closed 7 months ago
The error is indicative of a data that are not suitable for Fisher α. It is possible to change the function so that it does not return an error, but gives a number instead. However, I am not sure that this number makes sense. I will need some time to think what to do. Either I change the function to always return numbers, or check input data and refuse to handle, say 1-species communities. (Fitting a Fisher abundance distribution model to a community of one species really makes no sense). Your example data contained several such cases, in particular with your two column data. No species (both 0) were handled OK, but only one species was really misleading, and the returned numbers were useless.
Thanks @jarioksa for the explanation. However, I am pretty sure that all samples of my data (the example is just a tiny portion of it) contains more than more taxa (ASVs). For example, the column (sample) 22 in my data has 451 taxa:
> sum(as.vector(otu_table(GAGA_ampSeq.physeq.rarefy)[,22]) != 0)
[1] 451
And the summary of my data:
> GAGA_ampSeq.physeq.rarefy
phyloseq-class experiment-level object
otu_table() OTU Table: [ 14889 taxa and 237 samples ]
sample_data() Sample Data: [ 237 samples by 5 sample variables ]
tax_table() Taxonomy Table: [ 14889 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]
I found a relevant post that reported the same issue: joey711/phyloseq#1009
The diversity is calculated for rows, not for columns. So it is the first index (row) that you need to inspect, not the second (column). You see this also from the output: your first index has name e80983a81553e91291e7c4713fe4c0a9
which is a row name. When you calculate diversity with indices vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24])
, you ask diversity for ten rows (1:10
), each index based on two species 23:24
, and many of these rows are double zeros. Similarly, also in your first example, you only have five species 20:24
, and for the first row, these all have zero values. So you need to check the data by rows.
When you calculate diversity with indices
vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24])
, you ask diversity for ten rows (1:10
), each index based on two species23:24
, and many of these rows are double zeros. Similarly, also in your first example, you only have five species20:24
, and for the first row, these all have zero values. So you need to check the data by rows.
Ok thanks. But then what puzzles me still is that the issue only rose with
vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,21:22])
and
vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,22:23])
but not
vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,20:21])
and
vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24])
Maybe I didn't understand correctly....so if the problem was caused by "all-zero" rows then all of above examples should give the same errors as they all contains "all-zero" rows, right?
And I checked my data, no single ASV is "all-zero" across all samples (i.e. identical taxa count before/after prune_taxa):
> no_empty_ASVs <- prune_taxa(taxa_sums(GAGA_ampSeq.physeq.rarefy)>0,GAGA_ampSeq.physeq.rarefy)
> GAGA_ampSeq.physeq.rarefy **(before prune_taxa)**
phyloseq-class experiment-level object
**otu_table() OTU Table: [ 14889 taxa and 237 samples ]**
sample_data() Sample Data: [ 237 samples by 5 sample variables ]
tax_table() Taxonomy Table: [ 14889 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]
> no_empty_ASVs
phyloseq-class experiment-level object **(after prune_taxa)**
**otu_table() OTU Table: [ 14889 taxa and 237 samples ]**
sample_data() Sample Data: [ 237 samples by 5 sample variables ]
tax_table() Taxonomy Table: [ 14889 taxa by 7 taxonomic ranks ]
phy_tree() Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]
I was not writing about error, but about inability to calculate diversity for empty rows. The error was a good thing because it alerted about the problems, and I have now changed the function in github to that effect (not an error, but giving NA).
Fisher α is based on species abundance distribution and you cannot have abundance distribution if you have no species or one species. The equations were not valid for these situations, but they could give a number looking like a result. That number had little do with diversity, but it just appeared when the equations were used outside their valid domain. If you had an empty row (all zeros), the result was 1: no error, but also no sense. If you only had one species, you could get a number, and the numeric value depended on the abundance of that one species. However, if that one non-zero integer was 6 or higher, you got an error. I could have changed the function so that you also get a number for these cases (I tested and it worked), but these numbers make little sense, and therefore I instead disabled calculation (or return NA) when there is only one species.
I have been aware of these problems for a long time, but I have assumed that the users of these functions know what they do and have not prevented obvious misapplications. It seems that in particular in phyloseq these functions are used in cavalier way, and the next version of vegan will try to prevent the misuse.
Thank you @jarioksa!
Just to mention that after upgrading the package, for unknown (yet) reason that my full dataset still produced the same error message, although the example above now produces expected results.
At this point I'll just tentatively skip fisher alpha estimation and work with other indices.
Thanks a again for your help!
The next version was not released yet (and won't be released for a long time). If you want to get the bleeding edge version prior to its release, you should download vegan from github.
I thought that's what I did:
devtools::install_github("vegandevs/vegan")
> packageVersion("vegan")
[1] ‘2.6.0’
Or this is still the stable version?
I didn't build any full proof protection against failing estimation of Fisher α, but just blocked two most obvious bad cases: no species or one species. It is possible to have communities with two or more species that are so different from the Fisher abundance distribution model that estimation fails. For instance fisher.alpha(c(40,42))
fails. Fisher α is based on function fisherfit
that tries to fit the Fisher abundance model to the observed abundance distribution and returns the single model parameter α as the diversity index. If the Fisher abundance model is absolutely unsuitable, an error may result. Even if there is no error, the estimate can be very unreliable. For instance, would you think that the fitted line and its single parameter α = 0.372 is a good descriptor of the data in the attached graph? Disabling calculation in the worst cases makes sense, but it is hard to decide where to put the limit of worst cases (0 and 1 species are always wrong, but two species can fit nicely, although never with high reliability). However, the index should not be used light heartedly, but the user should know what they want and what they do, or they end up shooting themselves in the foot. I think phyloseq should not calculate Fisher α unless specifically requested.
Thanks a lot @jarioksa for the explanation! It would be great if the error/warning messages could provide more information. As it is for now the error indicates more like an "technical error" rather than, say a warning "bad model fitting, i.e. fisher estimation may be unreliable". It might also make other packages easier to handle so any following steps won't be blocked because of a "appears to be" error raised.
Hi,
I am using vegan::fisher.alpha(). My test data is
I tested three "2-column" subsets:
For some reason, there is an issue from column (sample) 22, but I am not quite sure what is causing this problem.
Any help is very much appreciated!
Best, Ding