fisher.alpha issue - Githubissues

dinhe878 commented 3 years ago

Hi,

I am using vegan::fisher.alpha(). My test data is

> otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,20:24]
OTU Table:          [10 taxa and 5 samples]
                     taxa are rows
                                 Aenictus_ceylonicus_422_2 Aenictus_ceylonicus_422_3
e80983a81553e91291e7c4713fe4c0a9                         0                         0
918b76e7c690d758f4b6468ce64402b5                         0                         0
05fa7b1e4fdc0722a0d4870452195cc0                         0                         0
36a895ff5d15139c829a80473629e636                         2                         0
bde998bfe8216396964d3722470ded9c                         0                         0
f51bc0209843da239ebb73eebf50e48d                         0                         0
968953f16bc08eff934d3b7c12e0e986                         0                         0
1f5cc7a9bad94196d3fb901567ca5855                         0                         0
e7fc2d63b8792cb23aaead3d740e3832                         0                         0
e49b88a31b09a87049c14d10976a1c5e                         0                         0
                                 Allomerus_decemarticulatus_37_3 Anoplolepis_gracilipes_423_1
e80983a81553e91291e7c4713fe4c0a9                               0                            0
918b76e7c690d758f4b6468ce64402b5                               0                            0
05fa7b1e4fdc0722a0d4870452195cc0                               0                            0
36a895ff5d15139c829a80473629e636                               0                            0
bde998bfe8216396964d3722470ded9c                               0                            0
f51bc0209843da239ebb73eebf50e48d                               6                            0
968953f16bc08eff934d3b7c12e0e986                               0                            0
1f5cc7a9bad94196d3fb901567ca5855                               0                            0
e7fc2d63b8792cb23aaead3d740e3832                               0                            0
e49b88a31b09a87049c14d10976a1c5e                               0                            0
                                 Anoplolepis_gracilipes_423_2
e80983a81553e91291e7c4713fe4c0a9                            0
918b76e7c690d758f4b6468ce64402b5                            0
05fa7b1e4fdc0722a0d4870452195cc0                            0
36a895ff5d15139c829a80473629e636                            0
bde998bfe8216396964d3722470ded9c                            0
f51bc0209843da239ebb73eebf50e48d                            0
968953f16bc08eff934d3b7c12e0e986                            0
1f5cc7a9bad94196d3fb901567ca5855                            0
e7fc2d63b8792cb23aaead3d740e3832                            0
e49b88a31b09a87049c14d10976a1c5e                            0

I tested three "2-column" subsets:

> vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,20:21])
e80983a81553e91291e7c4713fe4c0a9 918b76e7c690d758f4b6468ce64402b5 05fa7b1e4fdc0722a0d4870452195cc0 
                       1.0000000                        1.0000000                        1.0000000 
36a895ff5d15139c829a80473629e636 bde998bfe8216396964d3722470ded9c f51bc0209843da239ebb73eebf50e48d 
                       0.7959049                        1.0000000                        1.0000000 
968953f16bc08eff934d3b7c12e0e986 1f5cc7a9bad94196d3fb901567ca5855 e7fc2d63b8792cb23aaead3d740e3832 
                       1.0000000                        1.0000000                        1.0000000 
e49b88a31b09a87049c14d10976a1c5e 
                       1.0000000

> vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,21:22])
Error in uniroot(d1fun, c(1, 50), extendInt = "upX", S = S, N = N, ...) : 
  did not succeed extending the interval endpoints for f(lower) * f(upper) <= 0
In addition: Warning message:
In log(1 + N/x) : NaNs produced

> vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,22:23])
Error in uniroot(d1fun, c(1, 50), extendInt = "upX", S = S, N = N, ...) : 
  did not succeed extending the interval endpoints for f(lower) * f(upper) <= 0
In addition: Warning message:
In log(1 + N/x) : NaNs produced

> vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24])
e80983a81553e91291e7c4713fe4c0a9 918b76e7c690d758f4b6468ce64402b5 05fa7b1e4fdc0722a0d4870452195cc0 
                               1                                1                                1 
36a895ff5d15139c829a80473629e636 bde998bfe8216396964d3722470ded9c f51bc0209843da239ebb73eebf50e48d 
                               1                                1                                1 
968953f16bc08eff934d3b7c12e0e986 1f5cc7a9bad94196d3fb901567ca5855 e7fc2d63b8792cb23aaead3d740e3832 
                               1                                1                                1 
e49b88a31b09a87049c14d10976a1c5e 
                               1

For some reason, there is an issue from column (sample) 22, but I am not quite sure what is causing this problem.

> otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,22]
OTU Table:          [10 taxa and 1 samples]
                     taxa are rows
                                 Allomerus_decemarticulatus_37_3
e80983a81553e91291e7c4713fe4c0a9                               0
918b76e7c690d758f4b6468ce64402b5                               0
05fa7b1e4fdc0722a0d4870452195cc0                               0
36a895ff5d15139c829a80473629e636                               0
bde998bfe8216396964d3722470ded9c                               0
f51bc0209843da239ebb73eebf50e48d                               6
968953f16bc08eff934d3b7c12e0e986                               0
1f5cc7a9bad94196d3fb901567ca5855                               0
e7fc2d63b8792cb23aaead3d740e3832                               0
e49b88a31b09a87049c14d10976a1c5e                               0

Any help is very much appreciated!

Best, Ding

jarioksa commented 3 years ago

The error is indicative of a data that are not suitable for Fisher α. It is possible to change the function so that it does not return an error, but gives a number instead. However, I am not sure that this number makes sense. I will need some time to think what to do. Either I change the function to always return numbers, or check input data and refuse to handle, say 1-species communities. (Fitting a Fisher abundance distribution model to a community of one species really makes no sense). Your example data contained several such cases, in particular with your two column data. No species (both 0) were handled OK, but only one species was really misleading, and the returned numbers were useless.

dinhe878 commented 3 years ago

Thanks @jarioksa for the explanation. However, I am pretty sure that all samples of my data (the example is just a tiny portion of it) contains more than more taxa (ASVs). For example, the column (sample) 22 in my data has 451 taxa:

> sum(as.vector(otu_table(GAGA_ampSeq.physeq.rarefy)[,22]) != 0)
[1] 451

And the summary of my data:

> GAGA_ampSeq.physeq.rarefy
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 14889 taxa and 237 samples ]
sample_data() Sample Data:       [ 237 samples by 5 sample variables ]
tax_table()   Taxonomy Table:    [ 14889 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]

I found a relevant post that reported the same issue: joey711/phyloseq#1009

jarioksa commented 3 years ago

The diversity is calculated for rows, not for columns. So it is the first index (row) that you need to inspect, not the second (column). You see this also from the output: your first index has name e80983a81553e91291e7c4713fe4c0a9 which is a row name. When you calculate diversity with indices vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24]), you ask diversity for ten rows (1:10), each index based on two species 23:24, and many of these rows are double zeros. Similarly, also in your first example, you only have five species 20:24, and for the first row, these all have zero values. So you need to check the data by rows.

dinhe878 commented 3 years ago

When you calculate diversity with indices vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24]), you ask diversity for ten rows (1:10), each index based on two species 23:24, and many of these rows are double zeros. Similarly, also in your first example, you only have five species 20:24, and for the first row, these all have zero values. So you need to check the data by rows.

Ok thanks. But then what puzzles me still is that the issue only rose with

vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,21:22]) and vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,22:23])

but not

vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,20:21]) and vegan::fisher.alpha(otu_table(GAGA_ampSeq.physeq.rarefy)[1:10,23:24])

Maybe I didn't understand correctly....so if the problem was caused by "all-zero" rows then all of above examples should give the same errors as they all contains "all-zero" rows, right?

And I checked my data, no single ASV is "all-zero" across all samples (i.e. identical taxa count before/after prune_taxa):

> no_empty_ASVs <- prune_taxa(taxa_sums(GAGA_ampSeq.physeq.rarefy)>0,GAGA_ampSeq.physeq.rarefy)

> GAGA_ampSeq.physeq.rarefy **(before prune_taxa)**
phyloseq-class experiment-level object
**otu_table()   OTU Table:         [ 14889 taxa and 237 samples ]**
sample_data() Sample Data:       [ 237 samples by 5 sample variables ]
tax_table()   Taxonomy Table:    [ 14889 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]
> no_empty_ASVs
phyloseq-class experiment-level object **(after prune_taxa)**
**otu_table()   OTU Table:         [ 14889 taxa and 237 samples ]**
sample_data() Sample Data:       [ 237 samples by 5 sample variables ]
tax_table()   Taxonomy Table:    [ 14889 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 14889 tips and 14841 internal nodes ]

jarioksa commented 3 years ago

I was not writing about error, but about inability to calculate diversity for empty rows. The error was a good thing because it alerted about the problems, and I have now changed the function in github to that effect (not an error, but giving NA).

Fisher α is based on species abundance distribution and you cannot have abundance distribution if you have no species or one species. The equations were not valid for these situations, but they could give a number looking like a result. That number had little do with diversity, but it just appeared when the equations were used outside their valid domain. If you had an empty row (all zeros), the result was 1: no error, but also no sense. If you only had one species, you could get a number, and the numeric value depended on the abundance of that one species. However, if that one non-zero integer was 6 or higher, you got an error. I could have changed the function so that you also get a number for these cases (I tested and it worked), but these numbers make little sense, and therefore I instead disabled calculation (or return NA) when there is only one species.

I have been aware of these problems for a long time, but I have assumed that the users of these functions know what they do and have not prevented obvious misapplications. It seems that in particular in phyloseq these functions are used in cavalier way, and the next version of vegan will try to prevent the misuse.

dinhe878 commented 3 years ago

Thank you @jarioksa!

Just to mention that after upgrading the package, for unknown (yet) reason that my full dataset still produced the same error message, although the example above now produces expected results.

At this point I'll just tentatively skip fisher alpha estimation and work with other indices.

Thanks a again for your help!

jarioksa commented 3 years ago

The next version was not released yet (and won't be released for a long time). If you want to get the bleeding edge version prior to its release, you should download vegan from github.

dinhe878 commented 3 years ago

I thought that's what I did:

devtools::install_github("vegandevs/vegan")
> packageVersion("vegan")
[1] ‘2.6.0’

Or this is still the stable version?

jarioksa commented 3 years ago

I didn't build any full proof protection against failing estimation of Fisher α, but just blocked two most obvious bad cases: no species or one species. It is possible to have communities with two or more species that are so different from the Fisher abundance distribution model that estimation fails. For instance fisher.alpha(c(40,42)) fails. Fisher α is based on function fisherfit that tries to fit the Fisher abundance model to the observed abundance distribution and returns the single model parameter α as the diversity index. If the Fisher abundance model is absolutely unsuitable, an error may result. Even if there is no error, the estimate can be very unreliable. For instance, would you think that the fitted line and its single parameter α = 0.372 is a good descriptor of the data in the attached graph? Disabling calculation in the worst cases makes sense, but it is hard to decide where to put the limit of worst cases (0 and 1 species are always wrong, but two species can fit nicely, although never with high reliability). However, the index should not be used light heartedly, but the user should know what they want and what they do, or they end up shooting themselves in the foot. I think phyloseq should not calculate Fisher α unless specifically requested. fishy

dinhe878 commented 3 years ago

Thanks a lot @jarioksa for the explanation! It would be great if the error/warning messages could provide more information. As it is for now the error indicates more like an "technical error" rather than, say a warning "bad model fitting, i.e. fisher estimation may be unreliable". It might also make other packages easier to handle so any following steps won't be blocked because of a "appears to be" error raised.

vegandevs / vegan

fisher.alpha issue #412