reptalex / phylofactor

26 stars 9 forks source link

Raw data input to pylofactor #15

Closed SilasK closed 5 years ago

SilasK commented 5 years ago

Hey @reptalex

I started to experiment with your package, I really like it. I wanted to ask if it is necessary to add the relative abundant data into PhyloFactor (data that sums to 1). Because you anyway use a log transformation.

As I understand Microbiome Datasets Are Compositional: And This Is Not Optional, centic log transformation of the raw data would be more appropriate. What do you think?

reptalex commented 5 years ago

Hi Silas!

Great questions, and I'm glad you're enjoying the package!

First, you do not have to add relative abundance data into PhyloFactor. In fact, you can modify the input transform.fcn=I if you wish to analyze data in standard Euclidean space (i.e. without log-transformed data; the default is transform.fcn=log).

If you use the default log-transform and input data with zeros, there's a built-in mechanism for filling zeros (let 0 become 0.65*a where a is the minimum counts for that sample; for a sample with singleton OTUs this is just 0.65, for sample with the rarest species having 100 counts, this makes 0 into 65). There are far more techy zero-replacement schemes, so if you have strange data or particular preferences for how important the jump between 0 and 1 is, I recommend inputting data with all zeros replaced by your preferred method. As a side-note, let 0<=t<=1 be the value for zero replacements (in this case, all zeroes replaced by t). As t goes to 1, zeros become irrelevant - the jump between 0 and 1 is not a jump at all, so all the variance in the data will be changes in known abundances. As t goes to 0, log(t) goes to minus infinity and so the difference between log(t) and log(1) becomes infinitely large - in this case, the jump between 0 and 1 becomes the most important feature of your data. In other words, zero-replacement allows you to tune between what is effectively an abundance-only analysis (t=1) to a presence-absence analysis (t-->0).

Second, on the relationship between the CLR transform and the ILR transform (all coming back to whether/not one needs to use compositional data), it's important to note that the ILR transform (as used in PhyloFactor) is just a change of basis of the CLR transformed data. In other words, the data are in the same points in space, but you just use different axes to explain where they are. You can see this in the phylofactor object's pf$basis - it's a matrix whose columns are vectors orthogonal to the 1 vector (equivalently, this is orthogonal to the mean, i.e. the vector 1/D for D species). That orthogonality means that any "non-centered-ness" of the data on a log scale is not relevant for projecting the data onto the ILR basis.

To see this, denote v the ILR basis vector and x the mean-centered log data (e.g. the clr-transformed) vector. Suppose we didn't center it, meaning we didn't subtract the mean, c, from each element of x. In this case, we're looking at the vector

y=x+1c

And, using the orthogonality of the ILR basis element v and 1, the projection of y onto v will be

v'y=v'x+v'1c=v'x

Which shows the invariance of ILR projections to constant translations along the 1 vector, such as mean-centering.

Greg, Vera & Juan-Jo (the authors of that paper) are all aware of this, and likely said (or intended to say) that one should CLR-transform their data before using standard multivariate techniques (e.g. PCA, factor analysis and other funkiness). Notice that for PCA, one should always mean-center their data anyways, so PCA usually requires an input that is centered (although perhaps not log-transformed) data. The "centering" is good for all techniques which do not build off of vectors v orthogonal to 1, and the log-transform is good for all data which are appropriately heteroskedastic (variance growing like the square of the mean) or geometric.

Hope this helps you make sense of everything!

Best, Alex

P.S. If you want to take things to the next level, the function gpf allows one to do explicit, count-based phylofactorization (e.g. zeros are dealt with using assumptions of the count distribution). Generalized Phylofactorization (connecting phylofactorization with generalized linear models) is discussed in an upcoming Ecological Monograph, and currently available here on Biorxiv.

reptalex commented 5 years ago

P.P.S. I want to doubly emphasize that the zero-replacement of 0.65 is COMPLETELY ARBITRARY - as arbitrary as significance values of P=0.05; someone did it once years ago and everyone has used it since because conventions help us bypass battles over arbitrary yet consequential parameters.

I personally think that assumptions about distributions (e..g Poisson, Negative Binomial) are equally arbitrary and done largely for convenience. Such assumptions give a fixed probability of 0 for a given parameterization. Zero-inflated models will try to estimate the probability of being present at all and then make an additional assumption of a count distribution given the sequence/species is present. How small/important is a zero? Is it half of 1 (t=0.5), or is it one tenth of 1 (t=0.1)? There's no clear answer for this, and, while PhyloFactor has its own scheme, I encourage, at the very least, awareness of its arbitrariness and consideration of alternatives. If any one method pops out as a conventional one for these count data, I intend to incorporate it into the "black-box".

Science!!!!

SilasK commented 5 years ago

I think this is the most comprehensive message I got fro a github issue. Thank you very much.

So CLT is a special version of ILR transformation. So can I simpli input the raw counts and they are transormed by the ILR ? Or do I need to use ILR before and transform.fcn=I ?

reptalex commented 5 years ago

The CLR data are as linearly dependent as relative abundance data - with D species and D-1 degrees of freedom (in a relative abundance vector) due to compositionality, you'll have D CLR coordinates with each coordinate corresponding to a species/OTU and the data still only have D-1 degrees of freedom. The ILR transform maxes out at D-1 coordinates with each coordinate usually corresponding to a node in a sequential binary partition of the set of species. PhyloFactor will yield the same number of coordinates as pf$nfactors. The CLR transform isn't quite a special version of the ILR transform, as the basis elements aren't orthogonal. Instead, the CLR transform projects relative abundance data onto an unbounded plane connecting Aitchisonian geometry of the simplex to Euclidean geometry on the plane, allowing one to use standard Euclidean-based tools (e.g. subtract vectors to measure how different/distant they are, as opposed to taking ratios).

A good exercise to get a comprehensive picture of the CLR and ILR transforms is to figure out which set of vectors, {w}, one needs to project a log-transformed dataset X (don't worry about compositionality) onto in order to yield CLR-transformed data,

{y|y=w'X for some w above}, yielding a new dataset Y=W'X

The matrix whose columns are these vectors, W, is square but it is not an orthogonal matrix nor is it full rank (the vectors, {w}, are linearly dependent). Each vector w is orthogonal to 1, but not to one-another. For comparison, you can write out the ILR basis one would construct with phylofactorization with a star phylogeny (a tree with every species radiating from a single common ancestor with no intermediate nodes), picking edges arbitrarily.

For a more topological view, the CLR transform stretches a simplex - an awkward, bounded triangle in the positive orthant - into a plane. The simplex is a D-1 dimensional bounded plane, the log-simplex is a D-1 dimensional warped surface (not a plane), but the CLR-simplex is an unbounded, D-1 dimensional plane for which geometric shifts on the simplex become arithmetic shifts on the plane (e.g. clo(a(x)x), where (x) represents element-wise multiplication, becomes y+c for positive vector a, simplex-valued vector x, and CLR-plane vectors y and c). An ILR transform is any D-1 dimensional, orthonormal basis for the CLR plane (with a special case being those bases, like the ones PhyloFactor constructs, which correspond to a sequential binary partition).

On the topic of inputting raw/clr-transformed data: you can input raw counts directly into PhyloFactor with appropriate caution for total sequence counts - you don't need to rarify your data, but it's probably wise to have a minimum number of sequence counts for inclusion of a sample - and also with agreement over the treatment of zeros.

The results from inputting raw data (with zeros replaced) into PhyloFactor will be equivalent to inputting the CLR transformed data (or even just log-transformed data) and setting transform.fcn=I. On the topic of zeros, you'll notice the need to replace zeros before taking the CLR transform, hence zeros are always an issue for these count data.