Closed earmingol closed 5 years ago
Hi @earmingol - thanks for finding this and pointing it out!
I'll take a closer look, but I'm 95% sure I intended the prediction functionality for PhyloFactor
output to only build a data matrix (i.e. Data~PHENOTYPE
), something I forgot to specify in the help file.
The risk of burying predictions of PHENOTYPE
under the hood of PhyloFactor
is that PhyloFactor can produce up to D-1
factors, where D
is the number of species in the dataset. Typically, at least in the microbiome world, there are far more species than observations. Consequently, D-1
contrast basis elements and their projections can be obtained in a stagewise fashion, but using nfactors
of these together to predict n
phenotypes, for n<<nfactors
, will produce an underdetermined system.
If you want to use the interpretable features of phylofactor to predict PHENOTYPE
, though, you have a lot of options. To access those options, you can produce a matrix of some integer k
features with
X <- t(pf$basis[,1:k]) %*% log(pf$Data)
The kth
row of X
corresponds to your kth
phylofactor. You can then transpose X
, align it with a column vector, y
of phenotypes, and use your method of choice for predictions (e.g. glm, gam, etc.) with the formula y~s(X)
for some function (or smoothing spline) s(X)
.
Alternatively, if you want to see what factor k
predicts, you can do
predict(pf$models[[k]],...)
Thanks @reptalex ! This was very useful :)
I will try the approach of computing X and generating a model from that.
An update. I tried the approach of using:
X$Data <- t(pf_PhyloFactor$basis[,1:factors]) %*% log(pf_PhyloFactor$Data)
And then
logit <- glm(PHENOTYPE~Data, data = X, family='binomial')
To finally plug the new data and execute:
Y$Data <- t(t(pf_PhyloFactor$basis[,1:factors]) %*% log(new_data))
predicted_y <- predict(logit, newdata = Y, type = "response")
Where X and Y are the MetaData for the train and test (new_data) data
And it worked without issues!
Thanks @reptalex !
Glad to hear it worked well for ya!
Also, note that if your goal is to predict phenotype, there is some pretty wicked functionality available to you through nonlinear models. You can do phylofactorization via a generalized additive model (method='gam'
and a formula like Phenotype~s(Data)
), allowing you to also find lineages whose intermediate abundance is predictive of phenotype. Then, you can produce the explanatory matrix of phylogenetic factors, X
, as above, and input this into whatever kind of classifier you want (e.g. SVMs, randomForest, gbm, etc.).
The real crux you'll find in cross-validation analyses with micorbiome data are problems produced by the addition of new species. This issue first came up in the Microbiome Stress Project analyses, and then I worked with Lisa Marotz & Jamie Morton to flesh this out in an upcoming paper on reference frames - stay tuned! However, due to potentially shifting reference frames in microbiome training/testing datasets, I've found that different aggregation and contrast functions can be more stable to moving the distribution of counts around to new species (geometric means are something of a measure of how even the relative abundance within a lineage is distributed - adding new, rare species can change such evenness and thereby shift the reference frame a bit). For size, try out the input contrast.fcn=amalgamate
and then produce your features using the groups
, contrast.fcn
, transform.fcn
and Data
on your phylofactor object:
X=sapply(pf$groups[1:k],pf$contrast.fcn,pf$transform.fcn(pf$Data))
If it comes up again, I might make the above one-line into a function getFeatures
or something like that. Minor note: don't transpose this X
before inputting it into other fitting tools in R.
Wow, your help has definitely been comprehensive! Thanks again @reptalex !
With this, I was able to train a random forest model after using phylofactor :)
Soon I will create a repo with some analyses including an iris-like dataset for microbiomes as example and post it here. It may be useful for other people that want to use this tool :)
Glad to hear it!
One thing to note is that a random forest with your CLR-transformed, full dataset will have a high probability of overperforming a random forest with a phylofactor-derived low-rank ILR-transformed dataset. The reason for this is that the CLR-transformed data of D species fall on a D-1 dimensional hyperplane, and all phylofactorization is doing is finding an alternative basis for that hyperplane.
I believe this doesn't hold for amalgamations, as the amalgamation (also called the SLR or summed log-ratio) is a nonlinear transformation.
Also, if a random forest of phylofactorization's ILR projection outperforms a random forest on the full CLR dataset in cross-validation, especially in which the features from the testing data fall outside the range of training data, the this produces some meaning that the alternative basis we've chosen may be orthogonal to Bayes decision boundaries or parallel to gradients for our response variable, as random forests (at least as implemented in the current R package) make piecewise constant fits over the space defined by nice, rectangular boundaries, and it thus finds very simple, piecewise constant projections whose boundaries are along the axes of our basis outside the space of our training data (think: Piet Mondrian art).
Science!!!
Thanks again!! @reptalex
That is definitely something to implement in future analyses :)
Here l put some examples of using phylofactor with a "toy" dataset :) https://github.com/earmingol/Microbiome-Factorization
I hope this could be useful for other people.
Hi @reptalex ,
I have a new issue related with predicting with new data:
First, I created this object:
After running phylofactor and generating the object "pf_PhyloFactor" (in this case, from using an OTU table named train_OTUs), I can perform predictions without problems if I use this command:
In this case, I get predictions based on the data used for generating pf_PhyloFactor (i.e. train_OTUs). However, when I try to use a new data (e.g. test_OTUs) with the same structure as train_OTUs (same OTUs or Species but different samples) and I use:
I have the following error:
I also read in the tutorial that I have to create a new column with 'Species'.
I tried this:
And I got exactly the same error as before.
Do I need any additional step to perform predictions with the new data?
Thank you!