Predictions using new data

earmingol commented 5 years ago

Hi @reptalex ,

I have a new issue related with predicting with new data:

First, I created this object:

pf_PhyloFactor <- PhyloFactor(train_OTUs, filtered_tree, train_MetaData, frmla = PHENOTYPE~Data, nfactors=factors, choice='F' ,ncores=ncores, family='binomial')

After running phylofactor and generating the object "pf_PhyloFactor" (in this case, from using an OTU table named train_OTUs), I can perform predictions without problems if I use this command:

predict(pf_PhyloFactor, newdata = NULL, type = "response")

In this case, I get predictions based on the data used for generating pf_PhyloFactor (i.e. train_OTUs). However, when I try to use a new data (e.g. test_OTUs) with the same structure as train_OTUs (same OTUs or Species but different samples) and I use:

predict(pf_PhyloFactor, newdata = test_OTUs, type = "response")

I have the following error:

Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): invalid type (list) for variable 'Data'
Traceback:

1. predict(pf_PhyloFactor, newdata = test_OTUs, type = "response")
2. predict.phylofactor(pf_PhyloFactor, newdata = test_OTUs, type = "response")
3. do.call(stats::glm, args) %>% stats::predict(...)
4. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
5. eval(quote(`_fseq`(`_lhs`)), env, env)
6. eval(quote(`_fseq`(`_lhs`)), env, env)
7. `_fseq`(`_lhs`)
8. freduce(value, `_function_list`)
9. withVisible(function_list[[k]](value))
10. function_list[[k]](value)
11. stats::predict(., ...)
12. predict.glm(., ...)
13. predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == 
  .     "link", "response", type), terms = terms, na.action = na.action)
14. model.frame(Terms, newdata, na.action = na.action, xlev = object$xlevels)
15. model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels)

I also read in the tutorial that I have to create a new column with 'Species'.

I tried this:

sp_ <- test_OTUs
sp_['Species'] <- rownames(test_OTUs)
predict(pf_PhyloFactor, newdata = sp_, type = "response")

And I got exactly the same error as before.

Do I need any additional step to perform predictions with the new data?

Thank you!

reptalex commented 5 years ago

Hi @earmingol - thanks for finding this and pointing it out!

I'll take a closer look, but I'm 95% sure I intended the prediction functionality for PhyloFactor output to only build a data matrix (i.e. Data~PHENOTYPE), something I forgot to specify in the help file.

The risk of burying predictions of PHENOTYPE under the hood of PhyloFactor is that PhyloFactor can produce up to D-1 factors, where D is the number of species in the dataset. Typically, at least in the microbiome world, there are far more species than observations. Consequently, D-1 contrast basis elements and their projections can be obtained in a stagewise fashion, but using nfactors of these together to predict n phenotypes, for n<<nfactors, will produce an underdetermined system.

If you want to use the interpretable features of phylofactor to predict PHENOTYPE, though, you have a lot of options. To access those options, you can produce a matrix of some integer k features with

X <- t(pf$basis[,1:k]) %*% log(pf$Data)

The kth row of X corresponds to your kth phylofactor. You can then transpose X, align it with a column vector, y of phenotypes, and use your method of choice for predictions (e.g. glm, gam, etc.) with the formula y~s(X) for some function (or smoothing spline) s(X).

reptalex commented 5 years ago

Alternatively, if you want to see what factor k predicts, you can do

predict(pf$models[[k]],...)

earmingol commented 5 years ago

Thanks @reptalex ! This was very useful :)

I will try the approach of computing X and generating a model from that.

earmingol commented 5 years ago

An update. I tried the approach of using:

X$Data <- t(pf_PhyloFactor$basis[,1:factors]) %*% log(pf_PhyloFactor$Data)

And then

logit <- glm(PHENOTYPE~Data, data = X, family='binomial')

To finally plug the new data and execute:

Y$Data <- t(t(pf_PhyloFactor$basis[,1:factors]) %*% log(new_data))
predicted_y <- predict(logit, newdata = Y, type = "response")

Where X and Y are the MetaData for the train and test (new_data) data

And it worked without issues!

Thanks @reptalex !

reptalex commented 5 years ago

Glad to hear it worked well for ya!

Also, note that if your goal is to predict phenotype, there is some pretty wicked functionality available to you through nonlinear models. You can do phylofactorization via a generalized additive model (method='gam' and a formula like Phenotype~s(Data)), allowing you to also find lineages whose intermediate abundance is predictive of phenotype. Then, you can produce the explanatory matrix of phylogenetic factors, X, as above, and input this into whatever kind of classifier you want (e.g. SVMs, randomForest, gbm, etc.).

The real crux you'll find in cross-validation analyses with micorbiome data are problems produced by the addition of new species. This issue first came up in the Microbiome Stress Project analyses, and then I worked with Lisa Marotz & Jamie Morton to flesh this out in an upcoming paper on reference frames - stay tuned! However, due to potentially shifting reference frames in microbiome training/testing datasets, I've found that different aggregation and contrast functions can be more stable to moving the distribution of counts around to new species (geometric means are something of a measure of how even the relative abundance within a lineage is distributed - adding new, rare species can change such evenness and thereby shift the reference frame a bit). For size, try out the input contrast.fcn=amalgamate and then produce your features using the groups, contrast.fcn, transform.fcn and Data on your phylofactor object:

X=sapply(pf$groups[1:k],pf$contrast.fcn,pf$transform.fcn(pf$Data))

If it comes up again, I might make the above one-line into a function getFeatures or something like that. Minor note: don't transpose this X before inputting it into other fitting tools in R.

earmingol commented 5 years ago

Wow, your help has definitely been comprehensive! Thanks again @reptalex !

With this, I was able to train a random forest model after using phylofactor :)

Soon I will create a repo with some analyses including an iris-like dataset for microbiomes as example and post it here. It may be useful for other people that want to use this tool :)

reptalex commented 5 years ago

Glad to hear it!

One thing to note is that a random forest with your CLR-transformed, full dataset will have a high probability of overperforming a random forest with a phylofactor-derived low-rank ILR-transformed dataset. The reason for this is that the CLR-transformed data of D species fall on a D-1 dimensional hyperplane, and all phylofactorization is doing is finding an alternative basis for that hyperplane.

I believe this doesn't hold for amalgamations, as the amalgamation (also called the SLR or summed log-ratio) is a nonlinear transformation.

Also, if a random forest of phylofactorization's ILR projection outperforms a random forest on the full CLR dataset in cross-validation, especially in which the features from the testing data fall outside the range of training data, the this produces some meaning that the alternative basis we've chosen may be orthogonal to Bayes decision boundaries or parallel to gradients for our response variable, as random forests (at least as implemented in the current R package) make piecewise constant fits over the space defined by nice, rectangular boundaries, and it thus finds very simple, piecewise constant projections whose boundaries are along the axes of our basis outside the space of our training data (think: Piet Mondrian art).

Science!!!

earmingol commented 5 years ago

Thanks again!! @reptalex

That is definitely something to implement in future analyses :)

Here l put some examples of using phylofactor with a "toy" dataset :) https://github.com/earmingol/Microbiome-Factorization

I hope this could be useful for other people.

reptalex / phylofactor

Predictions using new data #21