lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0

Can gappa be used on metadata containing phylogenetic information of the host? #13

Closed Jigyasa3 closed 3 years ago

Jigyasa3 commented 3 years ago

Hey!

Can the correlations between the tree and the metadata for factorization accommodate phylogenetic information of the host as metadata? Eg. I want to compare if the microbial 16S tree can be factorized by host phylogenetic tree and diet information.

Thanks!

lczech commented 3 years ago

Hey!

I am not entirely sure that I fully understand your question. PhyloFactorization and our placement-based adaptation of it, Placement-Factorization, are for the following use case: You have a set of environmental samples, each of them coming with some metadata. That is, you have a table where each row is a sample, and different columns are different types of metadata. See for example: https://github.com/lczech/gappa/wiki/Subcommand:-placement-factorization#input-meta-data

The method then identifies branches in the tree that "factor out" the phylogenetic placement distribution of each sample. That is, we find branches across which a difference in the placement distribution of different samples correlates with a difference in the metadata of these samples. This correlation is quantified via a Generalized Linear Model, which can work with different types of per-sample metadata, such as numbers (e.g., pH value of the environment where the sample was taken from), indicators (true/false, e.g., presence/absence of something), or categorical variables (a set of options, e.g., eye color, or whatever makes sense for the organisms you are working with).
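To make the idea of "a difference in placement distribution across a branch correlating with metadata" a bit more concrete, here is a minimal toy sketch. It is not gappa's implementation: it swaps in a plain log-ratio of the placement mass on the two sides of a single branch and an ordinary least-squares fit in place of the Generalized Linear Model described above. All sample names and numbers are made up.

```python
import math

# Hypothetical toy data: for ONE branch of the reference tree, the fraction of
# each sample's placement mass that falls on one side of that branch.
mass_on_side_a = {"s1": 0.80, "s2": 0.65, "s3": 0.50, "s4": 0.30, "s5": 0.15}

# Hypothetical per-sample metadata (a single numerical column, e.g. pH).
ph = {"s1": 5.0, "s2": 5.5, "s3": 6.0, "s4": 7.0, "s5": 7.5}


def log_ratio(fraction):
    """Contrast between the two sides of the branch as a log-ratio."""
    return math.log(fraction / (1.0 - fraction))


def least_squares(xs, ys):
    """Fit y = intercept + slope * x; return (intercept, slope, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


samples = sorted(mass_on_side_a)
xs = [ph[s] for s in samples]
ys = [log_ratio(mass_on_side_a[s]) for s in samples]
intercept, slope, r_squared = least_squares(xs, ys)
print(f"slope={slope:.3f} r_squared={r_squared:.3f}")
```

A branch whose contrast is well explained by the metadata (high `r_squared` in this toy version) is the kind of branch the method would pick out. The actual method evaluates all branches and repeats the search.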

So, in short, if your metadata (information about the host, diet information) are variables that you have for each of your samples, and if these variables are numerical, logical (true/false), or categorical, then the method should be able to work with that.
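As a rough illustration of what such a per-sample table could look like for host and diet information, here is a small sketch that reads a table and reports whether each column looks numerical, logical, or categorical. The column names, values, and the CSV layout are assumptions made up for this example. The wiki page linked above is the reference for the format that gappa actually expects.

```python
import csv
import io

# Hypothetical metadata table: one row per sample, one column per feature.
# Column names and values are invented for illustration only.
TABLE = """sample,host_diet,host_body_mass_g,wood_feeding
host_01,wood,2.1,true
host_02,soil,3.4,false
host_03,wood,1.8,true
host_04,fungus,2.9,false
"""


def column_kind(values):
    """Classify a column as 'logical', 'numerical', or 'categorical'."""
    lowered = [v.strip().lower() for v in values]
    if all(v in ("true", "false") for v in lowered):
        return "logical"
    try:
        for v in lowered:
            float(v)
        return "numerical"
    except ValueError:
        return "categorical"


rows = list(csv.DictReader(io.StringIO(TABLE)))
feature_names = [name for name in rows[0] if name != "sample"]
kinds = {name: column_kind([row[name] for row in rows]) for name in feature_names}
for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

If every column of your host information ends up in one of these three kinds, it fits the description above.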

Does that answer your question? If not, please elaborate a bit, or maybe provide a small example of the data that you are working with.

Cheers, Lucas

Jigyasa3 commented 3 years ago

Thank you @lczech for a detailed description! I understand the concept, and I have used the phylofactor R package too. I was just wondering if it's possible to use the Phylogenetic generalized linear model (PGLS) using the host phylogenetic tree (time tree) as metadata? I am currently using PGLS in the caper R package, and using different models I can test how the microbial trait has evolved overall across the host tree. For example, using a BM or OU model, I can test if a microbial gene relative abundance has a phylogenetic signal across the host tree. But I am more interested in factorizing the host tree (similar to other metadata you mentioned above) based on microbial relative abundance to make some inference about the microbial function evolution across different branches of the host tree.

I was thinking of using the host tree instead of microbial 16S (or marker gene tree) for factorization, but I have 100s of microbial genes for a single host tip. Do you think gappa can be somehow modified for a dataset like this?

Looking forward to your reply!

lczech commented 3 years ago

Hey!

So, I am still not following you. I think it might just be a matter of terminology, as you keep referring to the tree as "metadata", which just doesn't make sense in the context of the Factorization terminology. Let's try to resolve this misunderstanding (whoever is misunderstanding whom here... maybe I just get it wrong).

> I was just wondering if it's possible to use the Phylogenetic generalized linear model (PGLS) using the host phylogenetic tree (time tree) as metadata?

How can your tree be metadata? The algorithm needs a set of metadata features per sample. So, if you have a tree per sample (I don't know - I'm just trying to make sense of what you want), you could somehow attempt to represent that tree as a per-sample feature vector, and then yes, technically, you could feed that into the algorithm. Not sure though what this would tell you. You would factorize the branches of the actual tree that was used for placement by some set of per-sample trees (that I still don't understand). Is that what you have in mind?
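One conceivable way to turn a tree into per-sample features, purely as a sketch, is to give each sample one true/false indicator per host clade. This assumes a single host tree where each sample corresponds to one host tip, which is a guess at the setup and not something established in this thread. The tree, the host names, and the encoding are all hypothetical.

```python
# Hypothetical host tree as nested tuples; leaves are host names.
HOST_TREE = ((("host_A", "host_B"), "host_C"), ("host_D", "host_E"))


def leaves(node):
    """Return the sorted leaf names below a node."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return sorted(names)


def internal_clades(node, is_root=True):
    """List the leaf sets of all internal nodes except the root."""
    if isinstance(node, str):
        return []
    clades = [] if is_root else [leaves(node)]
    for child in node:
        clades.extend(internal_clades(child, is_root=False))
    return clades


clades = internal_clades(HOST_TREE)

# One indicator (true/false) column per host clade, one row per host.
features = {
    host: [host in clade for clade in clades]
    for host in leaves(HOST_TREE)
}
for host, row in features.items():
    print(host, row)
```

Note that the two clades directly below the root are complements of each other and therefore give redundant columns, so a real encoding would need more care. Whether factorizing by such indicators tells you anything meaningful is exactly the open question raised above.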

> ...how the microbial trait has evolved overall across the host tree

That sounds like a more normal use case to me: you have some trait value for the samples, and factor the tree by that. That would be pretty straightforward.

> But I am more interested in factorizing the host tree (similar to other metadata you mentioned above) based on microbial relative abundance to make some inference about the microbial function evolution across different branches of the host tree.

That also sounds reasonable to me. So, that sounds like you would want to take that tree, phylogenetically place your microbial samples on it to get relative abundances, and then use some metadata feature that represents your function evolution in order to do the factorization? Is that what you had in mind there?

> I was thinking of using the host tree instead of microbial 16S (or marker gene tree) for factorization, but I have 100s of microbial genes for a single host tip. Do you think gappa can be somehow modified for a dataset like this?

Well, either tree should work, given the jplace file of your microbial genes being phylogenetically placed on that tree. Not sure what modification you would need for that. That is the standard use case, if I understand correctly.
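For orientation, a jplace file is a JSON document holding the reference tree with numbered edges plus a list of placements per query sequence. The sketch below parses a made-up minimal jplace document and sums the `like_weight_ratio` values per edge to get a normalized per-edge mass, which is roughly the kind of per-sample placement distribution discussed in this thread. The tree, query names, and numbers are invented, and only the `n` name list is handled (not named multiplicities). It illustrates the file layout and does not reproduce gappa's internals.

```python
import json
from collections import defaultdict

# Hypothetical minimal jplace document (toy tree, names, and numbers).
JPLACE_TEXT = """
{
  "version": 3,
  "tree": "((A:0.1{0},B:0.2{1}):0.05{2},C:0.3{3}){4};",
  "fields": ["edge_num", "likelihood", "like_weight_ratio",
             "distal_length", "pendant_length"],
  "placements": [
    {"n": ["gene_1"], "p": [[0, -1234.5, 0.7, 0.05, 0.01],
                            [2, -1236.0, 0.3, 0.02, 0.01]]},
    {"n": ["gene_2"], "p": [[3, -1100.2, 1.0, 0.10, 0.02]]}
  ],
  "metadata": {"invocation": "hypothetical example"}
}
"""


def edge_masses(jplace):
    """Sum like_weight_ratio per edge and normalize to a total of 1."""
    fields = jplace["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")
    masses = defaultdict(float)
    for pquery in jplace["placements"]:
        # Each name in "n" counts as one unit of mass in this sketch.
        multiplicity = len(pquery.get("n", [])) or 1
        for placement in pquery["p"]:
            masses[placement[edge_idx]] += multiplicity * placement[lwr_idx]
    total = sum(masses.values())
    return {edge: mass / total for edge, mass in sorted(masses.items())}


masses = edge_masses(json.loads(JPLACE_TEXT))
for edge, mass in masses.items():
    print(f"edge {edge}: {mass:.2f}")
```

With hundreds of microbial genes per host, each gene would simply be one more query in the placement list of that host's sample.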

Cheers, Lucas

lczech commented 3 years ago

Hey @Jigyasa3, any news on this?

lczech commented 3 years ago

Hey @Jigyasa3, this issue does not seem to be active any more. Closing it for now - feel free to re-open as needed!