lczech / gappa

A toolkit for analyzing and visualizing phylogenetic (placement) data
GNU General Public License v3.0
56 stars 7 forks source link

Best way to obtain tip label in edgePCA #23

Closed adriaaula closed 11 months ago

adriaaula commented 1 year ago

Hello!

I am playing around with gappa analyze edgepca, and I wanted to work programatically with which edges are the ones contributing the most for each component. I want to have an overview of the taxonomic groups that are important.

This information is coded in the transformation.csv but I am having a hard time connecting it to the data in the tree.

My tree presents 536 tips, 534 internal nodes and 1069 edges.

The transformation.csv file presents 534 columns. The first one presents the eigenvalues, and the following ones (533) are the eigenvectors, pointing us which is the influence of each eigenvector for that principal component.

In the paper, there is this linkage between eigenvectors and edges:

Each eigenvector can be displayed on the tree, because the coordinates of the eigenvector correspond to internal edges of the tree.

But then I would expect 1069 values in the eigenvector, way more than the 533 columns present in the file. What am I missing? In the nexus tree, the color is displayed in each label, but I am unable to retrieve the value easily. I have converted it to NHX format (with sed 's/\[&!/\[\&\&NHX:/g' and then cutting the tree) and then I can easily obtain the color palette for each edge, but afterwards I would need to catch with a regex from the stdout the palette for the relationship with the eigenvector.

In summary:

I have attached the files I have used here. abceprojection.csv abcetransformation.csv abcetree_0_nexus.txt

lczech commented 1 year ago

Hi @adriaaula,

I am playing around with gappa analyze edgepca, and I wanted to work programatically with which edges are the ones contributing the most for each component. I want to have an overview of the taxonomic groups that are important.

Very nice idea!

This information is coded in the transformation.csv but I am having a hard time connecting it to the data in the tree.

Indeed, that was not well developed, thanks for giving me the incentive to fix this ;-). Also, while I was looking into the issue, I found that there was a long-standing bug in there - where this table mixed up the order of those values. Please excuse this, and update gappa - it is fixed now in release gappa v0.8.4. I hope you were the first one to actually use that table, and that the bug does not have consequences for previous users...

So, now, the command produces a few new output files, see here. I also added a more detailed explanation of those values, and why they do not correspond to any number of inner or total edges in the tree - see there to (hopefully) find all the answers to your questions.

In particular, I have added a tree with the edge indices annotated, which allows to establish a connection between the eigenvalue table and the edges. Furthermore, as you suggested, there are NHX trees with these values produced now as well for each PCA component. Those should allow you to process the data programmatically. They can also for example be visualized in iTOL to get a similar representation as the colored tree files that were already part of the command:

Screenshot from 2023-04-27 22-28-00

If you really want to get into detail, apart from the EdgePCA paper itself, I recommend having a look at my PhD thesis, which in Sections 2.5.3-2.5.5 contains more details on the imbalances and the Edge PCA. Furthermore, the EdgePCA implementation in genesis might help to get the complete detail. Lastly, when working with annotated Newick trees, I think you might also want to have a look at our tree viewer review paper, as there are some pitfalls with the file format that you might want to be aware of.

That was a lot. Hope that helps! Lucas

lczech commented 1 year ago

Oh, some more things.

The Edge PCA uses a transformation of the placements per branch into a per-clade measure called "imbalances", as mentioned above. There is also a similar measure called "balances" (super confusing that both are so similarly named... but that's what the literature came up with), which has some advantages, and can be used in a similar fashion, also for a PCA (not implemented in gappa at the moment though - let me know if that would be interesting for you).

Also, if your goal is to find taxonomic groups that are important, I recommend having a look at the Placement-Factorization method, which is meant for exactly that purpose, but additionally takes meta-data such as environmental variables into account. It then finds the branches across which the differences in placements abundances correlate with these meta-data.

For both, I recommend Chapters 6-7 of my thesis. Please excuse the shameless self-promotion here :-) - I think though that it's the most detailed description of those methods, for which there was not enough room in the original publication.

Cheers Lucas

adriaaula commented 1 year ago

Hi @lczech. Thanks a ton for all the new options, this is really helpful!

And for your self-promotion, in spanish we would say something like Faltaria mas! which traduced would be like 'you are the most valuable source for that, the least we can do is read it!'

Thanks again and I let you know If I am able to make it work

adriaaula commented 1 year ago

Hi Lucas,

I have tested the new version and it works like a charm. I have however one more request: would it be possible to have another option to allow the taxonomic labeling of the inner nodes? As in the output of gappa examine assign, the labeled tree output.

It would make the analysis more straightforward with the NHX files.

If you really want to get into detail, apart from the EdgePCA paper itself, I recommend having a look at my PhD thesis, which in Sections 2.5.3-2.5.5 contains more details on the imbalances and the Edge PCA.

Thanks a ton for the thesis, it is really thorough and explained graphically which helps a lot in many parts. And thanks also with the tricks regarding the pitfalls for newick too!

The Edge PCA uses a transformation of the placements per branch into a per-clade measure called "imbalances", as mentioned above. There is also a similar measure called "balances" (super confusing that both are so similarly named... but that's what the literature came up with), which has some advantages, and can be used in a similar fashion, also for a PCA (not implemented in gappa at the moment though - let me know if that would be interesting for you).

In other literature they went with amalgamations for the imbalances! Albeit more specifically it would be the sums of logratios, and the amalgams an specific approach to that, such as the edgePCA. See here. And given our analysis (comparing housekeeping genes) it makes sense to use imbalances rather than balances because it gives more intuitive results and gives more weight to abundant taxons.

From the paper:

Recently, Greenacre et al. have challenged the interpretability of balances (19). We summarize the Greenacre et al. critique as follows: because the geometric mean depends on the ratios of the parts within, balances are not balances in the plain English sense of the word. Consider the balance between ‘a and b’ versus ‘c’whereb > c. We would expect that the ‘balance’ would lean toward the combined weight of ‘a’ and ‘b’. However, with a geometric mean, the balance will tip more toward ‘c’ when ‘a’ is rare. This is because the ilr balances are defined in log space. Instead of balances, Greenacre et al. proposed the summed log-ratio (SLR) as a more interpretable alternative (20).

The main difference I guess is that edge imbalances are from the edges, and not from the collections of tips/nodes.

Also, if your goal is to find taxonomic groups that are important, I recommend having a look at the Placement-Factorization method, which is meant for exactly that purpose, but additionally takes meta-data such as environmental variables into account. It then finds the branches across which the differences in placements abundances correlate with these meta-data.

Yes, I have seen it. In our study we want unconstrained analyses of the data, but for later analysis I will use it :)

I hope it is possible to include taxonomic info for the inner labels :)

adriaaula commented 1 year ago

One last thing: for the edgePCA analysis I am working with accumulated jplace, as in the output of gappa edit accumulate. It should not modifiy in a great deal, but does it make sense to you?

lczech commented 1 year ago

Hi @adriaaula,

thanks for your patience, I was a bit swamped in recent weeks.

I have tested the new version and it works like a charm.

Happy to hear!

I have however one more request: would it be possible to have another option to allow the taxonomic labeling of the inner nodes? As in the output of gappa examine assign, the labeled tree output. It would make the analysis more straightforward with the NHX files.

Hm, that is tricky, as those annotations are not part of the input jplace file, so I'd have to add an option to read an external annotated tree, and map its NHX annotations onto the tree in the jplace file... That is possible, but I'm afraid I'm a bit short on time to implement that at the moment. Get me some funding to do software maintenance though, and I'll be happy to do this :-D (an ongoing struggle in the bioinformatics world it seems, lack of funding once the paper is out...). Do you have other means of solving this for your use case?

In other literature they went with amalgamations for the imbalances! Albeit more specifically it would be the sums of logratios, and the amalgams an specific approach to that, such as the edgePCA. See here.

Thanks for sharing that paper, I'll look into it. Do you think that amalgams would be an interesting third way of transforming placement masses that could be useful for downstream analyses?

The main difference I guess is that edge imbalances are from the edges, and not from the collections of tips/nodes.

Not sure that I understand. In the context of phylogenetic placement, I've adapted the way the balances are computed, to also use the placement masses on the edges, and not the nodes. So, in a sense, balances and imbalances use the same data (placement masses on the edges), but do a different transformation on them (imbalances: difference of sums; balances: ratio of geometric means).

Yes, I have seen it. In our study we want unconstrained analyses of the data, but for later analysis I will use it :)

What do you mean by "unconstrained"? The way that Placement-Factorization operates on smaller and smaller clades in later iterations of the factorization would be "constrained" in your sense?

One last thing: for the edgePCA analysis I am working with accumulated jplace, as in the output of gappa edit accumulate. It should not modifiy in a great deal, but does it make sense to you?

The Edge PCA can run on any collection of jplace files (as long as they all use the same reference tree), so it's up to you what your data mean. Typically, I'd say that you want each jplace file to represent one unit or sample of interest. Whether that has been accumulated from multiple files before does not matter to the method - but it matters in terms of what you want to show with it, and how you interpret the results. Each output entry will be one accumulated sample in your case - so if that's the "unit" you need for your interpretation, that makes sense.

Cheers and so long Lucas

adriaaula commented 1 year ago

Hi Lucas :)

Do you have other means of solving this for your use case?

I found a simple way to solve it. Given that I can run gappa examine assign to obtain the labelled tree, I then have just joined both the labelled tree and the new trees with the eigenvalues that came with your new release.

One more question regarding the eigenvalues: image

In this result I can state that all this edge eigenvalues contribute almost equally to this principal component. Is that correct? I wonder if I can focus to the most outer edge if there is not a drop on the eigenvalue. Given that all this values carry more or less the same information, no? They are explanatory of the same variance, the placements located in the Obazoa / Opistokhonta part.

Do you think that amalgams would be an interesting third way of transforming placement masses that could be useful for downstream analyses?

I don’t have a clear response. On the basis of what I read, I think that amalgamations do not take into account phylogenetic data and therefore are on a different fashion that phylofactorization or other approaches. Given that gappa focuses mainly in phylogenetic information maybe it doesn’t make that much sense.

Not sure that I understand. In the context of phylogenetic placement, I've adapted the way the balances are computed, to also use the placement masses on the edges, and not the nodes. So, in a sense, balances and imbalances use the same data (placement masses on the edges), but do a different transformation on them (imbalances: difference of sums; balances: ratio of geometric means).

Basically, the methods I was talking about work with OTU-table matrices, and afterwards perform calculations of balances, imbalances and similar approaches. They are therefore working with node-values, since the OTU would be the tip in the phylogenetic trees. I think this is one of the main differences.

What do you mean by "unconstrained"? The way that Placement-Factorization operates on smaller and smaller clades in later iterations of the factorization would be "constrained" in your sense?

For the analysis I am performing (comparison of housekeeping genes expression patterns, I am working with Daniel Richter :)) I do not want to use metadata values such as pH, temperature and other to understand how they influence my values. For constraining I was specifically talking about this situation, in which you try to disentangle which branches are differentiated over a parameter of interest, constraining the analysis to that. Sorry, I was not clear.

The Edge PCA can run on any collection of jplace files (as long as they all use the same reference tree), so it's up to you what your data mean. Typically, I'd say that you want each jplace file to represent one unit or sample of interest. Whether that has been accumulated from multiple files before does not matter to the method - but it matters in terms of what you want to show with it, and how you interpret the results. Each output entry will be one accumulated sample in your case - so if that's the "unit" you need for your interpretation, that makes sense.

Yes, each jplace file represents one sample of interest. Rather than accumulating over multiple files, I was referring to the accumulation of LWR for each specific placements:

image

I am performing this before performing the edgePCA, and I wonder if thats necessary. I think it is not, but I wanted confirmation :)

lczech commented 1 year ago

Hi @adriaaula,

In this result I can state that all this edge eigenvalues contribute almost equally to this principal component. Is that correct?

By "edge eigenvalues" you mean the individual components of the first eigenvector, or where do these numbers come from?

I wonder if I can focus to the most outer edge if there is not a drop on the eigenvalue.

If the values in your table are the first eigenvector, then these are the values that are also visualized on the colored tree produced by the command, as seen in the tree figure here. The left hand tree is the first component, and the right hand one the second component. As you can see, in the first tree, there is a path down to the clade that is most responsible for this PCA component, but in that figure, it splits into two sub-paths.

So, with respect to your analysis, it depends a bit on what you want to do there. If your values really stay (roughly) the same all the way down to an outer edge, then that indicates that abundance differences outwards of that particular edge explain a lot of variance. However, you might also encounter cases where the signal is spread out a bit more across a clade - depending on what you want to achieve, you might need to take care of that. For example, in the figure linked about, this splitting into two sub-paths towards the outer part of the tree indicates that there are two taxa whose abundance differences are driving this PCA component, instead of just one.

To deal with this, for example, in Placement-Factorization, we pick a "winning" edge - the one with the highest value (computed with a GLM in that case, but similar argument). That could be an outer edge, or more inwards in the tree - in which case it would describe a whole clade.

So, in short, it depends on what you want to do here. If you want some form of taxonomic assignment, using the edge with the highest value in its first eigenvector (the "winning" one in Placement-Factorization terminology), and going with the highest level taxonomic label that the edge corresponds to (using the consensus of all leaf edge taxonomic labels of that clade) might be a good approach.

Basically, the methods I was talking about work with OTU-table matrices, and afterwards perform calculations of balances, imbalances and similar approaches. They are therefore working with node-values, since the OTU would be the tip in the phylogenetic trees. I think this is one of the main differences.

Ah okay, thanks for the explanation. Yeah, I see the point, but then again, it does not make much of a difference, does it? There is a 1-to-1 correspondence between nodes and edges, where each node has an edge leading towards the root, so this data can be interpreted and used in the same way. Like - we could assign all edge masses of the placements to the node outwards from that edge, and would end up with data that can be used with node-based methods. Maybe that's useful in your case ;-)

... in which you try to disentangle which branches are differentiated over a parameter of interest, constraining the analysis to that.

Ah okay. I don't think that "constraining" is the word that best fits here (but I'm not a mathematician, so take it with a grain of salt). Using such explanatory variables is not constraining your analysis - it's seeing how much of the variance in your data can be explained by variance in those meta-data. But yes, this is disentangling the data, or in some sense, controlling for external factors.

Rather than accumulating over multiple files, I was referring to the accumulation of LWR for each specific placements. I am performing this before performing the edgePCA, and I wonder if thats necessary. I think it is not, but I wanted confirmation :)

Oh interesting. What is the reason that you perform the accumulation first? It of course depends on what you want to do there exactly. But without any more context, I'd advise against this, because you are losing information. By accumulating masses towards the inner branches, you lose detail on the outer ones.

I am working with Daniel Richter :)

Haha niiiice, say hi!

Hope that helps, cheers, and so long Lucas

lczech commented 11 months ago

Hey @adriaaula,

I assume closing the issue means you figured out a way? Curious to see what you came up with! If you want, let me know what you did there :-)

Cheers Lucas

adriaaula commented 11 months ago

Hi!

No, I didn't solve it. Given that I will be working with clades of interest, I continued without trying to work with one single edgePCA applied to 100s of components, but rather multiple edgePCAs applied to each different clade. Maybe the best option is the first one though, but since I want to compare patterns between genes at different taxonomic ranks this felt more intuitive for me.