matsengrp / gctree

GCtree: phylogenetic inference of genotype-collapsed trees
https://matsengrp.github.io/gctree
GNU General Public License v3.0

interface with SAMM or bypass Galton branching #104

Open Lan-h opened 1 year ago

Lan-h commented 1 year ago

Hi, thank you for your tool. I am more interested in the mutability likelihood than the one computed by default after inferring the parameters of the Galton branching process. I tried to use the .p file (returned either by phylip_parse or gctree infer), but it seems the other tool from your lab, SAMM (https://github.com/matsengrp/samm), cannot take a whole parsimony forest as input. Is there a way to parse the .p file into single trees that could be processed by SAMM, or to use the gctree infer mutability options without computing the two Galton parameters (my trees are quite large, and this step can be time intensive)? By the way, is there a difference between the mutability likelihoods computed by SAMM and gctree?

Thank you in advance for your help!

willdumm commented 1 year ago

I'm not very familiar with the workings of SAMM, but hopefully I can help answer your questions about gctree.

I am more interested in the mutability likelihood than the one computed by default after inferring the parameters of the Galton branching process.

It is possible to change how trees are ranked by passing three ranking coefficients to the --ranking-coeffs option of gctree infer. Mutability parsimony corresponds to the second coefficient. Unfortunately, the coefficient for log likelihood is fixed at -1, but you could do something like --ranking-coeffs 0 1000 0 to make mutability parsimony dominate the ranking.

Is there a way to parse the .p file into single trees that could be processed by SAMM

The inference pipeline outputs top-ranked trees as pickled CollapsedTree objects to a file ending with .inference.<n>.p. A CollapsedTree contains an ete3 tree stored in the attribute tree.
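
For example, something like this (a minimal sketch; the file name is whatever your inference run produced, e.g. gctree.out.inference.1.p):

import pickle

with open("gctree.out.inference.1.p", "rb") as fh:
    ctree = pickle.load(fh)   # a gctree CollapsedTree
ete_tree = ctree.tree         # the underlying ete3 tree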

If you want access to all the parsimony trees that are being ranked, not just the best ones with respect to ranking criteria, you can extract trees from the pickled CollapsedForest (which should be output in a file ending in .inference.parsimony_forest.p) by iterating over it, e.g. for ctree in forest. ctree will then be a CollapsedTree instance.

Depending on your data, the pickled forest may contain too many trees to iterate over. If this is the case, there are ways to filter them efficiently according to criteria you care about (although this is getting into internals and won't be straightforward), or you can call CollapsedForest.sample_tree(), which will return a random CollapsedTree from the forest.
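
For instance, a rough sketch (assuming a Python 3 environment with gctree installed and a forest small enough to iterate over; do_something is just a placeholder for your own analysis):

import pickle

with open("gctree.out.inference.parsimony_forest.p", "rb") as fh:
    forest = pickle.load(fh)   # a gctree CollapsedForest

# iterate over every parsimony tree in the forest ...
for ctree in forest:
    do_something(ctree)        # ctree is a CollapsedTree

# ... or just draw one tree at random
random_ctree = forest.sample_tree()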

use the gctree infer mutability options without computing the two Galton parameters (my trees are quite large, and this step can be time intensive)?

Unfortunately it's not possible to run gctree infer without fitting branching process parameters. However, the pickled forest output by phylip_parse can be sampled from and iterated over as described above.

By the way, is there a difference between the mutability likelihoods computed by SAMM and gctree?

Yes. Gctree uses something we're calling mutability parsimony, which is a sum over tree branches, and over sites within each branch, of the negative log of the mutability of the central base in its 5-mer in the parent sequence times the targeting probability of the child base given that 5-mer. Sites without mutations contribute no weight to this score; it attempts to give higher weight to mutations that are less likely according to a mutation model. It isn't really a likelihood, though. What SAMM does is more principled.
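
In symbols, my reading of that description is roughly the following (my own notation, not taken from the code). For a tree T with branches from parent sequence u to child sequence v, the score is

S(T) = \sum_{(u \to v) \in T} \; \sum_{i \,:\, u_i \neq v_i} -\log\!\big[ \, \mu\big(m_i(u)\big) \cdot \Pr\big(v_i \mid m_i(u)\big) \, \big]

where m_i(u) is the 5-mer centered at site i of the parent sequence u, \mu is its mutability, and \Pr(v_i \mid m_i(u)) is the targeting probability of the child base at that site; the inner sum runs only over mutated sites.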

I hope this helps, but let me know if you have more questions.

matsen commented 1 year ago

Thanks, @willdumm.

@Lan-h , we are still exploring these various flavors of likelihoods. I don't think that mutability parsimony is going to be our eventual solution, so I'd advise waiting on using this feature.

matsen commented 1 year ago

Oh, and would you mind describing your use case, @Lan-h ?

Lan-h commented 1 year ago

Thank you very much for these quick replies!

@matsen, I chose mutability-based ranking of trees following the BCR tree benchmarking paper published in JI in 2018, but maybe the methods have been updated since then... For the moment, I'm preprocessing the sequences and generating the parsimony forest with gctree, which has convenient functions for this. I would like to infer trees both for regular GCs, in which case the trees can be computed quickly, and in lymphomas, where the neoplastic clone can feature hundreds of unique sequences. I need a single tree that will be used for another analysis. I can provide an example fasta file if needed.

@willdumm, I tried both using the ranking coefficients within gctree and switching to SAMM. Here are the command lines I used to rank trees according to mutability with gctree:

gctree infer outfile abundances.csv --root Ancestor --frame 1 --idmap idmap.txt --idlabel --verbose

gctree infer gctree.out.inference.parsimony_forest.p --frame 1 --idmap idmap.txt --idlabel --mutability $mutability_file --substitution $substitution_file --ranking_coeffs 0 1000 0 --outbase newranking --tree_stats --verbose

I only get one tree (gctree.out.inference.1.nk), even though several of them are mentioned in the --tree_stats output newranking.tree_stats.csv. Also, the --tree_stats output is incompatible with the generation of single-tree files (.nk, .svg, .p ...); I do not know whether this is some kind of bug. Do you know how to get both the likelihood csv file and all the single trees (not just one)? I think that in a former version (which I haven't used myself), this could be done by disabling the --quick parameter, but I could not find an equivalent in the new version.

Concerning the switch to SAMM, I tried to read the .p file as in the post above:

"If you want access to all the parsimony trees that are being ranked, not just the best ones with respect to ranking criteria, you can extract trees from the pickled CollapsedForest (which should be output in a file ending in .inference.parsimony_forest.p) by iterating over it, e.g. for ctree in forest. ctree will then be a CollapsedTree instance."

Here is the code that I'm using within a python script:

import pickle

from samm_rank import likelihood_of_tree_from_shazam

# mutability_file and substitution_file are paths defined earlier in the script
with open("gctree.out.inference.parsimony_forest.p", 'rb') as fh:
    forest = pickle.load(fh)  # this is where I get "unsupported pickle protocol"
    for ctree in forest:
        print ctree
        likelihood = likelihood_of_tree_from_shazam(ctree, mutability_file, substitution_file)
        print likelihood

This was not enough to access the content of the .p file, and I got pickle protocol error messages. It might be due to incompatibilities between the Python versions used by SAMM and gctree; I'll try to figure out a way to make it work.

I remain at your disposal for any further information.

willdumm commented 1 year ago

@Lan-h Sorry for the delay in responding. I suspect you were indeed unable to unpickle the parsimony forest due to Python version issues. I'm curious whether you have any trouble loading the pickled forest in a Python 3.9 environment with the latest version of gctree installed. Gctree is Python 3 only, and SAMM is Python 2 only, so to export a tree from gctree to SAMM you'll likely need to write it to newick and re-load it in the Python 2 environment in which you're running SAMM. The CollapsedTree object contains an attribute tree which points to an ete3.Tree object, and the ete3 package makes it easy to export newick strings.
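
For example, something like this in the Python 3 environment where gctree is installed (just a sketch; the output file name is arbitrary, and depending on what SAMM needs you may want to export additional node information as well):

import pickle

# load one of the top-ranked trees written by gctree infer
with open("gctree.out.inference.1.p", "rb") as fh:
    ctree = pickle.load(fh)

# ctree.tree is an ete3 tree; write a newick string that can be re-read
# in the Python 2 environment where SAMM runs
newick = ctree.tree.write(format=1)   # format 1 keeps internal node names
with open("gctree.out.inference.1.for_samm.nk", "w") as out:
    out.write(newick)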

I only get one tree (gctree.out.inference.1.nk), even though several of them are mentioned in the --tree_stats output newranking.tree_stats.csv. Also, the --tree_stats output is incompatible with the generation of single-tree files (.nk, .svg, .p ...); I do not know whether this is some kind of bug.

I'm not sure what you mean by incompatible, but the --tree_stats output describes the ranking of all the trees that gctree ranked, or equivalently all the trees in the pickled parsimony forest that you describe trying to load. However, the individual pickled tree files output by gctree infer are only the top-ranked trees, so it is expected that only one such individual tree would be written.

Let me know if you have any other questions. I'll be more responsive next time.

Lan-h commented 1 year ago

Thank you for your answer.

I'm running the script either from the Docker image from the tutorial (https://matsengrp.github.io/gctree/install.html) or from a custom Docker image with both Python versions, SAMM, gctree, etc.

What I meant by incompatible is that when I add the --tree_stats option, I only get the tree_stats.csv output (and no gctree.out.inference files: .p, .svg, .nk, .fasta), even though it mentions 40 trees with similar scores. I have to launch it again without the --tree_stats option to be able to get the other outputs. This is not my main issue, though.

I am really surprised to get only one tree, since many of them are equally likely according to tree_stats.csv. Also, I am running some tests on a dataset on which a colleague ran gctree 2 years ago and obtained 15 different gctree.out.inference files. I wonder if there has been a major change in the gctree package since then. According to the tutorial, "If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly". Is there a way to extend this list, and what is the score range that is allowed for these trees to be sampled?

willdumm commented 1 year ago

What I meant by incompatible is that when I add the --tree_stats option, I only get the tree_stats.csv output (and no gctree.out.inference files: .p, .svg, .nk, .fasta), even though it mentions 40 trees with similar scores. I have to launch it again without the --tree_stats option to be able to get the other outputs.

This is definitely a bug. I'll be sure to fix it in the next release.

I am really surprised to get only one tree, since many of them are equally likely according to tree_stats.csv. Also, I am running some tests on a dataset on which a colleague ran gctree 2 years ago and obtained 15 different gctree.out.inference files. I wonder if there has been a major change in the gctree package since then. According to the tutorial, "If more than one parsimony tree is optimal, then up to ten optimal trees will be sampled randomly". Is there a way to extend this list...?

Ah yes, I see now what you mean. Since version 4, gctree uses a new internal representation of collections of trees which makes it very efficient to find the optimal trees. This representation also sometimes finds many more maximally parsimonious trees than dnapars does, which means there can be thousands of trees that all maximize the likelihood. In order to avoid outputting possibly thousands of equally optimal trees, we switched to sampling one tree from each optimal topology class.

In your case, this means that although tree_stats.csv shows maybe 40 trees that maximize the likelihood, all 40 of those trees have the same topology (they differ only in their ancestral sequence reconstructions) and only one is randomly sampled and saved. Unfortunately there is no option to save all of those trees. Currently the easiest (although not generally very efficient) way to recover all of them is to iterate through the trees in the pickled parsimony forest and filter them by likelihood, calling the ll method on each one with the branching process parameters that were fitted during inference; see the sketch below. A flag could be added to do this more easily, and I'll make an issue to do so.
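
A rough, untested sketch of that filtering (my assumptions: p and q are the branching process parameters reported during inference, ll accepts them directly as ctree.ll(p, q), and depending on the gctree version it may return the log likelihood alone or together with its gradient):

import pickle

p, q = 0.5, 0.4   # placeholders: use the parameters fitted during your run

with open("gctree.out.inference.parsimony_forest.p", "rb") as fh:
    forest = pickle.load(fh)

best_trees, best_ll = [], float("-inf")
for ctree in forest:
    ll = ctree.ll(p, q)
    if isinstance(ll, tuple):      # some versions return (log likelihood, gradient)
        ll = ll[0]
    if ll > best_ll + 1e-6:        # strictly better: start a new list
        best_trees, best_ll = [ctree], ll
    elif ll > best_ll - 1e-6:      # a tie (up to numerical tolerance): keep it too
        best_trees.append(ctree)

print(len(best_trees), "trees attain the maximum log likelihood", best_ll)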

If you'd like to recover the previous gctree behavior immediately, just revert to a version before v4.0.0.

..., and what is the score range that is allowed for these trees to be sampled?

Only trees that optimize the branching process likelihood (or whatever ranking criteria you're using) will be output at the end of inference. If you want to retain trees which have suboptimal branching process likelihood, you will need to filter them one-by-one by iterating through the pickled forest object.

I hope this is all clear, let me know if you have any more questions!