Closed b-tierney closed 5 years ago
You could apply any of the "prune" or "keep" methods to trim the tree before calculating the distance matrix.
However, the bigger picture might change the approach. Do you plan to calculate the matrix for a fixed subset or a random subset, once or multiple times? The answer affects the strategy, especially if you find that, depending on your resources, calculating the distance matrix does not actually take long relative to reading the tree.
On Mon, May 6, 2019 at 6:27 PM Braden Tierney notifications@github.com wrote:
Hi,
I have a massive tree (upwards of 100K microbes) and am concerned about memory limitations if I were to try to extract a phylogenetic distance matrix for all taxa in it. I'm hoping to use a subsampling strategy, looking at only ~1000 taxa at a time. Do you have any advice for how to subset only pairwise distances like this?
Thanks so much in advance.
Wow thank you so much for the fast response.
Ah so I can prune by taxon id? That sounds like what I need.
Current plan is to generate 10-20 random subsets. You're right that it may work out if I just use a big enough VM, but I'm not running at scale yet (still writing unit tests), so the strategy can definitely change. At one point I was considering just building a brand new tree for each iteration. Being able to prune gives me a solid backup plan, though.
Let me know if you have any other ideas, I deeply appreciate it.
You can prune by taxon object using prune_taxa, or by taxon labels (which is what I presume you mean by "id") using prune_taxa_with_labels. The latter is a convenience function that builds a container of taxon objects by retrieving them from the taxon namespace by label, and then calls prune_taxa with that container, as you can see in the source here:
https://dendropy.org/_modules/dendropy/datamodel/treemodel.html#Tree.prune_taxa_with_labels
If by "id" you really mean internal id numbers (the Taxon.oid property), then you would have to implement a variant of this that targets that property.
Also note the complementary retain_taxa and retain_taxa_with_labels functions, which prune all taxa except the ones passed as an argument:
https://dendropy.org/_modules/dendropy/datamodel/treemodel.html#Tree.retain_taxa_with_labels
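For the repeated random-subset case, something along these lines might work. This is only a sketch: sample_label_subsets and subset_distance_matrix are hypothetical helper names, not part of DendroPy, and the second helper assumes it is handed a dendropy.Tree. Because retain_taxa_with_labels modifies the tree in place, the sketch works on a copy each time.

```python
import random


def sample_label_subsets(labels, subset_size, n_subsets, seed=None):
    """Draw n_subsets random label subsets, each sampled without replacement.

    Hypothetical helper, plain Python; a fixed seed makes the draws repeatable.
    """
    rng = random.Random(seed)
    labels = list(labels)
    return [rng.sample(labels, subset_size) for _ in range(n_subsets)]


def subset_distance_matrix(tree, labels):
    """Return a distance matrix for only the taxa in labels.

    Hypothetical helper; assumes tree is a dendropy.Tree. The original tree
    is left untouched because the pruning is applied to a copy.
    """
    subtree = tree.clone(depth=1)  # copies the nodes, shares the taxon namespace
    subtree.retain_taxa_with_labels(labels)
    return subtree.phylogenetic_distance_matrix()


# Possible usage, given a dendropy.Tree named tree:
#   labels = [taxon.label for taxon in tree.taxon_namespace]
#   for subset in sample_label_subsets(labels, 1000, 10, seed=1):
#       pdm = subset_distance_matrix(tree, subset)
```

Copying a very large tree for every subset has its own cost, so re-reading the tree from disk per iteration may turn out to be just as reasonable; that would need to be timed on the real data.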
Spectacular, thank you so much