phylogenetic distances with gigantic tree

b-tierney commented 5 years ago

Hi,

I have a massive tree (upwards of 100K microbes) and am concerned about memory limitations if I were to try to extract a phylogenetic distance matrix for all taxa in it. I'm hoping to use a subsampling strategy, looking at only ~1000 taxa at a time. Do you have any advice on how to extract pairwise distances for just a subset like this?

Thanks so much in advance.

jeetsukumaran commented 5 years ago

You could apply any of the "prune" or "keep" methods to trim the tree before calculating the distance matrix.

However, the bigger picture might change the approach. Do you plan to calculate the matrix for a fixed subset or a random subset, once or multiple times? The answer might change the strategy, especially if you find that, depending on your resources, calculating the distance matrix does not actually take that long relative to reading the tree.
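For a rough sense of the memory concern raised in the question, here is a back-of-the-envelope sketch. It assumes a dense square matrix of 8-byte floats, which is an assumption for illustration and not how DendroPy stores distances internally (Python objects carry more overhead). The function name `dense_matrix_bytes` is hypothetical.

```python
def dense_matrix_bytes(n_taxa, bytes_per_value=8):
    """Bytes needed for a dense n_taxa x n_taxa matrix (assumes 8-byte floats)."""
    return n_taxa * n_taxa * bytes_per_value


full = dense_matrix_bytes(100_000)   # all ~100K taxa
subset = dense_matrix_bytes(1_000)   # one ~1000-taxon subsample

print(f"full matrix:   {full / 1e9:.0f} GB")    # 80 GB
print(f"subset matrix: {subset / 1e6:.0f} MB")  # 8 MB
```

The quadratic growth is why a ~1000-taxon subsample is cheap even when the full matrix is not.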


--

Jeet Sukumaran

Assistant Professor Biology Department San Diego State University

Lab: https://sukumaranlab.org/ Blog: https://jeetblogs.org/ Repositories: https://github.com/jeetsukumaran Photography: https://www.flickr.com/photos/jeetsukumaran/ Instagram: https://www.instagram.com/jeetsukumaran/ Calendar: https://goo.gl/dG5Axs

Email: jsukumaran@sdsu.edu (work) jeetsukumaran@gmail.com (personal)

Mailing Address: Biology Department, LS 262 San Diego State University 5500 Campanile Drive San Diego, CA 92182-4614

b-tierney commented 5 years ago

Wow thank you so much for the fast response.

Ah so I can prune by taxon id? That sounds like what I need.

The current plan is to generate 10-20 random subsets. You're right that it may work out if I just use a big enough VM, but I'm not running at scale yet (still writing unit tests), so the strategy can definitely change. At one point I was considering just building a brand new tree for each iteration. Being able to prune gives me a solid backup plan, though.

Let me know if you have any other ideas, I deeply appreciate it.

jeetsukumaran commented 5 years ago

You can prune by taxon object using prune_taxa

https://dendropy.org/library/treemodel.html?highlight=tree%20prune#dendropy.datamodel.treemodel.Tree.prune_taxa

or by taxon labels (which is what I presume you mean by "id") using prune_taxa_with_labels

https://dendropy.org/library/treemodel.html?highlight=tree%20prune#dendropy.datamodel.treemodel.Tree.prune_taxa_with_labels

The latter is a convenience function that builds a container of taxon objects by retrieving them from the taxon namespace by label, and then calls prune_taxa with that container, as you can see in the source here: https://dendropy.org/_modules/dendropy/datamodel/treemodel.html#Tree.prune_taxa_with_labels . If by "id" you really want to use internal id#'s (the Taxon.oid property), then you would have to implement a variant of this that targets that property.

Also note the complementary retain_taxa and retain_taxa_with_labels functions, which prune all taxa except the ones that are passed to them as an argument:

https://dendropy.org/_modules/dendropy/datamodel/treemodel.html#Tree.retain_taxa

https://dendropy.org/_modules/dendropy/datamodel/treemodel.html#Tree.retain_taxa_with_labels



b-tierney commented 5 years ago

Spectacular, thank you so much

jeetsukumaran / DendroPy