scikit-bio / scikit-bio

scikit-bio: a community-driven Python library for bioinformatics, providing versatile data structures, algorithms and educational resources.
https://scikit.bio
BSD 3-Clause "New" or "Revised" License
891 stars, 269 forks

`faith_pd` uses more memory and is slower than necessary in many cases #1518

Open wdwvt1 opened 7 years ago

wdwvt1 commented 7 years ago

Short summary: the runtime and memory consumption of `skbio.diversity.alpha._faith_pd` scale with the number of tips in the supplied phylogenetic tree, not with the number of features in the passed `otu_ids`. This causes significant unnecessary computation when operating on a sample whose features are a small subset of the tips in the tree. Removing tips that are not shared between the feature table and the tree helps reduce this computation, but significantly more speedup is available with the trimming strategy outlined at the end. For my dataset I needed to compute PD on 500 different subsets of the data table, making speed and memory usage important (and the current implementation of PD unusable).

Overly long summary: below is an example from a dataset I'm working on that has ~3700 samples and 460000 features. For the calculations, I removed features with a total count below 25 (mc25), leaving me with approximately 34000 features in the dataset. I then computed PD on all samples using either the unpruned tree or two reduced trees (tip and internal node counts in the table are approximate).

| internal_nodes | tips | ram_usage | time |
| --- | --- | --- | --- |
| 460000 | 460000 | 31 GB | 232 s |
| 150000 | 150000 | 15 GB | 30 s |
| 34000 | 34000 | 8 GB | 5 s |

I was not using swap for the 150000- or 34000-tip trees, but used significant swap for the 460000-tip tree.

I believe the reason for the scaling with the number of tips in the tree is the call to `skbio.diversity._util._vectorize_counts_and_tree` and its call to `skbio.diversity._phylogenetic._nodes_by_counts`. The full traversal and indexing of the tree is significantly more expensive with more tips, even when we know we won't be using the vast majority of them.
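The effect of a full traversal can be sketched with a toy postorder walk. This is plain Python, not skbio's actual code; `Node` and `postorder` here are illustrative stand-ins for the work done in `_nodes_by_counts`:

```python
# Toy model (not skbio's implementation) of why a full-tree traversal
# dominates: every node is visited, no matter how few tips the sample
# actually uses.

class Node:
    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []

def postorder(node):
    """Yield every node in the tree, children before parents."""
    for child in node.children:
        yield from postorder(child)
    yield node

# Four tips, but suppose the sample only has counts for "a".
tree = Node(children=[
    Node(children=[Node("a"), Node("b")]),
    Node(children=[Node("c"), Node("d")]),
])

visited = sum(1 for _ in postorder(tree))
print(visited)  # 7: all 4 tips, 2 internal nodes, and the root
```

With a 460000-tip tree the same all-nodes cost is paid per call, even when only 34000 tips carry counts.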

My first thought was that I could use the `TreeNode.remove_deleted` and `TreeNode.prune` functions to reduce my 460000-tip tree to a tree covering only the 34000 features in the mc25 dataset I was using.

tr = TreeNode.read('/path/to/my/tree.tre')
tips = {tip.name for tip in tr.tips()}
shared_features = tips.intersection(features_to_keep)
# Delete tips that are in the tree but absent from the filtered table.
tr.remove_deleted(lambda node: node.name in tips and
                  node.name not in shared_features)
tr.prune()

This code reduces my 460000-tip tree to a 150000-tip tree - still far more tips than are in my data table. The problem is that while unshared tips are removed, the internal nodes and branches between a removed tip and the LCA of a retained tip are left behind, and they should be removed too. The diagram below illustrates this. Our starting tree on the left has a tip for every feature in the dataset. Our pruned tree (middle, removed tips in red) lacks the tips that we've removed from our dataset (e.g. by mc25 filtering) but still contains all the internal nodes and branches between the removed tips and their LCA with retained nodes. What we'd like is the tree on the right, where nodes are retained (black) only if they are the LCA of two features in the dataset.

[image: diagram of the starting, pruned, and desired trees (screenshot, 2017-06-22)]
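The leftover-nodes behavior can be reproduced on a toy tree. Again plain Python, not skbio; `drop_tips` is a simplified stand-in for `remove_deleted`. A clade whose tips are all deleted survives as a new, nameless tip, which is one reason tip removal alone leaves far more tips than features:

```python
# Toy sketch (not skbio) of why deleting tips is not enough: an internal
# node whose children are all deleted simply becomes a new, unnamed tip.

class Node:
    def __init__(self, name=None, children=None):
        self.name = name
        self.children = children or []

    def tips(self):
        if not self.children:
            yield self
        else:
            for child in self.children:
                yield from child.tips()

def drop_tips(node, keep):
    """Remove tips whose names are not in `keep`; return the survivor."""
    if not node.children:
        return node if node.name in keep else None
    node.children = [c for c in (drop_tips(c, keep)
                                 for c in node.children) if c is not None]
    return node

tree = Node(children=[
    Node(children=[Node("a"), Node("b")]),
    Node(children=[Node("c"), Node("d")]),
])

drop_tips(tree, {"a"})
print([t.name for t in tree.tips()])  # ['a', None]: the emptied
                                      # (c, d) clade is now a tip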

The following code trims/prunes the tree in the way I'd like, and allowed me to halve the RAM usage and reduce computation time by 80%. I'm sure it could be written more intelligently to take advantage of the methods of `TreeNode`. The code itself takes a significant amount of time to run (~100 seconds on my ~920000-node tree), so it would only be valuable for large trees or repeated PD calculations, but I think we could get the runtime down.

assert tr.is_root()
tr.keep = True  # the root is always retained

for tip in tr.tips():
    if tip.name in shared_features:
        # Walk from the tip toward the root, flagging every ancestor.
        cn = tip
        while not cn.is_root():
            cn.keep = True
            cn = cn.parent
            # If we've already traversed this branch from another tip,
            # we don't need to revisit parent nodes up to the root -
            # they've already been flagged.
            if hasattr(cn, 'keep'):
                break

def _f(node):
    # Delete any node that was never flagged for keeping.
    return not getattr(node, 'keep', False)

tr.remove_deleted(_f)
tr.prune()
wasade commented 7 years ago

How does it compare to TreeNode.shear?


wdwvt1 commented 7 years ago

Hmm, feeling a little sheepish - `shear` is exactly what I want and reproduces what I got. `shear` also seems a bit faster - it takes about 80 seconds. Would it be reasonable to call `shear` in `skbio.diversity.alpha._faith_pd` if the number of tips is above some threshold?

wasade commented 7 years ago

Awesome :) And that seems reasonable. Perhaps parameterize it with `subset_tree`?
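A minimal sketch of the threshold idea. Everything here is hypothetical: `maybe_shear`, the `FakeTree` stand-in, and the 10x ratio are placeholders rather than skbio API, and profiling would need to pin down a real threshold:

```python
# Hypothetical sketch: shear only when the tree is much larger than the
# feature set, so small trees don't pay the (non-trivial) shear cost.

class FakeTree:
    """Stand-in for TreeNode, just enough for the sketch."""
    def __init__(self, names):
        self.names = names

    def tips(self):
        return iter(self.names)

def maybe_shear(tree, otu_ids, shear, ratio=10):
    """Shear only when tips outnumber otu_ids by `ratio` or more."""
    n_tips = sum(1 for _ in tree.tips())
    if n_tips >= ratio * len(otu_ids):
        return shear(tree, otu_ids)
    return tree

big = FakeTree(['t%d' % i for i in range(100)])
small = FakeTree(['a', 'b'])
sheared = maybe_shear(big, ['t0'], lambda t, ids: FakeTree(list(ids)))
kept = maybe_shear(small, ['a'], lambda t, ids: FakeTree(list(ids)))
print(len(list(sheared.tips())), len(list(kept.tips())))  # 1 2
```

In the real function the `shear` argument would simply be `TreeNode.shear`, gated by a `subset_tree` parameter.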

If you want to go blazing fast, I put together a balanced parentheses tree structure for the beta diversity decomposition for the EMP. It's pip installable. The interface is not fully fleshed out, the documentation is still a little sparse, and the nature of the structure made it awkward to mirror `skbio.TreeNode`. But it's fast and lightweight: a 1000-random-tip shear on a 7M-tip tree takes < 1 s, and resident memory is about 20x less than TreeNode, IIRC. You can go from BP to TreeNode with `bp.to_skbio_treenode`, and back to BP with `bp.from_skbio_treenode` (I think that's the method). Interestingly, `bp.to_skbio_treenode(bp.parse_newick(open('foo').read()))` is faster than the skbio parser, most likely because we can preallocate.

wdwvt1 commented 7 years ago

Thanks a lot @wasade. The bp structure looks awesome, but I think it's overkill for my particular use (though not by much).

As far as changing the behavior of `skbio.diversity.alpha._faith_pd` goes - I'll do memory and time profiling of shearing and PD calculations on a variety of tree sizes, see what threshold makes sense, and report back.