Feature: Alternate Approaches/Optimizations for Tree Reconstruction Algorithms

mmore500 / hstrat

hstrat enables phylogenetic inference on distributed digital evolution populations


    from __future__ import annotations

    from collections.abc import Callable
    from random import randint
    from time import perf_counter
    import sys
    from typing import Any

    import numpy as np

    from hstrat import hstrat
    from hstrat.genome_instrumentation import HereditaryStratigraphicColumn
    from hstrat.phylogenetic_inference.tree._impl._build_trie_from_artifacts import (
        MatrixColumn,
        build_trie_from_artifacts,
        build_trie_from_artifacts_matrix,
        build_trie_from_artifacts_progressive,
        build_trie_from_artifacts_cpp_sync,
    )


    class Genome:
        __slots__ = ["annotation", "data"]

        def __init__(
            self,
            data: list[int] | None = None,
            annotation: HereditaryStratigraphicColumn | None = None,
        ) -> None:
            self.data = data or [randint(1, 100)]
            self.annotation = annotation or HereditaryStratigraphicColumn(
                stratum_retention_policy=hstrat.recency_proportional_resolution_algo.Policy(
                    8
                ),
                stratum_differentia_bit_width=16,
            )

        def get_descendant(self) -> Genome:
            return Genome(
                self.data + [randint(1, 100)], self.annotation.CloneDescendant()
            )

        @property
        def score(self) -> int:
            # weighted average, more recent preferred
            return sum(i * x for i, x in enumerate(self.data[:10], start=1)) // 45


    def simulate_evolution(
        parents: list[Genome], *, generations: int, carrying_capacity: int
    ) -> list[Genome]:
        for _ in range(generations):
            # each parent produces three offspring per generation
            children = sum(
                [[p.get_descendant() for p in parents] for _ in range(3)], []
            )
            # keep only the highest-scoring individuals
            children.sort(key=lambda x: x.score)
            parents = children[-carrying_capacity:]
        return parents


    if __name__ == "__main__":
        print("Starting evolutionary simulation...")
        generations = 20
        if "--generations" in sys.argv:
            generations = int(sys.argv[sys.argv.index("--generations") + 1])
        carrying_capacity = 100
        if "--carrying-capacity" in sys.argv:
            carrying_capacity = int(
                sys.argv[sys.argv.index("--carrying-capacity") + 1]
            )

        start_pop = [Genome()]
        # pass the parsed command-line values through to the simulation
        evolved = simulate_evolution(
            start_pop,
            generations=generations,
            carrying_capacity=carrying_capacity,
        )
        extant_population = [x.annotation for x in evolved]
        assemblage = hstrat.pop_to_assemblage(extant_population)
        ranks = assemblage._assemblage_df.index.to_numpy().astype(np.uint64)
        differentia = assemblage._assemblage_df.to_numpy().astype(np.uint64)
        pop = [
            tuple(zip(*art.IterRankDifferentiaZip())) for art in extant_population
        ]
        taxon_labels = list(map(str, range(len(pop))))

        print("Benchmarking...")
        # name -> (function, positional args, keyword args)
        methods = {
            "c++": (build_trie_from_artifacts_cpp_sync, (pop, taxon_labels), {}),
            "cpp": (build_trie_from_artifacts_cpp_sync, (pop, taxon_labels), {}),
            "matrix": (
                build_trie_from_artifacts_matrix,
                (
                    ranks,
                    differentia,
                    extant_population[0]._stratum_differentia_bit_width,
                    [*range(len(pop))],
                ),
                {},
            ),
            "normal": (
                build_trie_from_artifacts,
                (extant_population, taxon_labels, False, lambda x: x),
                {},
            ),
            "progressive_synchronous": (
                build_trie_from_artifacts_progressive,
                (extant_population, taxon_labels),
                {"multiprocess": False},
            ),
            "progressive_asynchronous": (
                build_trie_from_artifacts_progressive,
                (extant_population, taxon_labels),
                {"multiprocess": True},
            ),
        }

        if "--methods" in sys.argv:
            funcs = sys.argv[sys.argv.index("--methods") + 1:]
        else:
            funcs = methods.keys() - {'cpp'}

        if "--matrix-dry" in sys.argv:
            f, args, _ = methods['matrix']
            f(*args)

        for name in funcs:
            for title, (f, args, kwargs) in methods.items():
                if name in title:
                    # time 10 repeated builds for each selected method
                    start = perf_counter()
                    for _ in range(10):
                        f(*args, **kwargs)
                    print(
                        f"{title.replace('_', ' ').title()}: "
                        f"{perf_counter() - start:.2f}s"
                    )

v1 goals/clean-up before merging:

Easy to fix:

[x] Speed up data transfer from C++ to Python by converting some of the std::vectors to py::array_ts
[x] Test the new implementation of indistinguishable-child consolidation, meant to replace std::unordered_map, for correctness
- Reverted to old implementation that used std::unordered_map
[ ] Add docstring-like comments to C++ functions
[x] Fixup/flesh out the help message for the python3 -m hstrat.dataframe.surface_unpack_reconstruct command
[x] Use std::chunk_by in main loop
- Determined not to be feasible (or too awkward to implement) with the py::array_t interface
[x] Make cppimport an optional dependency
- [x] With cppimport optional, loading must not fail when it is absent
[x] Use Python logging instead of C++ logging
[x] Add tests for the C++ searchtable approach
[x] Expose another C++ function that builds from a vector of vectors rather than a contiguous py::array_t (for use within the library in the build_tree_searchtable function)

[x] Fix compile warnings (ensure that -Wall and other warning flags are set, too)

/usr/local/lib/python3.12/dist-packages/hstrat/phylogenetic_inference/tree/.rendered.build_tree_searchtable_cpp.cpp: In function ‘Records build_trie_searchtable(const pybind11::array_t<long unsigned int>&, const pybind11::array_t<long unsigned int>&, const pybind11::array_t<long unsigned int>&, const pybind11::array_t<long unsigned int>&, std::optional<pybind11::handle>)’:
/usr/local/lib/python3.12/dist-packages/hstrat/phylogenetic_inference/tree/.rendered.build_tree_searchtable_cpp.cpp:443:27: warning: comparison of integer expressions of different signedness: ‘u64’ {aka ‘long unsigned int’} and ‘pybind11::ssize_t’ {aka ‘long int’} [-Wsign-compare]
  443 |         for (u64 i = 1; i < ranks.size(); ++i) {
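
The checklist items on making cppimport optional and on using Python logging can be illustrated together. The following is a minimal sketch, not hstrat's actual loader: the helper name `load_optional` and the convention of returning `None` on failure are assumptions made for illustration; only `importlib.import_module`, `ImportError`, and the `logging` module are standard.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Hypothetical helper: import a module if available, else return None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # the dependency is optional, so a missing install must not be fatal
        logger.info("optional module %r unavailable; using fallback", module_name)
        return None


# a module name that is certainly absent stands in for a missing optional install
missing = load_optional("definitely_not_installed_module_xyz")
# a standard-library module stands in for a dependency that is present
present = load_optional("json")
```

A caller would then check the result for `None` and fall back to a pure-Python code path when the compiled module cannot be loaded.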

Harder to fix:

[x] Determine the optimal way to export data from C++ to Python, probably by working with a map of py::array_t
[ ] Get CI passing
[ ] Create a (more than smoke) test for the end-to-end reconstruction process
[ ] Fix documentation build on readthedocs
[x] Hook up C++ approach to end-to-end reconstruction process

v2 future goals after merging:

[ ] Determine which detach_search_parent calls are necessary
- Remove one, run the tests, and see if they fail or not
[ ] Consider adding comments and asserts for ChildrenIterator invalidation
[ ] Determine whether {rank, differentia} is needed as the std::unordered_map key, or whether differentia alone suffices
[ ] Make calls to consolidate_trie smarter (potentially only calling when the rank changes)
[ ] Wrap some internals of the C++ code with Python to unit test them
[ ] Figure out distribution, probably having an action that builds wheels automatically
- For now, wrap the C++ module loads in a try/except
[ ] Refactor code for better maintainability
[ ] Look into creating a multiprocessing approach for the consolidated tree-building algorithm by splitting work into several subtrees and merging them
[x] Consider alternate data representation formats to maximize cache locality and minimize abstraction cost
[x] Add a progress bar for the C++ implementation
- Consider taking in a tqdm object and advancing it manually
[ ] Add [[likely]] and [[unlikely]] attributes so the compiler can optimize for the expected branch
[ ] Remove the need for copying the out data on the C++ to Python end
- Either operate directly on py::array_t objects, or figure out how to tie the memoryview object's lifetime (and therefore the Records object's lifetime) to the array returned by np.frombuffer
[ ] Speed up the "looking up ingest times..." step
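
The progress-bar item above suggests taking in a tqdm object and advancing it manually. A minimal Python sketch of that pattern follows, assuming only that the progress object exposes an `update(n)` method, as tqdm bars do. `CountingBar` and `process_in_batches` are hypothetical stand-ins for illustration, not hstrat code, and the squaring step is placeholder work.

```python
from typing import List, Optional, Protocol


class SupportsUpdate(Protocol):
    """Anything with a tqdm-style update(n) method."""

    def update(self, n: int = 1) -> object: ...


class CountingBar:
    """Hypothetical stand-in for a tqdm bar; records how far it was advanced."""

    def __init__(self, total: int) -> None:
        self.total = total
        self.n = 0

    def update(self, n: int = 1) -> None:
        self.n += n


def process_in_batches(
    items: List[int],
    batch_size: int,
    progress: Optional[SupportsUpdate] = None,
) -> List[int]:
    """Process items batch by batch, advancing the bar once per batch."""
    results: List[int] = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(x * x for x in batch)  # placeholder work
        if progress is not None:
            # one update per batch rather than per item keeps call overhead low
            progress.update(len(batch))
    return results


bar = CountingBar(total=10)
squares = process_in_batches(list(range(10)), batch_size=4, progress=bar)
```

Because the function depends only on the `update` method, a real tqdm bar could presumably be passed in place of `CountingBar`, and passing `None` disables progress reporting entirely.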
