pytransitions / transitions

A lightweight, object-oriented finite state machine implementation in Python with many extensions

Cache state trees to improve performance of large HSMs #416

Closed: thedrow closed this 3 years ago

thedrow commented 4 years ago

The following patch caches state trees. I'm not 100% sure but they seem pretty immutable to me. If not, we'll invalidate the cache when appropriate.

thedrow commented 4 years ago

Can you think of a situation where the state tree changes?

thedrow commented 4 years ago

If we were to use tuples instead of lists to store states, the error would go away.
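(Presumably the error referred to here is the TypeError raised when an unhashable list is used as a dictionary key; a minimal illustration, not code from the patch:)

cache = {}
cache[('A_1', 'A_2')] = 'tree'        # tuples are hashable -> valid dict key
try:
    cache[['A_1', 'A_2']] = 'tree'    # lists are not hashable
except TypeError as err:
    print(err)                        # unhashable type: 'list'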

coveralls commented 4 years ago


Coverage increased (+0.007%) to 98.568% when pulling dd4302e394ec0d62b5761208f4f9e93fbe7909e1 on thedrow:cache-state-tree into ddd0709af13e007c9cd8dd56fda77dbc51f90f60 on pytransitions:master.

thedrow commented 4 years ago

Hopefully calculating the cache key is less expensive than building the state tree. I don't understand why we use lists in the first place. Are they mutated somewhere?

thedrow commented 4 years ago

Here's a very unscientific benchmark based on TestParallel::test_multiple_deeper and time.time():

before caching - 0.004698991775512695
after caching - 0.0015370845794677734

without caching - 0.002394437789916992, 0.0014543533325195312

If we could avoid constructing the cache key, maybe we'd see some difference.

aleneum commented 4 years ago

> If we were to use tuples instead of lists to store states, the error would go away.

Tuples might also work. The state is only read and assigned to models and, AFAIR, not edited in place.

> Can you think of a situation where the state tree changes?

Yeah, the state tree can change when parallel states are involved. For instance:

[A_1, [A_2_a, A_2_b]]

Let's assume all of these states have a trigger with the same name. The evaluation order is ['A_2_a', 'A_2_b', 'A_1']. Now suppose the transition in A_2_a actually exits A_2 (A_2 -> A_3). After the evaluation of A_2_a, the state has changed to [A_1, A_3]. This invalidates A_2_b but not A_1.
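For readers unfamiliar with the notation: with parallel states, model.state becomes a nested list of active leaf states. A minimal sketch (hypothetical configuration; the printed shape is what I would expect on a recent transitions version) that mirrors the example above:

from transitions.extensions.nesting import HierarchicalMachine as HSM

# 'A' has two parallel regions; region '2' is itself parallel
states = [{'name': 'A', 'parallel': [
    '1',
    {'name': '2', 'parallel': ['a', 'b']},
]}]

m = HSM(states=states, initial='A')
print(m.state)  # expected: ['A_1', ['A_2_a', 'A_2_b']]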

aleneum commented 4 years ago

This is why the tree currently has to be rebuilt here as well.

aleneum commented 4 years ago

But is caching really necessary? I would assume that tree building and conversion only have a significant impact when models have a large number of parallel states.

aleneum commented 4 years ago

The fact that states are truly nested now has some significant impact on the execution time:

from transitions.extensions.nesting import HierarchicalMachine as HSM            # new nested implementation
from transitions.extensions.nesting_legacy import HierarchicalMachine as OldHSM  # legacy 'flat' implementation
import timeit

child_names_letters = ['a', 'b', 'c', 'd']
child_names_numbers = ['1', '2', '3', '4', '5']

states = [
    'A',
    'B',
    {'name': 'C', 'children': child_names_letters},
    {'name': 'D', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
    {'name': 'E', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'F', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'G', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
]

transitions = [
    ['go', 'A', 'B'],
    ['go', 'B', 'C_a'],
    ['go', 'C', 'D_e_3'],
    ['go', 'D_e_3', 'E_e_6_d'],
    ['go', 'E_e', 'F_e_4'],
    ['go', 'F', 'G_e_3'],
    ['go', 'G', 'A'],
    ['reset', '*', 'A'],
    ['end', '*', 'G'],
    ['loop', 'B', '='],
]

m1 = HSM(states=states, transitions=transitions, initial='A')
m2 = OldHSM(states=states, transitions=transitions, initial='A')

def test_m1():
    m1.go()

def test_m2():
    m2.go()

print(timeit.timeit("test_m1()", setup="from __main__ import test_m1", number=100000))
# >>> 9.019239799999923
print(timeit.timeit("test_m2()", setup="from __main__ import test_m2", number=100000))
# >>> 3.666767300000174

In the old state machine, all states were collected in one 'flat' state dictionary. Now the HSM has to iterate through multiple nested state collections. Model triggers are also no longer connected to events directly (tying them to the Machine is now necessary to 'fix' the order of event execution).
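As an illustrative sketch (not the library's actual internals) of why this costs more: the legacy machine can resolve any state name with a single dictionary lookup, while the nested machine has to split the name and perform one lookup per nesting level.

flat_states = {'D_e_3': 'state'}                  # legacy: one flat dict
nested_states = {'D': {'e': {'3': 'state'}}}      # nested: one dict per level

def get_flat(name):
    return flat_states[name]                      # single hash lookup

def get_nested(name, separator='_'):
    node = nested_states
    for part in name.split(separator):            # split + lookup per level
        node = node[part]
    return node

assert get_flat('D_e_3') == get_nested('D_e_3')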

aleneum commented 4 years ago

Without nested states, the difference is still significant:

from transitions.extensions.nesting import HierarchicalMachine as HSM
from transitions.extensions.nesting_legacy import HierarchicalMachine as OldHSM
import timeit

states = [
    'A',
    'B',
    'C',
    'D',
    'E',
    'F'
]

m1 = HSM(states=states, ordered_transitions=True, initial='A')
m2 = OldHSM(states=states, ordered_transitions=True, initial='A')

def test_m1():
    m1.next_state()

def test_m2():
    m2.next_state()

print(timeit.timeit("test_m1()", setup="from __main__ import test_m1", number=100000))
# >>> 4.350121200000103
print(timeit.timeit("test_m2()", setup="from __main__ import test_m2", number=100000))
# >>> 1.6667865999997957
# simple Machine has 1.1292143000000578

aleneum commented 4 years ago

For the last example, a quick-and-dirty removal of _build_state_list, _build_state_tree, and several loops brings the execution time down to roughly 3.233379299999797.

aleneum commented 4 years ago

Hmm, I checked out the PR and ran the above test:

# HSM (with cache): 4.471130799999628
# OldHSM: 1.6707495999999082

thedrow commented 4 years ago

My initial thought was that these trees are constant or mostly constant, and that recreating those nested dictionaries has a cost. Either I was wrong and this is not a good opportunity for optimization, or the cache key construction takes roughly the same amount of time.

If we were to store states as tuples, that would work. Any suggestions on how to do that?
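One conceivable approach (a sketch, not the change that was actually made): recursively freeze the nested state lists into tuples before they are stored or used as cache keys; the helper name freeze is hypothetical.

def freeze(state):
    # turn ['A_1', ['A_2_a', 'A_2_b']] into ('A_1', ('A_2_a', 'A_2_b')),
    # which is hashable and therefore usable as a dict cache key
    if isinstance(state, list):
        return tuple(freeze(item) for item in state)
    return state

assert freeze(['A_1', ['A_2_a', 'A_2_b']]) == ('A_1', ('A_2_a', 'A_2_b'))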

thedrow commented 4 years ago

Here's a more scientific benchmark using pytest-benchmark:

benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=6 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------
Name (time in us)          Min                 Max                Mean            StdDev              Median               IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_old_hsm           58.2068 (1.0)       59.0170 (1.0)       58.4350 (1.0)      0.3337 (1.0)       58.3513 (1.0)      0.3148 (1.0)           1;1       17.1130 (1.0)           5      103673
test_hsm              117.5868 (2.02)     123.1212 (2.09)     119.6221 (2.05)     2.3646 (7.09)     118.7913 (2.04)     3.7765 (12.00)         1;0        8.3597 (0.49)          5      100000
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

aleneum commented 4 years ago

Hi @thedrow,

thank you for taking the time to improve the performance of the HSM. What do test_old_hsm and test_hsm refer to? The legacy 'flat' version vs. the nested version, or without caching vs. with caching?

thedrow commented 4 years ago

Without and with caching.

aleneum commented 4 years ago

Nice! Mind sharing your code with me, so I can try to reproduce the results?

thedrow commented 4 years ago

Sure, I'll find it.

aleneum commented 3 years ago

I gave this another try since caching sounds like a good idea.

Here is my test:

from transitions.extensions.nesting import HierarchicalMachine as HSM
import copy

child_names_letters = ['a', 'b', 'c', 'd']
child_names_numbers = ['1', '2', '3', '4', '5']

states = [
    'A',
    'B',
    {'name': 'C', 'children': child_names_letters},
    {'name': 'D', 'parallel': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
    {'name': 'E', 'children': child_names_letters + [{'name': 'e', 'parallel': child_names_numbers + [{'name': '6', 'parallel': child_names_letters}]}]},
    {'name': 'F', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'G', 'parallel': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
]

transitions = [
    ['go', 'A', 'B'],
    ['go', 'B', 'C_a'],
    ['go', 'C', 'D'],
    ['go', 'D', 'E_e'],
    ['go', 'E_e', 'F_e_4'],
    ['go', 'F', 'G'],
    ['go', 'G', 'A'],
    ['reset', '*', 'A'],
    ['end', '*', 'G'],
    ['loop', 'B', '='],
]

class NoCacheMachine(HSM):
    # baseline: always rebuild the state tree from scratch

    def _get_state_tree(self, model_states, separator):
        return self._build_state_tree(model_states, separator)

class CachedMachine(HSM):
    # hand out the cached tree directly; _state_tree_cache is provided
    # by the dev-caching branch used for this test

    def _get_state_tree(self, model_states, separator):
        try:
            tree = self._state_tree_cache[model_states]
        except KeyError:
            tree = self._state_tree_cache[model_states] = self._build_state_tree(model_states, separator)
        return tree

class CachedDeepCopyMachine(HSM):
    # like CachedMachine, but return a deep copy so the cached tree
    # cannot be mutated during transitions

    def _get_state_tree(self, model_states, separator):
        try:
            tree = self._state_tree_cache[model_states]
        except KeyError:
            tree = self._state_tree_cache[model_states] = self._build_state_tree(model_states, separator)
        return copy.deepcopy(tree)

m_cached = CachedMachine(states=states, transitions=transitions, initial='A')
m_uncached = NoCacheMachine(states=states, transitions=transitions, initial='A')
m_deep = CachedDeepCopyMachine(states=states, transitions=transitions, initial='A')

def test_uncached(benchmark):
    benchmark(m_uncached.go)

def test_cached(benchmark):
    benchmark(m_cached.go)

def test_deepcopy(benchmark):
    benchmark(m_deep.go)

I used the dev-caching branch, where I altered model.state to be tuples. This should make caching faster because lists don't have to be converted into tuples anymore. Additionally, the test heavily uses parallel and nested states, which should favour caching. For the sake of comparability, I have overridden _get_state_tree for all tests, even though _get_state_tree in the HSM already uses caching on the aforementioned branch.

Running the code like this yields the following result on my laptop:

-------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in us)         Min                 Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cached           45.8000 (1.0)      851.6000 (1.07)     132.5448 (1.0)      70.7601 (1.0)      118.7000 (1.0)      104.0250 (1.0)       2146;52        7.5446 (1.0)        6549           1
test_uncached         52.4000 (1.14)     799.3000 (1.0)      144.7823 (1.09)     75.8492 (1.07)     125.7000 (1.06)     135.7000 (1.30)      2750;19        6.9069 (0.92)       6431           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

CachedMachine is roughly 10 percent faster, which is nice. Returning trees straight from _state_tree_cache is not enough, though, because these trees are altered during state transitions: a transition from A to D mutates the cached tree in place, so _state_tree_cache[('A',)] ends up holding the tree for 'D'. A deepcopy of the cached tree, however, has a huge impact, as shown with CachedDeepCopyMachine:

--------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------
Name (time in us)         Min                   Max                Mean             StdDev              Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cached           46.7000 (1.0)      1,050.7000 (1.17)     138.7820 (1.0)      72.7869 (1.0)      124.1000 (1.0)      115.7000 (1.0)       3469;68        7.2055 (1.0)       10858           1
test_uncached         53.3000 (1.14)       895.2000 (1.0)      150.3704 (1.08)     79.1206 (1.09)     130.7000 (1.05)     137.1500 (1.19)        893;8        6.6502 (0.92)       2236           1
test_deepcopy         70.9000 (1.52)     1,029.8000 (1.15)     191.6219 (1.38)     95.5599 (1.31)     164.3000 (1.32)     194.7000 (1.68)      3418;19        5.2186 (0.72)       6994           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

With the deepcopy, caching becomes even slower than uncached processing (by roughly 20 percent). If there is another way to store 'template' trees without the need for a deepcopy, caching might still be useful, though. Any ideas?
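To make the aliasing problem concrete, here is a minimal standalone sketch (hypothetical get_tree helper, not library code) of how handing out the cached object itself lets a transition corrupt the cache entry:

from collections import OrderedDict

cache = {}

def get_tree(key):
    # returns the cached object itself -- no defensive copy
    if key not in cache:
        cache[key] = OrderedDict(A=OrderedDict())
    return cache[key]

tree = get_tree(('A',))
tree.clear()
tree['D'] = OrderedDict()   # a transition rewrites the tree in place
print(cache[('A',)])        # OrderedDict([('D', OrderedDict())]) -- poisoned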

thedrow commented 3 years ago

Try invalidating the entire cache whenever the tree is altered.
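A hedged sketch of what that suggestion could look like, reusing the CachedMachine pattern from above; it assumes the dev-caching branch's _state_tree_cache exists and that set_state is a sufficient invalidation hook, both of which would need verification:

class InvalidatingCacheMachine(HSM):

    def _get_state_tree(self, model_states, separator):
        try:
            return self._state_tree_cache[model_states]
        except KeyError:
            tree = self._state_tree_cache[model_states] = self._build_state_tree(model_states, separator)
            return tree

    def set_state(self, state, model=None):
        # any state change may have mutated a previously returned tree,
        # so drop every cached entry instead of deep-copying on each hit
        self._state_tree_cache.clear()
        super().set_state(state, model)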