Closed: thedrow closed this issue 3 years ago.
Can you think of a situation where the state tree changes?
If we were to use tuples instead of lists to store states the error would go away.
Hopefully calculating the cache key is less expensive than the building of the state tree. I don't understand why we use lists in the first place. Are they mutated somewhere?
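As a minimal sketch of the underlying issue (the names below are illustrative, not taken from the library): a dict-based cache needs hashable keys, and lists are mutable and therefore unhashable, while tuples work directly.

```python
# Hypothetical sketch: why list-valued states block a dict-based cache.
cache = {}

list_state = ["A_1", "A_2"]
try:
    cache[list_state] = "tree"          # lists are mutable, hence unhashable
except TypeError as exc:
    print("cannot cache:", exc)

tuple_state = ("A_1", "A_2")
cache[tuple_state] = "tree"             # tuples are hashable and work as keys
print(cache[("A_1", "A_2")])
```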
Here's a very unscientific benchmark based on `TestParallel::test_multiple_deeper` and `time.time()`:

```
before caching  - 0.004698991775512695
after caching   - 0.0015370845794677734
without caching - 0.002394437789916992, 0.0014543533325195312
```
If we could avoid constructing the cache key, maybe we'd see some difference.
> If we were to use tuples instead of lists to store states the error would go away
Tuples might also work. The state is only read and assigned to models and, AFAIR, not edited in place.
> Can you think of a situation where the state tree changes?
Yeah, this state tree building comes with parallel states. For instance:

```
[A_1, [A_2_a, A_2_b]]
```

Let's assume all states have a trigger with the same name. The evaluation order is `['A_2_a', 'A_2_b', 'A_1']`. Now in `A_2_a` the transition actually exits `A_2` (`A_2 -> A_3`). So after the evaluation of `A_2_a` the state has changed to `[A_1, A_3]`. This invalidates `A_2_b` but not `A_1`.
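A toy sketch of that invalidation (plain Python, not transitions' actual code): after `A_2_a`'s trigger exits the whole parallel region `A_2`, the active configuration no longer contains `A_2_b`, so its handler must be skipped.

```python
def flatten(tree):
    """Yield leaf names of a nested state list like ['A_1', ['A_2_a', 'A_2_b']]."""
    for item in tree:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

config = ['A_1', ['A_2_a', 'A_2_b']]
processed = []
for state in ['A_2_a', 'A_2_b', 'A_1']:        # evaluation order: leaves first
    if state not in set(flatten(config)):
        continue                               # A_2_b was invalidated
    if state == 'A_2_a':
        config = ['A_1', 'A_3']                # this trigger exits A_2 entirely
    processed.append(state)

print(processed)  # ['A_2_a', 'A_1'] -- A_2_b is skipped
```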
But is caching really necessary? I would assume that tree building and conversion only have a significant impact when models have a significant number of parallel states.
The fact that states are truly nested now has some significant impact on the execution time:
```python
from transitions.extensions.nesting import HierarchicalMachine as HSM
from transitions.extensions.nesting_legacy import HierarchicalMachine as OldHSM
import timeit

child_names_letters = ['a', 'b', 'c', 'd']
child_names_numbers = ['1', '2', '3', '4', '5']
states = [
    'A',
    'B',
    {'name': 'C', 'children': child_names_letters},
    {'name': 'D', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
    {'name': 'E', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'F', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'G', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
]
transitions = [
    ['go', 'A', 'B'],
    ['go', 'B', 'C_a'],
    ['go', 'C', 'D_e_3'],
    ['go', 'D_e_3', 'E_e_6_d'],
    ['go', 'E_e', 'F_e_4'],
    ['go', 'F', 'G_e_3'],
    ['go', 'G', 'A'],
    ['reset', '*', 'A'],
    ['end', '*', 'G'],
    ['loop', 'B', '='],
]

m1 = HSM(states=states, transitions=transitions, initial='A')
m2 = OldHSM(states=states, transitions=transitions, initial='A')

def test_m1():
    m1.go()

def test_m2():
    m2.go()

print(timeit.timeit("test_m1()", setup="from __main__ import test_m1", number=100000))
# >>> 9.019239799999923
print(timeit.timeit("test_m2()", setup="from __main__ import test_m2", number=100000))
# >>> 3.666767300000174
```
In the old state machine, all states are collected in one 'flat' state dictionary. Now the HSM has to iterate through multiple nested state collections. Model triggers are also no longer connected to events directly (tying them to `Machine` is now necessary to 'fix' the order of event execution).
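To make the structural difference concrete, here is a toy sketch with plain dicts (the layouts are illustrative, not the library's actual data structures): the legacy HSM resolves a state name with a single flat lookup, while the nested HSM walks one collection per hierarchy level.

```python
flat_states = {"E_e_6_d": "state-object"}                    # legacy: flat layout

nested_states = {"E": {"e": {"6": {"d": "state-object"}}}}   # new: nested layout

def resolve_nested(collections, path):
    node = collections
    for name in path:        # one dict lookup per hierarchy level
        node = node[name]
    return node

assert flat_states["E_e_6_d"] == "state-object"              # one lookup
assert resolve_nested(nested_states, ["E", "e", "6", "d"]) == "state-object"
```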
Without nested states, the difference is still significant:
```python
from transitions.extensions.nesting import HierarchicalMachine as HSM
from transitions.extensions.nesting_legacy import HierarchicalMachine as OldHSM
import timeit

states = ['A', 'B', 'C', 'D', 'E', 'F']

m1 = HSM(states=states, ordered_transitions=True, initial='A')
m2 = OldHSM(states=states, ordered_transitions=True, initial='A')

def test_m1():
    m1.next_state()

def test_m2():
    m2.next_state()

print(timeit.timeit("test_m1()", setup="from __main__ import test_m1", number=100000))
# >>> 4.350121200000103
print(timeit.timeit("test_m2()", setup="from __main__ import test_m2", number=100000))
# >>> 1.6667865999997957
# simple Machine has 1.1292143000000578
```
For the last example, some quick and dirty removals of `_build_state_list` and `_build_state_tree` and of multiple loops bring the execution time down to roughly 3.233379299999797.
Hmm, I checked out the PR and ran the above test:

```
# HSM (with cache): 4.471130799999628
# OldHSM:           1.6707495999999082
```
My initial thought was that these trees are constant, or mostly constant, and that recreating those nested dictionaries has a cost. Either I was wrong and this is not a good opportunity for optimization, or the cache key construction takes roughly the same amount of time.
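A quick, hedged way to test that suspicion outside the library (the key-building helper below is made up for illustration and is not transitions' code):

```python
import timeit

# Measure what building a hashable cache key from a list-valued state
# costs compared to the cache hit it enables.
model_states = ["A_1", ["A_2_a", "A_2_b"]]

def make_key(states):
    # nested lists must be converted recursively before use as dict keys
    return tuple(make_key(s) if isinstance(s, list) else s for s in states)

cache = {make_key(model_states): "tree"}

key_cost = timeit.timeit(lambda: make_key(model_states), number=100_000)
hit_cost = timeit.timeit(lambda: cache[make_key(model_states)], number=100_000)
print(f"key construction: {key_cost:.4f}s, key + cache hit: {hit_cost:.4f}s")
```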
If we were to store states as tuples, that would work. Any suggestions on how to do that?
Here's a more scientific benchmark using `pytest-benchmark`:
```
benchmark: 3.2.3 (defaults: timer=time.perf_counter disable_gc=True min_rounds=5 min_time=6 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
------------------------------------------------------------------------------------- benchmark: 2 tests -------------------------------------------------------------------------------------
Name (time in us)         Min               Max              Mean            StdDev          Median             IQR             Outliers  OPS (Kops/s)    Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_old_hsm          58.2068 (1.0)     59.0170 (1.0)     58.4350 (1.0)    0.3337 (1.0)    58.3513 (1.0)    0.3148 (1.0)        1;1      17.1130 (1.0)       5      103673
test_hsm             117.5868 (2.02)   123.1212 (2.09)   119.6221 (2.05)   2.3646 (7.09)  118.7913 (2.04)   3.7765 (12.00)      1;0       8.3597 (0.49)      5      100000
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
Hi @thedrow,
thank you for taking the time to improve the performance of HSM.
What are `test_old_hsm` and `test_hsm` referring to? The legacy 'flat' version vs the embedded version, or without vs with caching?
Without and with caching.
Nice! Mind sharing your code with me, so I can try to reproduce the results?
Sure, I'll find it.
I gave this another try since caching sounds like a good idea.
Here is my test:
```python
from transitions.extensions.nesting import HierarchicalMachine as HSM
import copy

child_names_letters = ['a', 'b', 'c', 'd']
child_names_numbers = ['1', '2', '3', '4', '5']
states = [
    'A',
    'B',
    {'name': 'C', 'children': child_names_letters},
    {'name': 'D', 'parallel': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
    {'name': 'E', 'children': child_names_letters + [{'name': 'e', 'parallel': child_names_numbers + [{'name': '6', 'parallel': child_names_letters}]}]},
    {'name': 'F', 'children': child_names_letters + [{'name': 'e', 'children': child_names_numbers + [{'name': '6', 'children': child_names_letters}]}]},
    {'name': 'G', 'parallel': child_names_letters + [{'name': 'e', 'children': child_names_numbers}]},
]
transitions = [
    ['go', 'A', 'B'],
    ['go', 'B', 'C_a'],
    ['go', 'C', 'D'],
    ['go', 'D', 'E_e'],
    ['go', 'E_e', 'F_e_4'],
    ['go', 'F', 'G'],
    ['go', 'G', 'A'],
    ['reset', '*', 'A'],
    ['end', '*', 'G'],
    ['loop', 'B', '='],
]

class NoCacheMachine(HSM):
    def _get_state_tree(self, model_states, separator):
        return self._build_state_tree(model_states, separator)

class CachedMachine(HSM):
    def _get_state_tree(self, model_states, separator):
        try:
            tree = self._state_tree_cache[model_states]
        except KeyError:
            tree = self._state_tree_cache[model_states] = self._build_state_tree(model_states, separator)
        return tree

class CachedDeepCopyMachine(HSM):
    def _get_state_tree(self, model_states, separator):
        try:
            tree = self._state_tree_cache[model_states]
        except KeyError:
            tree = self._state_tree_cache[model_states] = self._build_state_tree(model_states, separator)
        return copy.deepcopy(tree)

m_cached = CachedMachine(states=states, transitions=transitions, initial='A')
m_uncached = NoCacheMachine(states=states, transitions=transitions, initial='A')
m_deep = CachedDeepCopyMachine(states=states, transitions=transitions, initial='A')

def test_uncached(benchmark):
    benchmark(m_uncached.go)

def test_cached(benchmark):
    benchmark(m_cached.go)

def test_deepcopy(benchmark):
    benchmark(m_deep.go)
```
I used the `dev-caching` branch where I altered `model.state` to be tuples. This should make caching faster because lists don't have to be converted into tuples anymore. Additionally, the test heavily uses parallel and nested states, which should favour caching. For the sake of similarity I have overridden `_get_state_tree` for all tests, even though `_get_state_tree` in `HSM` already uses caching in the aforementioned branch.
Running the code like this yields the following result on my laptop:
```
-------------------------------------------------------------------------------------- benchmark: 2 tests --------------------------------------------------------------------------------------
Name (time in us)         Min                 Max               Mean            StdDev           Median              IQR             Outliers  OPS (Kops/s)    Rounds  Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cached           45.8000 (1.0)      851.6000 (1.07)   132.5448 (1.0)    70.7601 (1.0)   118.7000 (1.0)   104.0250 (1.0)     2146;52       7.5446 (1.0)     6549           1
test_uncached         52.4000 (1.14)     799.3000 (1.0)    144.7823 (1.09)   75.8492 (1.07)  125.7000 (1.06)  135.7000 (1.30)    2750;19       6.9069 (0.92)    6431           1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
`CachedMachine` is roughly 10 percent faster, which is nice. Returning trees from `_state_tree_cache` directly is not enough, though, because these trees are altered during state transitions. A transition from `A` to `D` will cause `_state_tree_cache[('A',)] = OrderedDict('D')`. A deepcopy of the cached tree, however, has a huge impact, as shown with `CachedDeepCopyMachine`:
```
--------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------
Name (time in us)         Min                   Max               Mean            StdDev           Median              IQR             Outliers  OPS (Kops/s)    Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_cached           46.7000 (1.0)      1,050.7000 (1.17)   138.7820 (1.0)    72.7869 (1.0)   124.1000 (1.0)   115.7000 (1.0)     3469;68       7.2055 (1.0)    10858           1
test_uncached         53.3000 (1.14)       895.2000 (1.0)    150.3704 (1.08)   79.1206 (1.09)  130.7000 (1.05)  137.1500 (1.19)      893;8       6.6502 (0.92)    2236           1
test_deepcopy         70.9000 (1.52)     1,029.8000 (1.15)   191.6219 (1.38)   95.5599 (1.31)  164.3000 (1.32)  194.7000 (1.68)    3418;19       5.2186 (0.72)    6994           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
Caching becomes even slower than uncached processing (by roughly 20 percent). If there is another way to store 'template' trees without the need for a deepcopy, caching might still be useful, though. Any ideas?
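To illustrate why handing out the cached tree directly is unsafe and what the deepcopy buys, here is a minimal sketch with plain `OrderedDict`s (not the actual tree objects from the library):

```python
import copy
from collections import OrderedDict

# Without a deepcopy, the caller mutates the cached object itself during a
# transition, silently rewriting the cache entry for the *old* state.
cache = {("A",): OrderedDict(A=OrderedDict())}
tree = cache[("A",)]          # the caller receives the cached object itself
tree.clear()
tree["D"] = OrderedDict()     # transition A -> D mutates it in place
print(cache[("A",)])          # the ('A',) entry now describes state D!

# With a deepcopy the cache stays intact, but the copy is what costs ~20%.
cache = {("A",): OrderedDict(A=OrderedDict())}
tree = copy.deepcopy(cache[("A",)])
tree.clear()
tree["D"] = OrderedDict()
print(cache[("A",)])          # still the original tree for state A
```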
Try invalidating the entire cache whenever the tree is altered.
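A possible sketch of that idea; the class and method names below are invented for illustration and are not part of transitions' API. The cached tree is handed out directly (no deepcopy), and the whole cache is dropped whenever any code path alters a returned tree.

```python
from collections import OrderedDict

class InvalidatingTreeCache:
    """Hypothetical cache: cheap hits, whole-cache invalidation on mutation."""

    def __init__(self, build):
        self._build = build       # e.g. a _build_state_tree-like callable
        self._cache = {}

    def get(self, key):
        try:
            return self._cache[key]
        except KeyError:
            tree = self._cache[key] = self._build(key)
            return tree

    def invalidate(self):
        # call this from every code path that alters a returned tree
        self._cache.clear()

cache = InvalidatingTreeCache(lambda key: OrderedDict((k, OrderedDict()) for k in key))
t1 = cache.get(("A",))
assert cache.get(("A",)) is t1    # cache hit returns the same object
cache.invalidate()                # a transition mutated the tree
t2 = cache.get(("A",))            # rebuilt, not the stale object
assert t1 is not t2
```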
The following patch caches state trees. I'm not 100% sure but they seem pretty immutable to me. If not, we'll invalidate the cache when appropriate.