legrego / homeassistant-elasticsearch

Publish Home-Assistant events to Elasticsearch
https://legrego.github.io/homeassistant-elasticsearch/
MIT License

Use lru cache for datastream and attribute normalization #294

Closed · strawgate closed this 2 weeks ago

strawgate commented 2 weeks ago

Part of https://github.com/legrego/homeassistant-elasticsearch/issues/278
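
For context, a minimal sketch of the approach, assuming Python's built-in `functools.lru_cache`; the helper name, maxsize, and normalization rule here are illustrative assumptions, not the integration's actual code:

```python
from functools import lru_cache
import re


@lru_cache(maxsize=1024)  # hypothetical maxsize; evicts least-recently-used names when full
def normalize_attribute_name(name: str) -> str:
    """Sanitize an attribute name once, then serve repeat lookups from the cache."""
    # Illustrative rule only: lowercase and collapse characters that are awkward as ES field names
    return re.sub(r"[^a-z0-9_]+", "_", name.lower()).strip("_")
```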

strawgate commented 2 weeks ago

Should be good to go

github-actions[bot] commented 2 weeks ago


Coverage Report
| File | Stmts | Miss | Cover | Missing |
|------|-------|------|-------|---------|
| `__init__.py` | 0 | 0 | 100% | |
| `elasticsearch/__init__.py` | 90 | 2 | 97% | 208–209 |
| `elasticsearch/config_flow.py` | 250 | 28 | 88% | 83, 122, 230, 300, 323, 337, 378–381, 383–384, 386, 388–390, 392–393, 407–408, 412, 452, 458, 498, 573, 582, 616–617 |
| `elasticsearch/const.py` | 33 | 0 | 100% | |
| `elasticsearch/entity_details.py` | 33 | 0 | 100% | |
| `elasticsearch/errors.py` | 28 | 3 | 89% | 51, 60, 67 |
| `elasticsearch/es_doc_creator.py` | 183 | 0 | 100% | |
| `elasticsearch/es_doc_publisher.py` | 258 | 15 | 94% | 103, 245–246, 262–263, 366, 457–458, 460–461, 463–467 |
| `elasticsearch/es_gateway.py` | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| `elasticsearch/es_index_manager.py` | 121 | 8 | 93% | 228–229, 246–247, 252–253, 282–283 |
| `elasticsearch/es_integration.py` | 37 | 2 | 94% | 43–44 |
| `elasticsearch/es_privilege_check.py` | 55 | 0 | 100% | |
| `elasticsearch/es_serializer.py` | 10 | 1 | 90% | 17 |
| `elasticsearch/es_version.py` | 30 | 0 | 100% | |
| `elasticsearch/logger.py` | 2 | 0 | 100% | |
| `elasticsearch/system_info.py` | 25 | 1 | 96% | 37 |
| `elasticsearch/utils.py` | 4 | 0 | 100% | |
| **TOTAL** | 1259 | 80 | 93% | |

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 95 | 0 :zzz: | 0 :x: | 0 :fire: | 7.174s :stopwatch: |

strawgate commented 2 weeks ago

Fixes: #293

strawgate commented 2 weeks ago

My environment has 110 devices and ~600 entities. My unique attribute count is 250, which is well below the 1024 threshold. The 1024 is how many attributes can be in a single batch, not the total number of attributes seen historically.

I may take some time to test how much worse performance gets if the cache starts thrashing, as well as how much memory the LRU consumes with my current ~250 attributes.
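
One way to watch for thrashing, assuming the helper is wrapped with `functools.lru_cache` as sketched above (the helper and its normalization rule are stand-ins, not the integration's actual code), is to inspect the counters `cache_info()` exposes:

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def normalize_attribute_name(name: str) -> str:
    # Stand-in for the real normalization; only the caching behaviour matters here
    return name.lower().replace(" ", "_")


# Exercise the cache with a small set of repeating attribute names
for attr in ["Battery Level", "Temperature", "Humidity"] * 100:
    normalize_attribute_name(attr)

info = normalize_attribute_name.cache_info()
print(f"hits={info.hits} misses={info.misses} currsize={info.currsize}/{info.maxsize}")
# A steadily rising miss count with currsize pinned at maxsize would indicate thrashing.
```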

My immediate guess is that the hashing function probably uses SHA-1, which outputs a constant 20 bytes, and the value is the sanitized name, which probably averages at most 10 bytes. That works out to roughly 30 bytes per entry, times 1024 entries, plus ~30% overhead, or about 39 KB. I imagine we could bump this to something quite large, like 8192 or higher, and only consume a noticeable amount of memory in particularly large environments, in which case the serialization for bulk probably consumes significantly more memory than the memoization.
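
A quick back-of-envelope check of that estimate; the 20-byte key, 10-byte value, and 30% overhead figures are the assumptions from above, not measured values:

```python
# Rough memory estimate for a full cache using the assumed per-entry sizes
key_bytes = 20       # assumed constant-size hash per key (e.g. a SHA-1 digest)
value_bytes = 10     # assumed average length of a sanitized attribute name
overhead = 0.30      # assumed bookkeeping overhead

for entries in (1024, 8192):
    total_kb = entries * (key_bytes + value_bytes) * (1 + overhead) / 1024
    print(f"maxsize={entries}: ~{total_kb:.0f} KB")
# maxsize=1024: ~39 KB, maxsize=8192: ~312 KB
```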