legrego / homeassistant-elasticsearch

Publish Home-Assistant events to Elasticsearch
https://legrego.github.io/homeassistant-elasticsearch/
MIT License

Use lru cache for datastream and attribute normalization #294

Closed · strawgate closed this 2 weeks ago

strawgate commented 2 weeks ago

Part of https://github.com/legrego/homeassistant-elasticsearch/issues/278
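
For context, a minimal sketch of the approach, assuming Python's built-in `functools.lru_cache`; the helper name, maxsize, and normalization rule here are illustrative assumptions, not the integration's actual code:

```python
from functools import lru_cache
import re


@lru_cache(maxsize=1024)  # hypothetical maxsize; evicts least-recently-used names when full
def normalize_attribute_name(name: str) -> str:
    """Sanitize an attribute name once, then serve repeat lookups from the cache."""
    # Illustrative rule only: lowercase and collapse characters that are awkward as ES field names
    return re.sub(r"[^a-z0-9_]+", "_", name.lower()).strip("_")
```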

strawgate commented 2 weeks ago

Should be good to go

github-actions[bot] commented 2 weeks ago


Coverage Report
| File | Stmts | Miss | Cover | Missing |
|------|-------|------|-------|---------|
| `__init__.py` | 0 | 0 | 100% | |
| `elasticsearch/__init__.py` | 90 | 2 | 97% | 208–209 |
| `elasticsearch/config_flow.py` | 250 | 28 | 88% | 83, 122, 230, 300, 323, 337, 378–381, 383–384, 386, 388–390, 392–393, 407–408, 412, 452, 458, 498, 573, 582, 616–617 |
| `elasticsearch/const.py` | 33 | 0 | 100% | |
| `elasticsearch/entity_details.py` | 33 | 0 | 100% | |
| `elasticsearch/errors.py` | 28 | 3 | 89% | 51, 60, 67 |
| `elasticsearch/es_doc_creator.py` | 183 | 0 | 100% | |
| `elasticsearch/es_doc_publisher.py` | 258 | 15 | 94% | 103, 245–246, 262–263, 366, 457–458, 460–461, 463–467 |
| `elasticsearch/es_gateway.py` | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| `elasticsearch/es_index_manager.py` | 121 | 8 | 93% | 228–229, 246–247, 252–253, 282–283 |
| `elasticsearch/es_integration.py` | 37 | 2 | 94% | 43–44 |
| `elasticsearch/es_privilege_check.py` | 55 | 0 | 100% | |
| `elasticsearch/es_serializer.py` | 10 | 1 | 90% | 17 |
| `elasticsearch/es_version.py` | 30 | 0 | 100% | |
| `elasticsearch/logger.py` | 2 | 0 | 100% | |
| `elasticsearch/system_info.py` | 25 | 1 | 96% | 37 |
| `elasticsearch/utils.py` | 4 | 0 | 100% | |
| **TOTAL** | 1259 | 80 | 93% | |

| Tests | Skipped | Failures | Errors | Time |
|-------|---------|----------|--------|------|
| 95 | 0 :zzz: | 0 :x: | 0 :fire: | 7.174s :stopwatch: |

strawgate commented 2 weeks ago

Fixes: #293

strawgate commented 2 weeks ago

My environment has 110 devices and ~600 entities. My unique attribute count is 250, which is well below the 1024 threshold. The 1024 is how many attributes can be in a single batch, not the total number of attributes seen historically.

I may take some time to test how much worse performance gets if the cache starts thrashing, as well as how much memory the LRU consumes with my current ~250 attributes.
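
One way to watch for thrashing, assuming the helper is wrapped with `functools.lru_cache` as sketched above (the helper and its normalization rule are stand-ins, not the integration's actual code), is to inspect the counters `cache_info()` exposes:

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def normalize_attribute_name(name: str) -> str:
    # Stand-in for the real normalization; only the caching behaviour matters here
    return name.lower().replace(" ", "_")


# Exercise the cache with a small set of repeating attribute names
for attr in ["Battery Level", "Temperature", "Humidity"] * 100:
    normalize_attribute_name(attr)

info = normalize_attribute_name.cache_info()
print(f"hits={info.hits} misses={info.misses} currsize={info.currsize}/{info.maxsize}")
# A steadily rising miss count with currsize pinned at maxsize would indicate thrashing.
```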

My immediate guess is that the hashing function probably uses SHA-1, which outputs a constant 20 bytes, and the value is the sanitized name, which probably averages at most 10 bytes. That works out to roughly 30 bytes per entry, times 1024 entries, plus ~30% overhead, or about 39 KB. I imagine we could bump this to something quite large, like 8192 or higher, and only consume a noticeable amount of memory in particularly large environments, in which case the serialization for bulk probably consumes significantly more memory than the memoization.
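
A quick back-of-envelope check of that estimate; the 20-byte key, 10-byte value, and 30% overhead figures are the assumptions from above, not measured values:

```python
# Rough memory estimate for a full cache using the assumed per-entry sizes
key_bytes = 20       # assumed constant-size hash per key (e.g. a SHA-1 digest)
value_bytes = 10     # assumed average length of a sanitized attribute name
overhead = 0.30      # assumed bookkeeping overhead

for entries in (1024, 8192):
    total_kb = entries * (key_bytes + value_bytes) * (1 + overhead) / 1024
    print(f"maxsize={entries}: ~{total_kb:.0f} KB")
# maxsize=1024: ~39 KB, maxsize=8192: ~312 KB
```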