legrego / homeassistant-elasticsearch

Publish Home-Assistant events to Elasticsearch
https://legrego.github.io/homeassistant-elasticsearch/
MIT License

Add Datastream, Serverless, and TSDS support #207

Closed · strawgate closed this 2 months ago

strawgate commented 2 months ago

The integration currently relies on legacy index templates and rollover indices. Ideally, users would default to the latest and greatest for metrics storage in Elastic. As it stands, the integration is non-functional on Serverless.

This PR adds the ability, as the default for new users, to store events as metrics in Elasticsearch in a time series data stream, while retaining the ability to send data to legacy indices for compatibility.

It also implements a number of other desired capabilities, like domain-specific datastreams, while ensuring that users can query and visualize both legacy indices and modern data streams with a single data view. It also adds support for ES Serverless.
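
For context, a minimal sketch of the difference between the two write paths, using the synchronous elasticsearch-py client (the alias and data stream names below are illustrative, not necessarily the integration's actual ones): data streams are append-only, so documents must be written with the create op type, while legacy indices accept plain index requests.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    doc = {
        "@timestamp": "2024-05-01T12:00:00Z",
        "hass.entity.id": "sensor.kitchen_temperature",
        "hass.entity.value": "21.5",
    }

    # Legacy path: a plain index request against a rollover alias.
    es.index(index="active-hass-index-v4_2", document=doc)

    # Data stream path: append-only, so the "create" op type is required.
    es.index(
        index="metrics-homeassistant.sensor-default",
        document=doc,
        op_type="create",
    )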

This doesn't prevent moving to an integration package later if desirable.

Open Questions:

Tasks:

[screenshot]

strawgate commented 2 months ago

https://github.com/legrego/homeassistant-elasticsearch/issues/130

strawgate commented 2 months ago

Also adding https://github.com/legrego/homeassistant-elasticsearch/issues/124 to pull each domain into its own data stream

strawgate commented 2 months ago

I think using TSDB, the new time series database functionality, for the events makes sense; I just need to figure out an appropriate set of default dimensions first
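
For illustration, a TSDS-backed data stream is driven by an index template roughly like the following sketch (the dimension fields chosen here, entity domain and id, are an assumption rather than the PR's final answer to this open question; the template name is also illustrative):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.put_index_template(
        name="metrics-homeassistant",
        index_patterns=["metrics-homeassistant.*"],
        data_stream={},
        template={
            "settings": {
                # Turns the backing indices into a time series data stream (TSDS)
                "index.mode": "time_series",
                "index.routing_path": ["hass.entity.domain", "hass.entity.id"],
            },
            "mappings": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    # Dimensions identify which time series a document belongs to
                    "hass.entity.domain": {"type": "keyword", "time_series_dimension": True},
                    "hass.entity.id": {"type": "keyword", "time_series_dimension": True},
                    "hass.entity.value": {"type": "keyword"},
                }
            },
        },
    )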

strawgate commented 2 months ago

[screenshot]

legrego commented 2 months ago

Why do a bunch of the hass.properties have ignore_above 124 on them?

When this was written, my knowledge of both ES and Home Assistant was quite lacking. I didn't know what data would exist in the wild for these fields, nor how ES would handle it. ignore_above was an attempt to make this component resilient to that unknown data and my lack of personal knowledge.

Why does the index template have an event.* mapping? copy/paste?

Yeah this was copy/paste from the original ECS 1.0 specification. I don't believe we use this, so we don't have to carry it forward.

What to do when "All Entities" is picked with Datastreams

Can you elaborate on this one?

strawgate commented 2 months ago

Why do a bunch of the hass.properties have ignore_above 124 on them?

When this was written, my knowledge of both ES and Home Assistant was quite lacking. I didn't know what data would exist in the wild for these fields, nor how ES would handle it. ignore_above was an attempt to make this component resilient to that unknown data and my lack of personal knowledge.

OK, I think a bunch of the fields marked as 124 actually cap out at 255. If there's no concern, I'll just find/replace 124 with 1024 across the board, which is typical for other integrations.
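
(For reference, ignore_above is a per-field parameter on keyword mappings; after the find/replace an affected entry would look like this sketch, with the field name being just an example:)

    # Hypothetical mapping entry after the find/replace: values longer than
    # 1024 characters are skipped for this field's index, but the document
    # itself is still indexed rather than rejected.
    friendly_name_mapping = {"type": "keyword", "ignore_above": 1024}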

Why does the index template have an event.* mapping? copy/paste?

Yeah this was copy/paste from the original ECS 1.0 specification. I don't believe we use this, so we don't have to carry it forward.

Perfect -- I need to remove it to support TSDS / synthetic source

What to do when "All Entities" is picked with Datastreams

Can you elaborate on this one?

I misunderstood what All Entities was for, so I'll remove that from the open questions.

strawgate commented 2 months ago

I've added the choice during integration setup. Datastreams would be the default setting; existing users would be undisturbed; migrating to datastreams would require recreating the integration.

https://github.com/legrego/homeassistant-elasticsearch/assets/6384545/a387a60a-5f81-40b2-8616-3165c6e45b6a

I could probably make it configurable via options and have it reload the integration when the type changes, but every second I spend messing with the config/options flow I feel like I'm losing a valuable piece of myself.

strawgate commented 2 months ago

I am able to modify an existing all-hass-events dataview to include the new datastreams like so:

[screenshot]

and am then able to query both sets of data via that dataview:

[screenshot]

strawgate commented 2 months ago

Another thought I had, perhaps as a follow-up PR, was to use the Python type of the entity.value to also populate strongly typed ES fields like entity.valueas.float, entity.valueas.int, entity.valueas.date, entity.valueas.keyword, and entity.valueas.boolean

This would make it much easier to visualize sensor data

Any thoughts on something like that?

Edit: implemented

strawgate commented 2 months ago

The latest commit should auto-pick datastreams on serverless, auto-pick legacy indices on pre-8.7, and offer legacy indices or datastreams on stateful >= 8.7
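
A sketch of that selection logic (the helper name and return values are made up for illustration; the real decision lives in the config flow):

    def pick_index_mode(version: tuple[int, int], serverless: bool) -> str:
        """Hypothetical helper mirroring the auto-pick behavior described above."""
        if serverless:
            return "datastream"  # serverless has no legacy index support
        if version < (8, 7):
            return "legacy"      # TSDS requires Elasticsearch 8.7+
        return "ask_user"        # stateful >= 8.7: offer both options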

legrego commented 2 months ago

Another thought I had, perhaps as a follow-up PR, was to use the Python type of the entity.value to also populate strongly typed ES fields like entity.valueas.float, entity.valueas.int, entity.valueas.date, entity.valueas.keyword, and entity.valueas.boolean

This would make it much easier to visualize sensor data

Any thoughts on something like that?

Edit: implemented

@strawgate that's an interesting thought. Can you help me understand the benefits of this approach over the existing approach, or why we would benefit from maintaining both as opposed to one or the other?

If you can't tell, I'm not an ingest expert, so I'm relying on your expertise here.

strawgate commented 2 months ago

@strawgate that's an interesting thought. Can you help me understand the benefits of this approach over the existing approach, or why we would benefit from maintaining both as opposed to one or the other?

So here are a couple of areas where I think this is beneficial:

  1. If different sensors provide different units for the same type of measurement (one provides a boolean and one provides an int 0 or 1), you currently get a conflict in Kibana and cannot visualize that type of sensor at all.
  2. Telling Elasticsearch to make subfields of each data type causes a bunch of other issues in Kibana. The current approach makes a float and a keyword field for every attribute, which causes artifacts when filtering.
  3. Certain types like dates are almost entirely brought in as strings, which makes them unavailable for histogram usage. By detecting date types and coercing them explicitly we can create valueas.datetime, valueas.date, and valueas.time, which would allow visualizations like "What time does the sun set each day?" that are not possible today.
  4. For native boolean types we can let the user more easily make visualizations that say "Door Locked: True" versus the current (non-customized) behavior, which would be "Door Locked: 1".

So I would propose keeping .value and offering optional coerced .valueas types.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.
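
A sketch of the kind of best-effort coercion being proposed (the helper name, the boolean state list, and the field layout are all illustrative assumptions, not the PR's actual implementation):

    from datetime import datetime
    from typing import Any

    def coerce_valueas(state: str) -> dict[str, Any]:
        """Best-effort typed views of a raw state string. A failed coercion
        just means the corresponding valueas.* field is omitted."""
        valueas: dict[str, Any] = {}

        # Booleans: an illustrative subset of Home Assistant's on/off-style states
        if state.lower() in ("on", "true", "locked"):
            valueas["boolean"] = True
        elif state.lower() in ("off", "false", "unlocked"):
            valueas["boolean"] = False

        # Floats: numeric strings become a float view
        try:
            valueas["float"] = float(state)
        except ValueError:
            pass

        # Dates: ISO 8601 strings become datetime/date/time views
        try:
            parsed = datetime.fromisoformat(state)
            valueas["datetime"] = parsed.isoformat()
            valueas["date"] = parsed.date().isoformat()
            valueas["time"] = parsed.time().isoformat()
        except ValueError:
            pass

        return valueas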

legrego commented 2 months ago

@strawgate thank you for the detailed response, I found it very helpful.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.

Ah, I would have expected indexing issues. If we detect a datetime field in Python, and place this in the .valueas.date field in the document, won't ES reject the document if it's not in a format that it believes to be a valid date?

strawgate commented 2 months ago

@strawgate thank you for the detailed response, I found it very helpful.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.

Ah, I would have expected indexing issues. If we detect a datetime field in Python, and place this in the .valueas.date field in the document, won't ES reject the document if it's not in a format that it believes to be a valid date?

We would have Python attempt to coerce the value into a datetime object, and if that succeeds we can force a particular output format of the datetime object to ensure that Elasticsearch will always accept it.

Something like this is working for me on my local branch:

    # Create datetime, date, and time fields if the string is a valid ISO date;
    # parse once and reuse the result
    parsed = datetime.fromisoformat(_state)

    document_body["hass.entity"]["valueas"]["datetime"] = parsed.isoformat()
    document_body["hass.entity"]["valueas"]["date"] = parsed.date().isoformat()
    document_body["hass.entity"]["valueas"]["time"] = parsed.time().isoformat()

Index mapping:

    "datetime": {
        "type": "date"
    },
    "date": {
        "type": "date",
        "format": "strict_date"
    },
    "time": {
        "type": "date",
        "format": "HH:mm:ss.SSSSSS||time||strict_hour_minute_second||time_no_millis"
    }

legrego commented 2 months ago

Sounds good. I'm happy to proceed with this approach for now. Thanks!

legrego commented 2 months ago

https://github.com/legrego/homeassistant-elasticsearch/actions/runs/8189910572/job/22395684828?pr=207#step:6:7

    custom_components/elasticsearch/es_doc_creator.py:59:9: C901 state_to_document is too complex (26 > 25)
    Found 2 errors (1 fixed, 1 remaining).
    Error: Process completed with exit code 1.

Feel free to add an ignore for this "issue" if that's simpler. I wasn't aware this lint rule was enabled, and I don't have a strong opinion on the complexity threshold at this point.
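
For reference, both flake8 and Ruff honor an inline noqa comment scoped to the flagged definition, so the narrowest ignore would look something like this (parameter list elided/assumed):

    # Narrowest option: suppress the complexity rule on just this function
    def state_to_document(self, state):  # noqa: C901
        ...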

github-actions[bot] commented 2 months ago

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| __init__.py | 0 | 0 | 100% | |
| elasticsearch/__init__.py | 78 | 2 | 97% | 194–195 |
| elasticsearch/config_flow.py | 221 | 16 | 92% | 70, 103, 204, 274, 294, 306, 355–356, 360, 399, 405, 439, 508, 517, 561–562 |
| elasticsearch/const.py | 24 | 0 | 100% | |
| elasticsearch/entity_details.py | 33 | 0 | 100% | |
| elasticsearch/errors.py | 25 | 2 | 92% | 47, 60 |
| elasticsearch/es_doc_creator.py | 77 | 2 | 97% | 158, 169 |
| elasticsearch/es_doc_publisher.py | 184 | 21 | 88% | 87, 203–204, 220–221, 226, 231, 287, 293–295, 297, 311–312, 314–315, 317–321 |
| elasticsearch/es_gateway.py | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| elasticsearch/es_index_manager.py | 74 | 13 | 82% | 101–102, 113–118, 126–127, 141–142, 162 |
| elasticsearch/es_integration.py | 37 | 2 | 94% | 43–44 |
| elasticsearch/es_privilege_check.py | 50 | 0 | 100% | |
| elasticsearch/es_serializer.py | 10 | 0 | 100% | |
| elasticsearch/es_version.py | 20 | 0 | 100% | |
| elasticsearch/logger.py | 2 | 0 | 100% | |
| elasticsearch/system_info.py | 17 | 1 | 94% | 23 |
| elasticsearch/utils.py | 4 | 0 | 100% | |
| TOTAL | 956 | 79 | 91% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 46 | 0 :zzz: | 0 :x: | 0 :fire: | 4.029s :stopwatch: |

strawgate commented 2 months ago

I'm super unfamiliar with Python tests, so maybe I'll take a stab at it this week, and if you get some time to add some via a PR I can use those as a launching point?

One thing that would really help me is if you could provide a starting example for each kind of test we will need and then I can copy/paste/run different scenarios.

strawgate commented 2 months ago

Ok, added some tests for doc creator, doc publisher, and es_version. Will try to work on this more later this week.

I will also create a new pre-release build on my branch and make sure everything is still looking good on a real instance

strawgate commented 2 months ago

A follow-up question: now that we have coercion with valueas, do we still want to do float coercion on value? Right now value can be a float or string; valueas can have boolean (as of a commit coming in a little bit), datetime, date, time, string, and float.

My proposal: in v1 mode, value can continue to be a string or float, with values like on and off continuing to be coerced to float 0 and 1. In v2 mode, value is always a string, and float, boolean, etc. are coerced into valueas fields. If no coercions are successful, valueas.string will contain the state in addition to value.
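
Concretely, a binary sensor state of "on" would come out roughly like this under the two modes (field layout illustrative, nesting flattened for readability):

    # v1 mode: "on" is coerced directly into the value field as a float
    v1_doc = {"hass.entity.value": 1.0}

    # v2 mode: value keeps the raw string; the typed view lives under valueas
    v2_doc = {"hass.entity.value": "on", "hass.entity.valueas.boolean": True}

    # v2 mode, nothing coercible: valueas.string carries the state alongside value
    v2_uncoerced = {"hass.entity.value": "jammed", "hass.entity.valueas.string": "jammed"}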

strawgate commented 2 months ago

I also propose not populating the host.* fields with "UNKNOWN" in datastream documents if the value isn't known.

legrego commented 2 months ago

One thing that would really help me is if you could provide a starting example for each kind of test we will need and then I can copy/paste/run different scenarios.

It looks like you beat me to this. Let me know if there's still a sample you'd like me to write.

A follow-up question: now that we have coercion with valueas, do we still want to do float coercion on value? Right now value can be a float or string; valueas can have boolean (as of a commit coming in a little bit), datetime, date, time, string, and float.

My proposal: in v1 mode, value can continue to be a string or float, with values like on and off continuing to be coerced to float 0 and 1. In v2 mode, value is always a string, and float, boolean, etc. are coerced into valueas fields. If no coercions are successful, valueas.string will contain the state in addition to value.

Your proposal makes sense to me.

I also propose not populating the host.* fields with "UNKNOWN" in datastream documents if the value isn't known.

++

legrego commented 2 months ago

I broke the CI in an attempt to get the code coverage comment working. We won't be able to trust GH action output on this PR

I might have fixed this. If you pull the latest main into this branch, then we'll know for sure. It should at least post the code coverage comment with the correct numbers. The drill-down links into covered/uncovered lines won't work because I'm not storing those assets anywhere, yet.

strawgate commented 2 months ago

I broke the CI in an attempt to get the code coverage comment working. We won't be able to trust GH action output on this PR

I might have fixed this. If you pull the latest main into this branch, then we'll know for sure. It should at least post the code coverage comment with the correct numbers. The drill-down links into covered/uncovered lines won't work because I'm not storing those assets anywhere, yet.

Sounds good. I ran a code coverage report yesterday that showed coverage at 92%, matching main. I'll spend some more time today increasing code coverage.

Also, having code coverage in the pytest ini settings breaks debugging in VS Code.

strawgate commented 2 months ago

Another edge-case:

In v1 we have hass.geo.location, which is set to the hass server location unless an entity has longitude and latitude properties, in which case it's overwritten with the entity attribute values.

Proposal 1 (My preference): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes if they exist, and the hass server location if they do not.

Proposal 2 (Alternative): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes only if they exist.

Proposal 1 allows calculating/plotting the distance between the hass server and individual sensors when those sensors provide latitude and longitude, while placing all entities at the hass server location by default.

legrego commented 2 months ago

Another edge-case:

In v1 we have hass.geo.location, which is set to the hass server location unless an entity has longitude and latitude properties, in which case it's overwritten with the entity attribute values.

Proposal 1 (My preference): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes if they exist, and the hass server location if they do not.

Proposal 2 (Alternative): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes only if they exist.

Proposal 1 allows calculating/plotting the distance between the hass server and individual sensors when those sensors provide latitude and longitude, while placing all entities at the hass server location by default.

Good find. I'm happy to move forward with Proposal 1

strawgate commented 2 months ago

So actually I misread the existing code:

In v1 there is host.geo.location, which is always the hass server location, and hass.geo.location, which is populated if the entity has geolocation attributes on it. host vs hass.

In v2 there will be host.geo.location, which will always be the hass server location, and hass.entity.geo.location, which will be populated if the entity has geolocation attributes on it, falling back to the value in host.geo.location if not.
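
A sketch of that v2 behavior (the attribute names follow Home Assistant's latitude/longitude convention; the helper itself is illustrative):

    def geo_fields(server_location: dict, attributes: dict) -> dict:
        """host.geo.location is always the hass server; hass.entity.geo.location
        prefers the entity's own coordinates and falls back to the server's."""
        entity_location = server_location
        if "latitude" in attributes and "longitude" in attributes:
            entity_location = {
                "lat": attributes["latitude"],
                "lon": attributes["longitude"],
            }
        return {
            "host.geo.location": server_location,
            "hass.entity.geo.location": entity_location,
        }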

github-actions[bot] commented 2 months ago

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| __init__.py | 0 | 0 | 100% | |
| elasticsearch/__init__.py | 84 | 2 | 97% | 204–205 |
| elasticsearch/config_flow.py | 258 | 18 | 93% | 86, 129, 237, 307, 330, 368, 405, 457–458, 462, 502, 508, 518, 548, 623, 632, 676–677 |
| elasticsearch/const.py | 31 | 0 | 100% | |
| elasticsearch/entity_details.py | 33 | 0 | 100% | |
| elasticsearch/errors.py | 25 | 2 | 92% | 47, 60 |
| elasticsearch/es_doc_creator.py | 169 | 0 | 100% | |
| elasticsearch/es_doc_publisher.py | 232 | 15 | 93% | 97, 213–214, 230–231, 320, 410–411, 413–414, 416–420 |
| elasticsearch/es_gateway.py | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| elasticsearch/es_index_manager.py | 108 | 11 | 89% | 195–200, 208–209, 223–224, 244 |
| elasticsearch/es_integration.py | 37 | 2 | 94% | 43–44 |
| elasticsearch/es_privilege_check.py | 50 | 0 | 100% | |
| elasticsearch/es_serializer.py | 10 | 0 | 100% | |
| elasticsearch/es_version.py | 22 | 0 | 100% | |
| elasticsearch/logger.py | 2 | 0 | 100% | |
| elasticsearch/system_info.py | 17 | 1 | 94% | 23 |
| elasticsearch/utils.py | 4 | 0 | 100% | |
| TOTAL | 1182 | 71 | 93% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 92 | 0 :zzz: | 0 :x: | 0 :fire: | 7.014s :stopwatch: |

strawgate commented 2 months ago

Created a pre-release build and running it against my serverless instance shows it working!

Will try to find some time to walk through a legacy setup and a legacy -> datastream conversion

strawgate commented 2 months ago

Ran a quick test with legacy indexing, which worked great. Just for my own comfort, I set up an integration with legacy indexing, ingested the same entity using mainline and using this PR, did a JSON diff, and confirmed no change in the legacy document format.

strawgate commented 2 months ago

Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/217
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/177
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/130
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/124