legrego / homeassistant-elasticsearch

Publish Home-Assistant events to Elasticsearch
https://legrego.github.io/homeassistant-elasticsearch/
MIT License

Add Datastream, Serverless, and TSDS support #207

Closed · strawgate closed this 2 months ago

strawgate commented 2 months ago

The integration currently relies on legacy index templates and rollover indices. Ideally, users would default to the latest and greatest for metrics storage in Elastic. As it stands, the integration is non-functional on Serverless.

This PR adds the ability, as the default for new users, to store events as metrics in Elasticsearch in a time series data stream, while retaining the ability to send data to legacy indices for compatibility.

It also implements a number of other desired capabilities, like domain-specific datastreams, while ensuring that users can query and visualize both legacy indices and modern data streams with a single data view. It also adds support for ES Serverless.
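
For context, a minimal sketch of the difference between the two write paths, using the synchronous elasticsearch-py client (the alias and data stream names below are illustrative, not necessarily the integration's actual ones): data streams are append-only, so documents must be written with the create op type, while legacy indices accept plain index requests.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    doc = {
        "@timestamp": "2024-05-01T12:00:00Z",
        "hass.entity.id": "sensor.kitchen_temperature",
        "hass.entity.value": "21.5",
    }

    # Legacy path: a plain index request against a rollover alias.
    es.index(index="active-hass-index-v4_2", document=doc)

    # Data stream path: append-only, so the "create" op type is required.
    es.index(
        index="metrics-homeassistant.sensor-default",
        document=doc,
        op_type="create",
    )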

This doesn't prevent moving to an integration package later if desirable.

Open Questions:

Tasks:

[screenshot]

strawgate commented 2 months ago

https://github.com/legrego/homeassistant-elasticsearch/issues/130

strawgate commented 2 months ago

Also adding https://github.com/legrego/homeassistant-elasticsearch/issues/124 to pull each domain into its own data stream

strawgate commented 2 months ago

I think using TSDB, the new time series database functionality, for the events makes sense; I just need to figure out an appropriate set of default dimensions first
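
For illustration, a TSDS-backed data stream is driven by an index template roughly like the following sketch (the dimension fields chosen here, entity domain and id, are an assumption rather than the PR's final answer to this open question; the template name is also illustrative):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    es.indices.put_index_template(
        name="metrics-homeassistant",
        index_patterns=["metrics-homeassistant.*"],
        data_stream={},
        template={
            "settings": {
                # Turns the backing indices into a time series data stream (TSDS)
                "index.mode": "time_series",
                "index.routing_path": ["hass.entity.domain", "hass.entity.id"],
            },
            "mappings": {
                "properties": {
                    "@timestamp": {"type": "date"},
                    # Dimensions identify which time series a document belongs to
                    "hass.entity.domain": {"type": "keyword", "time_series_dimension": True},
                    "hass.entity.id": {"type": "keyword", "time_series_dimension": True},
                    "hass.entity.value": {"type": "keyword"},
                }
            },
        },
    )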

strawgate commented 2 months ago

[screenshot]

legrego commented 2 months ago

Why do a bunch of the hass.properties have ignore_above 124 on them?

When this was written, my knowledge of both ES and Home Assistant was quite lacking. I didn't know what data would exist in the wild for these fields, nor how ES would handle it. ignore_above was an attempt to make this component resilient to that unknown data and my lack of personal knowledge.

Why does the index template have an event.* mapping? copy/paste?

Yeah this was copy/paste from the original ECS 1.0 specification. I don't believe we use this, so we don't have to carry it forward.

What to do when "All Entities" is picked with Datastreams

Can you elaborate on this one?

strawgate commented 2 months ago

Why do a bunch of the hass.properties have ignore_above 124 on them?

When this was written, my knowledge of both ES and Home Assistant was quite lacking. I didn't know what data would exist in the wild for these fields, nor how ES would handle it. ignore_above was an attempt to make this component resilient to that unknown data and my lack of personal knowledge.

OK, I think a bunch of the fields marked as 124 actually cap out at 255. If there's no concern, I'll just find/replace 124 with 1024 across the board, which is typical for other integrations.
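
(For reference, ignore_above is a per-field parameter on keyword mappings; after the find/replace an affected entry would look like this sketch, with the field name being just an example:)

    # Hypothetical mapping entry after the find/replace: values longer than
    # 1024 characters are skipped for this field's index, but the document
    # itself is still indexed rather than rejected.
    friendly_name_mapping = {"type": "keyword", "ignore_above": 1024}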

Why does the index template have an event.* mapping? copy/paste?

Yeah this was copy/paste from the original ECS 1.0 specification. I don't believe we use this, so we don't have to carry it forward.

Perfect -- I need to remove it to support TSDS / synthetic source

What to do when "All Entities" is picked with Datastreams

Can you elaborate on this one?

I misunderstood what All Entities was for, so I'll remove that from the open questions.

strawgate commented 2 months ago

I've added the choice during integration setup. Datastreams would be the default setting; existing users would be undisturbed; migrating to datastreams would require recreating the integration.

https://github.com/legrego/homeassistant-elasticsearch/assets/6384545/a387a60a-5f81-40b2-8616-3165c6e45b6a

I could probably make it configurable via options and have it reload the integration when the type changes, but every second I spend messing with the config/options flow I feel like I'm losing a valuable piece of myself.

strawgate commented 2 months ago

I am able to modify an existing all-hass-events dataview to include the new datastreams like so:

[screenshot]

and am then able to query both sets of data via that dataview:

[screenshot]

strawgate commented 2 months ago

Another thought I had, perhaps as a follow-up PR, was to use the Python type of the entity.value to also populate strongly typed ES fields like entity.valueas.float, entity.valueas.int, entity.valueas.date, entity.valueas.keyword, and entity.valueas.boolean

This would make it much easier to visualize sensor data

Any thoughts on something like that?

Edit: implemented

strawgate commented 2 months ago

The latest commit should auto-pick datastreams on serverless, auto-pick legacy indices on pre-8.7, and offer legacy indices or datastreams on stateful >= 8.7
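
A sketch of that selection logic (the helper name and return values are made up for illustration; the real decision lives in the config flow):

    def pick_index_mode(version: tuple[int, int], serverless: bool) -> str:
        """Hypothetical helper mirroring the auto-pick behavior described above."""
        if serverless:
            return "datastream"  # serverless has no legacy index support
        if version < (8, 7):
            return "legacy"      # TSDS requires Elasticsearch 8.7+
        return "ask_user"        # stateful >= 8.7: offer both options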

legrego commented 2 months ago

Another thought I had, perhaps as a follow-up PR, was to use the Python type of the entity.value to also populate strongly typed ES fields like entity.valueas.float, entity.valueas.int, entity.valueas.date, entity.valueas.keyword, and entity.valueas.boolean

This would make it much easier to visualize sensor data

Any thoughts on something like that?

Edit: implemented

@strawgate that's an interesting thought. Can you help me understand the benefits of this approach over the existing approach, or why we would benefit from maintaining both as opposed to one or the other?

If you can't tell, I'm not an ingest expert, so I'm relying on your expertise here.

strawgate commented 2 months ago

@strawgate that's an interesting thought. Can you help me understand the benefits of this approach over the existing approach, or why we would benefit from maintaining both as opposed to one or the other?

So here are a couple of areas where I think this is beneficial:

  1. If different sensors provide different units for the same type of measurement (one provides a boolean and one provides an int 0 or 1), you currently get a conflict in Kibana and cannot visualize that type of sensor at all.
  2. Telling Elasticsearch to make subfields of each data type causes a bunch of other issues in Kibana. The current approach makes a float and a keyword field for every attribute, which causes artifacts when filtering.
  3. Certain types like dates are almost entirely brought in as strings, which makes them unavailable for histogram usage. By detecting date types and coercing them explicitly we can create valueas.datetime, valueas.date, and valueas.time, which would allow visualizations like "What time does the sun set each day?" that are not possible today.
  4. For native boolean types we can let the user more easily make visualizations that say "Door Locked: True" versus the current (non-customized) behavior, which would be "Door Locked: 1".

So I would propose keeping .value and offering optional coerced .valueas types.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.
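
A sketch of the kind of best-effort coercion being proposed (the helper name, the boolean state list, and the field layout are all illustrative assumptions, not the PR's actual implementation):

    from datetime import datetime
    from typing import Any

    def coerce_valueas(state: str) -> dict[str, Any]:
        """Best-effort typed views of a raw state string. A failed coercion
        just means the corresponding valueas.* field is omitted."""
        valueas: dict[str, Any] = {}

        # Booleans: an illustrative subset of Home Assistant's on/off-style states
        if state.lower() in ("on", "true", "locked"):
            valueas["boolean"] = True
        elif state.lower() in ("off", "false", "unlocked"):
            valueas["boolean"] = False

        # Floats: numeric strings become a float view
        try:
            valueas["float"] = float(state)
        except ValueError:
            pass

        # Dates: ISO 8601 strings become datetime/date/time views
        try:
            parsed = datetime.fromisoformat(state)
            valueas["datetime"] = parsed.isoformat()
            valueas["date"] = parsed.date().isoformat()
            valueas["time"] = parsed.time().isoformat()
        except ValueError:
            pass

        return valueas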

legrego commented 2 months ago

@strawgate thank you for the detailed response, I found it very helpful.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.

Ah, I would have expected indexing issues. If we detect a datetime field in Python, and place this in the .valueas.date field in the document, won't ES reject the document if it's not in a format that it believes to be a valid date?

strawgate commented 2 months ago

@strawgate thank you for the detailed response, I found it very helpful.

This has maintenance downsides -- we may find that Python types don't strictly match Elasticsearch types and have to make small adjustments to the coercion along the way, but these should not block indexing; they would just leave the user with only the value field for that particular document, which is the current behavior without this change.

Ah, I would have expected indexing issues. If we detect a datetime field in Python, and place this in the .valueas.date field in the document, won't ES reject the document if it's not in a format that it believes to be a valid date?

We would have Python attempt to coerce the value into a datetime object, and if that succeeds we can force a particular output format of the datetime object to ensure that Elasticsearch will always accept it.

Something like this is working for me on my local branch:

    # Create datetime, date, and time fields if the string is a valid ISO date;
    # parse once and reuse the result
    parsed = datetime.fromisoformat(_state)

    document_body["hass.entity"]["valueas"]["datetime"] = parsed.isoformat()
    document_body["hass.entity"]["valueas"]["date"] = parsed.date().isoformat()
    document_body["hass.entity"]["valueas"]["time"] = parsed.time().isoformat()

Index mapping:

    "datetime": {
        "type": "date"
    },
    "date": {
        "type": "date",
        "format": "strict_date"
    },
    "time": {
        "type": "date",
        "format": "HH:mm:ss.SSSSSS||time||strict_hour_minute_second||time_no_millis"
    }

legrego commented 2 months ago

Sounds good. I'm happy to proceed with this approach for now. Thanks!

legrego commented 2 months ago

https://github.com/legrego/homeassistant-elasticsearch/actions/runs/8189910572/job/22395684828?pr=207#step:6:7

    custom_components/elasticsearch/es_doc_creator.py:59:9: C901 state_to_document is too complex (26 > 25)
    Found 2 errors (1 fixed, 1 remaining).
    Error: Process completed with exit code 1.

Feel free to add an ignore for this "issue" if that's simpler. I wasn't aware this lint rule was enabled, and I don't have a strong opinion on the complexity threshold at this point.
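
For reference, both flake8 and Ruff honor an inline noqa comment scoped to the flagged definition, so the narrowest ignore would look something like this (parameter list elided/assumed):

    # Narrowest option: suppress the complexity rule on just this function
    def state_to_document(self, state):  # noqa: C901
        ...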

github-actions[bot] commented 2 months ago

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| __init__.py | 0 | 0 | 100% | |
| elasticsearch/__init__.py | 78 | 2 | 97% | 194–195 |
| elasticsearch/config_flow.py | 221 | 16 | 92% | 70, 103, 204, 274, 294, 306, 355–356, 360, 399, 405, 439, 508, 517, 561–562 |
| elasticsearch/const.py | 24 | 0 | 100% | |
| elasticsearch/entity_details.py | 33 | 0 | 100% | |
| elasticsearch/errors.py | 25 | 2 | 92% | 47, 60 |
| elasticsearch/es_doc_creator.py | 77 | 2 | 97% | 158, 169 |
| elasticsearch/es_doc_publisher.py | 184 | 21 | 88% | 87, 203–204, 220–221, 226, 231, 287, 293–295, 297, 311–312, 314–315, 317–321 |
| elasticsearch/es_gateway.py | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| elasticsearch/es_index_manager.py | 74 | 13 | 82% | 101–102, 113–118, 126–127, 141–142, 162 |
| elasticsearch/es_integration.py | 37 | 2 | 94% | 43–44 |
| elasticsearch/es_privilege_check.py | 50 | 0 | 100% | |
| elasticsearch/es_serializer.py | 10 | 0 | 100% | |
| elasticsearch/es_version.py | 20 | 0 | 100% | |
| elasticsearch/logger.py | 2 | 0 | 100% | |
| elasticsearch/system_info.py | 17 | 1 | 94% | 23 |
| elasticsearch/utils.py | 4 | 0 | 100% | |
| TOTAL | 956 | 79 | 91% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 46 | 0 :zzz: | 0 :x: | 0 :fire: | 4.029s :stopwatch: |

strawgate commented 2 months ago

I'm super unfamiliar with Python tests, so maybe I'll take a stab at it this week, and if you get some time to add some via a PR I can use those as a launching point?

One thing that would really help me is if you could provide a starting example for each kind of test we will need and then I can copy/paste/run different scenarios.

strawgate commented 2 months ago

Ok, added some tests for doc creator, doc publisher, and es_version. Will try to work on this more later this week.

I will also create a new pre-release build on my branch and make sure everything is still looking good on a real instance

strawgate commented 2 months ago

A follow-up question: now that we have coercion with valueas, do we still want to do float coercion on value? Right now value can be a float or string; valueas can have boolean (as of a commit coming in a little bit), datetime, date, time, string, and float.

My proposal: in v1 mode, value can continue to be a string or float, with values like on and off continuing to be coerced to float 0 and 1. In v2 mode, value is always a string, and float, boolean, etc. are coerced into valueas fields. If no coercions are successful, valueas.string will contain the state in addition to value.
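
Concretely, a binary sensor state of "on" would come out roughly like this under the two modes (field layout illustrative, nesting flattened for readability):

    # v1 mode: "on" is coerced directly into the value field as a float
    v1_doc = {"hass.entity.value": 1.0}

    # v2 mode: value keeps the raw string; the typed view lives under valueas
    v2_doc = {"hass.entity.value": "on", "hass.entity.valueas.boolean": True}

    # v2 mode, nothing coercible: valueas.string carries the state alongside value
    v2_uncoerced = {"hass.entity.value": "jammed", "hass.entity.valueas.string": "jammed"}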

strawgate commented 2 months ago

I also propose not populating the host.* fields with "UNKNOWN" in datastream documents if the value isn't known.

legrego commented 2 months ago

One thing that would really help me is if you could provide a starting example for each kind of test we will need and then I can copy/paste/run different scenarios.

It looks like you beat me to this. Let me know if there's still a sample you'd like me to write.

A follow-up question: now that we have coercion with valueas, do we still want to do float coercion on value? Right now value can be a float or string; valueas can have boolean (as of a commit coming in a little bit), datetime, date, time, string, and float.

My proposal: in v1 mode, value can continue to be a string or float, with values like on and off continuing to be coerced to float 0 and 1. In v2 mode, value is always a string, and float, boolean, etc. are coerced into valueas fields. If no coercions are successful, valueas.string will contain the state in addition to value.

Your proposal makes sense to me.

I also propose not populating the host.* fields with "UNKNOWN" in datastream documents if the value isn't known.

++

legrego commented 2 months ago

I broke the CI in an attempt to get the code coverage comment working. We won't be able to trust GH action output on this PR

I might have fixed this. If you pull the latest main into this branch, then we'll know for sure. It should at least post the code coverage comment with the correct numbers. The drill-down links into covered/uncovered lines won't work because I'm not storing those assets anywhere, yet.

strawgate commented 2 months ago

I broke the CI in an attempt to get the code coverage comment working. We won't be able to trust GH action output on this PR

I might have fixed this. If you pull the latest main into this branch, then we'll know for sure. It should at least post the code coverage comment with the correct numbers. The drill-down links into covered/uncovered lines won't work because I'm not storing those assets anywhere, yet.

Sounds good. I ran a code coverage report yesterday that showed coverage at 92%, matching main. I'll spend some more time today increasing code coverage.

Also, having code coverage in the pytest ini settings breaks debugging in VS Code.

strawgate commented 2 months ago

Another edge-case:

In v1 we have hass.geo.location, which is set to the hass server location unless an entity has longitude and latitude properties, in which case it's overwritten with the entity attribute values.

Proposal 1 (My preference): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes if they exist, and the hass server location if they do not.

Proposal 2 (Alternative): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes only if they exist.

Proposal 1 allows calculating/plotting the distance between the hass server and individual sensors when those sensors provide latitude and longitude, while placing all entities at the hass server location by default.

legrego commented 2 months ago

Another edge-case:

In v1 we have hass.geo.location, which is set to the hass server location unless an entity has longitude and latitude properties, in which case it's overwritten with the entity attribute values.

Proposal 1 (My preference): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes if they exist, and the hass server location if they do not.

Proposal 2 (Alternative): hass.geo.location is always the hass server location. hass.entity.geo.location is the location from the entity's attributes only if they exist.

Proposal 1 allows calculating/plotting the distance between the hass server and individual sensors when those sensors provide latitude and longitude, while placing all entities at the hass server location by default.

Good find. I'm happy to move forward with Proposal 1

strawgate commented 2 months ago

So actually I misread the existing code:

In v1 there is host.geo.location, which is always the hass server location, and hass.geo.location, which is populated if the entity has geolocation attributes on it. host vs hass.

In v2 there will be host.geo.location, which will always be the hass server location, and hass.entity.geo.location, which will be populated if the entity has geolocation attributes on it, falling back to the value in host.geo.location if not.
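
A sketch of that v2 behavior (the attribute names follow Home Assistant's latitude/longitude convention; the helper itself is illustrative):

    def geo_fields(server_location: dict, attributes: dict) -> dict:
        """host.geo.location is always the hass server; hass.entity.geo.location
        prefers the entity's own coordinates and falls back to the server's."""
        entity_location = server_location
        if "latitude" in attributes and "longitude" in attributes:
            entity_location = {
                "lat": attributes["latitude"],
                "lon": attributes["longitude"],
            }
        return {
            "host.geo.location": server_location,
            "hass.entity.geo.location": entity_location,
        }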

github-actions[bot] commented 2 months ago

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| __init__.py | 0 | 0 | 100% | |
| elasticsearch/__init__.py | 84 | 2 | 97% | 204–205 |
| elasticsearch/config_flow.py | 258 | 18 | 93% | 86, 129, 237, 307, 330, 368, 405, 457–458, 462, 502, 508, 518, 548, 623, 632, 676–677 |
| elasticsearch/const.py | 31 | 0 | 100% | |
| elasticsearch/entity_details.py | 33 | 0 | 100% | |
| elasticsearch/errors.py | 25 | 2 | 92% | 47, 60 |
| elasticsearch/es_doc_creator.py | 169 | 0 | 100% | |
| elasticsearch/es_doc_publisher.py | 232 | 15 | 93% | 97, 213–214, 230–231, 320, 410–411, 413–414, 416–420 |
| elasticsearch/es_gateway.py | 100 | 20 | 80% | 82–83, 98, 102, 118–120, 122, 124–125, 127–131, 133–137 |
| elasticsearch/es_index_manager.py | 108 | 11 | 89% | 195–200, 208–209, 223–224, 244 |
| elasticsearch/es_integration.py | 37 | 2 | 94% | 43–44 |
| elasticsearch/es_privilege_check.py | 50 | 0 | 100% | |
| elasticsearch/es_serializer.py | 10 | 0 | 100% | |
| elasticsearch/es_version.py | 22 | 0 | 100% | |
| elasticsearch/logger.py | 2 | 0 | 100% | |
| elasticsearch/system_info.py | 17 | 1 | 94% | 23 |
| elasticsearch/utils.py | 4 | 0 | 100% | |
| TOTAL | 1182 | 71 | 93% | |

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 92 | 0 :zzz: | 0 :x: | 0 :fire: | 7.014s :stopwatch: |

strawgate commented 2 months ago

Created a pre-release build and running it against my serverless instance shows it working!

Will try to find some time to walk through a legacy setup and a legacy -> datastream conversion

strawgate commented 2 months ago

Ran a quick test with legacy indexing, which worked great. Just for my own comfort, I set up an integration with legacy indexing, ingested the same entity using mainline and using this PR, did a JSON diff, and confirmed no change in the legacy document format.

strawgate commented 2 months ago

Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/217
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/177
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/130
Fixes: https://github.com/legrego/homeassistant-elasticsearch/issues/124