opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

Transform job aggregations for missing field causing issues... #1273

Open MahendraAkkina opened 1 month ago

MahendraAkkina commented 1 month ago

What is the bug? In a transform job, the max, min, and avg aggregations on a missing field produce -Infinity, Infinity, and NaN respectively, while value_count and sum produce 0.
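One plausible explanation for these values, sketched in Python: if each metric starts from its identity element and no value is ever folded in, the starting value is what gets written out. This is an assumption about the cause, not a description of the plugin's actual code, and `reduce_metrics` is a hypothetical helper.

```python
import math

def reduce_metrics(values):
    """Fold values into max/min/avg/sum/value_count, starting from identity elements."""
    max_v, min_v, total, count = -math.inf, math.inf, 0.0, 0
    for v in values:
        max_v = max(max_v, v)
        min_v = min(min_v, v)
        total += v
        count += 1
    # An average over zero values is undefined, so it is reported as NaN here.
    avg = total / count if count else math.nan
    return {"max": max_v, "min": min_v, "avg": avg, "sum": total, "value_count": count}

# No document in the bucket has the field, so nothing is folded in.
empty = reduce_metrics([])
```

With an empty input, `empty` holds -Infinity for max, Infinity for min, NaN for avg, and 0 for sum and value_count, which matches the pattern reported above.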

The problem is that the target index is populated with new fields holding these values, and some of those fields end up mapped as text. Setting "missing": 0 on the aggregation for numeric fields behaves better, but it is not ideal because it misrepresents the data.
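A small arithmetic sketch of why substituting 0 distorts the result. The readings are made-up values, and `avg` is a hypothetical helper that only mimics the idea of a "missing" substitute; it is not the OpenSearch implementation.

```python
def avg(values, missing=None):
    """Average the values; None entries are skipped unless a substitute is given."""
    if missing is not None:
        values = [missing if v is None else v for v in values]
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else float("nan")

# Hypothetical bucket: two documents carry the field, two do not.
readings = [40.0, None, 60.0, None]

true_avg = avg(readings)               # only the two real readings count
padded_avg = avg(readings, missing=0)  # absent readings are counted as 0
```

Here `true_avg` is 50.0 while `padded_avg` is 25.0: the substitute halves the average even though no reading of 0 was ever recorded.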

What I really want is for fields derived from a missing source field to be omitted from the target index for those documents. Is there a way to accomplish this? This is a showstopper for us: the default behavior does not work, and using "missing" misrepresents the data.

How can one reproduce the bug? Steps to reproduce the behavior: create a transform job such as the following example.

{
    "transform": {
        "enabled": true,
        "continuous": true,
        "schedule": {
            "interval": {
                "period": 5,
                "unit": "Minutes"
            }
        },
        "description": "Sample transform job",
        "source_index": "sample",
        "target_index": "sample_transform",
        "data_selection_query": {
            "match_all": {}
        },
        "page_size": 1,
        "groups": [
            {
                "date_histogram": {
                    "source_field": "timestamp",
                    "fixed_interval": "60m",
                    "timezone": "UTC"
                }
            },
            {
                "terms": {
                    "source_field": "device.keyword",
                    "target_field": "device"
                }
            }
        ],
        "aggregations": {
            "m1_value_count": {
                "value_count": {
                    "field": "m1"
                }
            },
            "m1_avg": {
                "avg": {
                    "field": "m1"
                }
            },
            "m1_max": {
                "max": {
                    "field": "m1"
                }
            },
            "m1_min": {
                "min": {
                    "field": "m1"
                }
            },
            "m1_sum": {
                "sum": {
                    "field": "m1"
                }
            },
            "m3_value_count": {
                "value_count": {
                    "field": "m3"
                }
            },
            "m3_avg": {
                "avg": {
                    "field": "m3"
                }
            },
            "m3_max": {
                "max": {
                    "field": "m3"
                }
            },
            "m3_min": {
                "min": {
                    "field": "m3"
                }
            },
            "m3_sum": {
                "sum": {
                    "field": "m3"
                }
            }
        }
    }
}

In the target index, the m3-related fields are populated as follows whenever m3 is missing in the time interval:

"_source": {
                    "transform._id": "metric_all_3_transform_job",
                    "_doc_count": 22,
                    "transform._doc_count": 22,
                    "timestamp": 1728007200000,
                    "device": "1.1.1.1",
                    "m1_max": 99.13,
                    "m1_min": 17.66,
                    "m1_avg": 56.58500000000001,
                    "m1_value_count": 22.0,
                    "m1_sum": 1244.8700000000001,
                    "m3_max": "-Infinity",
                    "m3_min": "Infinity",
                    "m3_avg": "NaN",
                    "m3_sum": 0.0,
                    "m3_value_count": 0.0
                }

What is the expected behavior? The m3* fields should not be generated at all in such cases.
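To make the expected output concrete, here is a minimal Python sketch that post-processes a transformed document. It assumes the `<field>_<aggregation>` naming used in the job above and treats a value_count of 0 as the signal that the source field was missing; `drop_empty_metrics` is a hypothetical helper, not something the plugin provides.

```python
AGGS = ("max", "min", "avg", "sum", "value_count")

def drop_empty_metrics(doc):
    """Return a copy of doc without the metric fields of any source field whose value_count is 0."""
    suffix = "_value_count"
    # Source-field prefixes (e.g. "m3") for which no values were aggregated.
    empty = [k[: -len(suffix)] for k, v in doc.items() if k.endswith(suffix) and v == 0]
    drop = {f"{prefix}_{agg}" for prefix in empty for agg in AGGS}
    return {k: v for k, v in doc.items() if k not in drop}

# The document from the target index shown above.
sample = {
    "transform._id": "metric_all_3_transform_job",
    "_doc_count": 22,
    "transform._doc_count": 22,
    "timestamp": 1728007200000,
    "device": "1.1.1.1",
    "m1_max": 99.13, "m1_min": 17.66, "m1_avg": 56.58500000000001,
    "m1_value_count": 22.0, "m1_sum": 1244.8700000000001,
    "m3_max": "-Infinity", "m3_min": "Infinity", "m3_avg": "NaN",
    "m3_sum": 0.0, "m3_value_count": 0.0,
}

cleaned = drop_empty_metrics(sample)
```

`cleaned` keeps the grouping fields and all m1 metrics and contains no m3 fields, which is the shape I would like the transform to write directly.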

What is your host/environment? v2.11

bharath-techie commented 1 month ago

[ Triage attendees - 1 2 3 4]

One solution is to add a flag that skips adding missing fields to the transformed documents.

MahendraAkkina commented 1 month ago

@bharath-techie Is this option available now, or is there a workaround to achieve this? Or are you suggesting an enhancement that adds an option to skip these fields?

MahendraAkkina commented 1 month ago

This is a showstopper for us. Any thoughts from anyone?