opendistro-for-elasticsearch / index-management

🗃 Open Distro Index Management
https://opendistro.github.io/
Apache License 2.0

Query on rollup index for average aggregation metric is giving incorrect results #440

Open Sreevani871 opened 3 years ago

Sreevani871 commented 3 years ago

Describe the bug

The same aggregation query was run against the source index and the rollup index to compare metric values, and the results do not match. The average aggregation on the rollup index returns incorrect results.

Rollup Job Configuration

    curl -XPUT "localhost:9200/_opendistro/_rollup/jobs/rollup-test?pretty" \
      -H "Content-Type:application/json" -d '
    {
      "rollup": {
        "enabled": true,
        "schedule": { "cron": { "expression": "*/1 * * * *", "timezone": "UTC" } },
        "description": "Test rollup job",
        "source_index": "jaeger-span-2021.04.17-000103",
        "target_index": "rollup-test",
        "page_size": 5000,
        "delay": 300,
        "continuous": false,
        "dimensions": [
          { "date_histogram": { "source_field": "startTimeMillis", "fixed_interval": "1h", "timezone": "UTC" } },
          { "terms": { "source_field": "process.serviceName" } },
          { "terms": { "source_field": "process.tag.application@version" } },
          { "terms": { "source_field": "operationName" } },
          { "terms": { "source_field": "exception.type" } },
          { "terms": { "source_field": "exception.message" } }
        ],
        "metrics": [
          {
            "source_field": "duration",
            "metrics": [ { "avg": {} }, { "max": {} }, { "min": {} }, { "sum": {} }, { "value_count": {} } ]
          }
        ]
      }
    }'

Query on Rollup Index

Request

    curl -X GET "localhost:9200/rollup-test/_search?pretty&size=0" \
      -H 'Content-Type: application/json' -d'
    {
      "query": {
        "bool": {
          "must": [
            { "terms": { "process.serviceName": [ "service-xxxxxx" ] } }
          ]
        }
      },
      "aggregations": {
        "timeline": {
          "date_histogram": { "field": "startTimeMillis", "fixed_interval": "1h" },
          "aggs": {
            "service": {
              "terms": { "field": "process.serviceName" },
              "aggs": {
                "avg_duration": { "avg": { "field": "duration" } },
                "max_duration": { "max": { "field": "duration" } },
                "min_duration": { "min": { "field": "duration" } },
                "count": { "value_count": { "field": "duration" } },
                "sum": { "sum": { "field": "duration" } }
              }
            }
          }
        }
      }
    }'

Response rollup-index-response.txt

Query on Source Index

Request

    curl -X GET "localhost:9200/jaeger-span-2021.04.17-000103/_search?pretty&size=0" \
      -H 'Content-Type: application/json' -d'
    {
      "query": {
        "bool": {
          "must": [
            { "terms": { "process.serviceName": [ "service-xxxxxx" ] } }
          ]
        }
      },
      "aggregations": {
        "timeline": {
          "date_histogram": { "field": "startTimeMillis", "fixed_interval": "1h" },
          "aggs": {
            "service": {
              "terms": { "field": "process.serviceName" },
              "aggs": {
                "avg_duration": { "avg": { "field": "duration" } },
                "max_duration": { "max": { "field": "duration" } },
                "min_duration": { "min": { "field": "duration" } },
                "count": { "value_count": { "field": "duration" } },
                "sum": { "sum": { "field": "duration" } }
              }
            }
          }
        }
      }
    }'

Response source-index-response.txt

Setup Details

All other metrics (SUM, VALUE_COUNT, MIN, MAX) give correct results and match the aggregation metrics of the source index. Only the average is incorrect. Consider the following bucket, taken from the response to the rollup index query:

    {
      "key_as_string" : "2021-04-17T02:00:00.000Z",
      "key" : 1618624800000,
      "doc_count" : 562,
      "service" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "service-xxxxxx",
            "doc_count" : 562,
            "avg_duration" : { "value" : 754.1463076048377 },
            "count" : { "value" : 2847569 },
            "min_duration" : { "value" : 37.0 },
            "sum" : { "value" : 1.5818190941E10 },
            "max_duration" : { "value" : 2.07551568E8 }
          }
        ]
      }
    }

Here the expected avg_duration is 1.5818190941E10 / 2847569 = 5,554.9807365511, but the value returned in the response is avg_duration = 754.1463076048377.
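The expected average can be rechecked from the sum and value_count of the same bucket. The snippet below is only a small sanity check using the numbers quoted in this issue; the helper name `expected_avg` and the `bucket` dictionary are made up for illustration and are not part of the plugin.

```python
# Values copied from the rollup-index bucket quoted in this issue.
bucket = {
    "avg_duration": 754.1463076048377,  # avg as returned by the rollup index
    "count": 2847569,                   # value_count of duration
    "sum": 1.5818190941e10,             # sum of duration
}


def expected_avg(total: float, count: int) -> float:
    """Recompute the average from a sum and a value_count (hypothetical helper)."""
    return total / count


recomputed = expected_avg(bucket["sum"], bucket["count"])

print(f"recomputed avg: {recomputed:.4f}")
print(f"reported avg:   {bucket['avg_duration']:.4f}")
print(f"mismatch:       {recomputed != bucket['avg_duration']}")
```

Since sum and value_count both match the source index, recomputing the average this way on the client side may serve as a stopgap while the avg metric itself is wrong.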

Can anyone explain the reason behind this discrepancy?

RashmiRam commented 3 years ago

This line https://github.com/opendistro-for-elasticsearch/index-management/blob/v1.12.0.0/src/main/kotlin/com/amazon/opendistroforelasticsearch/indexmanagement/rollup/util/RollupUtils.kt#L246 should be changed to state.sums = 0L; state.counts = 0L; so that both accumulators start from long literals rather than int literals.

Ref: https://www.elastic.co/guide/en/elasticsearch/painless/7.10/painless-literals.html#integer-literals
Ref: https://github.com/elastic/elasticsearch/issues/27199

Every aggregation that shows a wrong avg value takes the sum to be 2147483647 (the maximum 32-bit signed integer) and divides it by the count, which produces the wrong result. This can be verified by multiplying each incorrect avg in the rolled-up search response by its count: the product is 2147483647.
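The check described above can be run on the bucket from the original report. The sketch below is illustrative only: `saturating_int` is a hypothetical model that assumes the running sum is narrowed to a 32-bit int with Java-style saturation, and it does not reproduce the actual Painless script.

```python
INT_MAX = 2**31 - 1  # 2147483647, the largest 32-bit signed integer
INT_MIN = -2**31

# Values copied from the rollup-index bucket in the original report.
reported_avg = 754.1463076048377
count = 2847569
true_sum = 1.5818190941e10

# Step 1: multiply the wrong avg by the count to recover the sum that was used.
implied_sum = reported_avg * count


def saturating_int(value: float) -> int:
    """Assumed model: clamp a floating-point value into the 32-bit int range."""
    return max(INT_MIN, min(INT_MAX, int(value)))


# Step 2: under that assumption, the true sum collapses to INT_MAX,
# and dividing by the count gives the avg seen in the response.
clamped_sum = saturating_int(true_sum)
modelled_avg = clamped_sum / count

print(f"implied sum:  {implied_sum:.1f}")
print(f"clamped sum:  {clamped_sum}")
print(f"modelled avg: {modelled_avg}")
```

If other buckets with a wrong avg give the same product, that supports the int-literal explanation; buckets whose true sum stays below 2147483647 would not be affected under this model.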

Sreevani871 commented 3 years ago

Any help here, @dbbaughe? One more issue concerns the delay field in the rollup job configuration. When I configured the job with continuous set to true and delay set to 300000 (milliseconds), the job execution did not honour the delay. In the code, the delay field is defined as a long. What time unit is it interpreted as during execution?

Sreevani871 commented 3 years ago

Any help here?