opensearch-project / index-management

🗃 Automate periodic data operations, such as deleting indices at a certain age or performing a rollover at a certain size
https://opensearch.org/docs/latest/im-plugin/index/
Apache License 2.0

[BUG] Incomplete search results against rollup index composed of data from multiple rollup jobs #903

Open sharebear opened 1 year ago

sharebear commented 1 year ago

Describe the bug

Incomplete results when querying a rollup index with a date histogram.

To Reproduce

Enter the following into the dev console and execute each request in sequence, waiting for the rollup jobs to complete before running the searches.

# Insert some test data

POST sharej-data-2023-07-25/_doc
{
  "timestamp": "2023-07-25T14:10:43.1",
  "numberOfCalls": 1
}

POST sharej-data-2023-07-25/_doc
{
  "timestamp": "2023-07-25T14:12:43.1",
  "numberOfCalls": 1
}

POST sharej-data-2023-07-24/_doc
{
  "timestamp": "2023-07-24T13:14:43.1",
  "numberOfCalls": 4
}

POST sharej-data-2023-07-23/_doc
{
  "timestamp": "2023-07-23T15:11:43.1",
  "numberOfCalls": 2
}

POST sharej-data-2023-07-23/_doc
{
  "timestamp": "2023-07-23T17:18:43.1",
  "numberOfCalls": 4
}

# Create rollup job for above indexes, emulating what happens when you have an ISM policy applying the rollup to each index after X days
PUT _plugins/_rollup/jobs/sharej-rollup-2023-07-23
{
  "rollup": {
    "enabled": true,
    "source_index": "sharej-data-2023-07-23",
    "target_index": "rollup-sharej-data-2023",
    "schedule": {
      "interval": {
        "start_time": 1,
        "period": "1",
        "unit": "Minutes"
      }
    },
    "description": "Test rollup",
    "page_size": 1000,
    "delay": 0,
    "continuous": false,
    "dimensions": [
      {
        "date_histogram": {
          "source_field": "timestamp",
          "fixed_interval": "1h",
          "timezone": "UTC"
        }
      }
    ],
    "metrics": [
      {
        "source_field": "numberOfCalls",
        "metrics": [
          {
            "avg": {}
          },
          {
            "sum": {}
          },
          {
            "max": {}
          },
          {
            "min": {}
          },
          {
            "value_count": {}
          }
        ]
      }
    ]
  }
}

PUT _plugins/_rollup/jobs/sharej-rollup-2023-07-24
{
  "rollup": {
    "enabled": true,
    "source_index": "sharej-data-2023-07-24",
    "target_index": "rollup-sharej-data-2023",
    "schedule": {
      "interval": {
        "start_time": 1,
        "period": "1",
        "unit": "Minutes"
      }
    },
    "description": "Test rollup",
    "page_size": 1000,
    "delay": 0,
    "continuous": false,
    "dimensions": [
      {
        "date_histogram": {
          "source_field": "timestamp",
          "fixed_interval": "1h",
          "timezone": "UTC"
        }
      }
    ],
    "metrics": [
      {
        "source_field": "numberOfCalls",
        "metrics": [
          {
            "avg": {}
          },
          {
            "sum": {}
          },
          {
            "max": {}
          },
          {
            "min": {}
          },
          {
            "value_count": {}
          }
        ]
      }
    ]
  }
}

PUT _plugins/_rollup/jobs/sharej-rollup-2023-07-25
{
  "rollup": {
    "enabled": true,
    "source_index": "sharej-data-2023-07-25",
    "target_index": "rollup-sharej-data-2023",
    "schedule": {
      "interval": {
        "start_time": 1,
        "period": "1",
        "unit": "Minutes"
      }
    },
    "description": "Test rollup",
    "page_size": 1000,
    "delay": 0,
    "continuous": false,
    "dimensions": [
      {
        "date_histogram": {
          "source_field": "timestamp",
          "fixed_interval": "1h",
          "timezone": "UTC"
        }
      }
    ],
    "metrics": [
      {
        "source_field": "numberOfCalls",
        "metrics": [
          {
            "avg": {}
          },
          {
            "sum": {}
          },
          {
            "max": {}
          },
          {
            "min": {}
          },
          {
            "value_count": {}
          }
        ]
      }
    ]
  }
}

# Watch status of rollup jobs until complete

GET _plugins/_rollup/jobs/sharej-rollup-2023-07-23/_explain

GET _plugins/_rollup/jobs/sharej-rollup-2023-07-24/_explain

GET _plugins/_rollup/jobs/sharej-rollup-2023-07-25/_explain

# Execute query against source data. Three buckets returned
GET sharej-data-2023-*/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "by_day": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggregations": {
        "totalCalls": {
          "sum": {
            "field": "numberOfCalls"
          }
        }
      }
    }
  }
}

# Execute query against rollup data. Only 1 bucket returned!?!?!? Where's the rest of the data?
GET rollup-sharej-data-2023/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "by_day": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggregations": {
        "totalCalls": {
          "sum": {
            "field": "numberOfCalls"
          }
        }
      }
    }
  }
}

# Execute against rollup data without query. Expected result again (but this isn't the query we get when adding a visualisation)
GET rollup-sharej-data-2023/_search
{
  "size": 0,
  "aggregations": {
    "by_day": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1d"
      },
      "aggregations": {
        "totalCalls": {
          "sum": {
            "field": "numberOfCalls"
          }
        }
      }
    }
  }
}
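
The three rollup job bodies in the reproduction differ only in the job ID and `source_index`. As a convenience for anyone re-running the steps, here is a minimal Python sketch that generates the same dev console requests. The function names `build_rollup_job` and `console_request` are made up for this illustration and are not part of any OpenSearch client.

```python
import json


def build_rollup_job(day: str, target_index: str = "rollup-sharej-data-2023") -> dict:
    """Return the rollup job body for one daily index, e.g. day="2023-07-23"."""
    # Same five metrics, in the same order, as in the hand-written job bodies.
    metrics = ["avg", "sum", "max", "min", "value_count"]
    return {
        "rollup": {
            "enabled": True,
            "source_index": f"sharej-data-{day}",
            "target_index": target_index,
            "schedule": {
                "interval": {"start_time": 1, "period": "1", "unit": "Minutes"}
            },
            "description": "Test rollup",
            "page_size": 1000,
            "delay": 0,
            "continuous": False,
            "dimensions": [
                {
                    "date_histogram": {
                        "source_field": "timestamp",
                        "fixed_interval": "1h",
                        "timezone": "UTC",
                    }
                }
            ],
            "metrics": [
                {
                    "source_field": "numberOfCalls",
                    "metrics": [{name: {}} for name in metrics],
                }
            ],
        }
    }


def console_request(day: str) -> str:
    """Format one job as a dev console request: method and path, then the JSON body."""
    body = json.dumps(build_rollup_job(day), indent=2)
    return f"PUT _plugins/_rollup/jobs/sharej-rollup-{day}\n{body}"


if __name__ == "__main__":
    # Print the three requests used in the reproduction, ready to paste.
    for day in ("2023-07-23", "2023-07-24", "2023-07-25"):
        print(console_request(day))
        print()
```

All three jobs write to the same target index, which is the condition under which the incomplete results show up.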

Expected behavior

All three queries at the end should return the same results. What appears to be happening is that the results from only one of the rollup jobs are returned when the query parameter is provided to the search against the rollup index.
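
To make "the same results" checkable, the sketch below compares the `by_day` buckets of two search responses. The response dictionaries are hand-built from the test data above (6, 4 and 2 calls per day) and are illustrative only, not captured cluster output; which single bucket the rollup search actually returns is an assumption here. The helper names `bucket_totals` and `missing_buckets` are hypothetical.

```python
def bucket_totals(response: dict, agg: str = "by_day", metric: str = "totalCalls") -> dict:
    """Map each date_histogram bucket key (epoch millis) to its metric value."""
    buckets = response["aggregations"][agg]["buckets"]
    return {b["key"]: b[metric]["value"] for b in buckets}


def missing_buckets(source: dict, rollup: dict) -> dict:
    """Return the source buckets that are absent from, or differ in, the rollup response."""
    expected = bucket_totals(source)
    actual = bucket_totals(rollup)
    return {key: value for key, value in expected.items() if actual.get(key) != value}


# Illustrative response for the search against sharej-data-2023-*.
SOURCE_EXAMPLE = {
    "aggregations": {
        "by_day": {
            "buckets": [
                {"key": 1690070400000, "doc_count": 2, "totalCalls": {"value": 6.0}},  # 2023-07-23
                {"key": 1690156800000, "doc_count": 1, "totalCalls": {"value": 4.0}},  # 2023-07-24
                {"key": 1690243200000, "doc_count": 2, "totalCalls": {"value": 2.0}},  # 2023-07-25
            ]
        }
    }
}

# Illustrative response for the rollup search with a query: a single bucket.
# Assumption: the surviving bucket is 2023-07-23; the issue does not say which one it is.
ROLLUP_EXAMPLE = {
    "aggregations": {
        "by_day": {
            "buckets": [
                {"key": 1690070400000, "doc_count": 2, "totalCalls": {"value": 6.0}},
            ]
        }
    }
}

if __name__ == "__main__":
    print(missing_buckets(SOURCE_EXAMPLE, ROLLUP_EXAMPLE))
```

Comparing on the numeric `key` rather than `key_as_string` avoids depending on the date format of either index.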

Additional context

We've got some metrics that we post to daily indexes. We have an ISM policy applied to the daily index pattern that, after three days, performs a rollup to an annual index and deletes the source index. When trying to create visualisations based on the rollup index we're getting strange results. When hand-crafting a search against the rollup index I can see that all the expected data is there, but when the equivalent query is issued by a visualisation on a dashboard we're missing data. The difference between my hand-crafted search and the search from the dashboard is the presence of the query field, which narrows down the time frame and optionally drills down on other facets (not included in the code example above). How do we get our visualisations to show all the data, or have I stumbled upon a genuine bug here?

msfroh commented 1 year ago

Should we move this to https://github.com/opensearch-project/index-management ?

KagariSan commented 4 months ago

Hi @sharebear and @msfroh,

I've identified the cause of the reported behavior and believe this issue can now be closed.

The behavior is related to the code snippet found here:

https://github.com/opensearch-project/index-management/blob/d4ee795e22f4490b78662f171f62d566a81c1abc/src/main/kotlin/org/opensearch/indexmanagement/rollup/interceptor/RollupInterceptor.kt#L347

This code references the setting "plugins.rollup.search.search_all_jobs", documented here:

https://opensearch.org/docs/2.4/im-plugin/index-rollups/settings/

To modify the current behavior, you can update your cluster configuration using the following API call:

PUT https://localhost:9200/_cluster/settings
Content-Type: application/json

{
  "persistent": {
    "plugins.rollup.search.search_all_jobs": true
  },
  "transient": {
    "plugins.rollup.search.search_all_jobs": true
  }
}

This update will enable searching across all rollup jobs, both persistently and transiently.
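
For anyone scripting this change, here is a rough Python sketch that builds the same request with the standard library. It sets only the `persistent` value, on the assumption that this is usually enough: transient settings take precedence while present but do not survive a full cluster restart. The function name `search_all_jobs_request` and the default base URL are placeholders, and authentication and TLS handling are left out.

```python
import json
import urllib.request


def search_all_jobs_request(
    base_url: str = "https://localhost:9200", enabled: bool = True
) -> urllib.request.Request:
    """Build (but do not send) a PUT _cluster/settings request for the rollup setting."""
    body = {"persistent": {"plugins.rollup.search.search_all_jobs": enabled}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = search_all_jobs_request()
    # Only show what would be sent. Sending it would be urllib.request.urlopen(req),
    # which needs credentials and certificate handling suited to your cluster.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```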

Please don't hesitate to let me know if you have any questions or need any more help.

sharebear commented 3 months ago

Thanks, I've confirmed that the setting does seem to resolve my issue in local testing. I just need to work out how to get it set on my Aiven-hosted instance (not your problem).