opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] Unable to use date_histogram as a transform group #8138

Open Cobraeti opened 1 year ago

Cobraeti commented 1 year ago

Describe the bug

When trying to create a transform job on the "Sample web logs" index to aggregate into 5-minute buckets, I get the following response, even though the documentation states that date_histogram is supported and takes a "field" parameter:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Invalid field [field] found in date histogram"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Invalid field [field] found in date histogram"
  },
  "status" : 400
}

To Reproduce

Steps to reproduce the behavior (in OpenSearch Dashboards):

  1. Generate the "Sample web logs" data set through the "Add sample data" menu
  2. Create the following transform job (I'm using "Dev Tools" > "Console"):
    PUT _plugins/_transform/agregate_level_1
    {
      "transform": {
        "enabled": true,
        "continuous": false,
        "schedule": {
          "interval": {
            "period": 1,
            "unit": "Minutes",
            "start_time": 1602100553
          }
        },
        "description": "Agregate data on 5min buckets",
        "source_index": "opensearch_dashboards_sample_data_logs",
        "target_index": "transform_level_1",
        "data_selection_query": {
          "match_all": {}
        },
        "page_size": 1,
        "groups": [
          {
            "date_histogram": {
              "field": "@timestamp",
              "interval": "5m"
            }
          }
        ],
        "aggregations": {
          "data_transfer": {
            "sum": {
              "field": "bytes"
            }
          }
        }
      }
    }
  3. You get the following error in reply:
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "illegal_argument_exception",
            "reason" : "Invalid field [field] found in date histogram"
          }
        ],
        "type" : "illegal_argument_exception",
        "reason" : "Invalid field [field] found in date histogram"
      },
      "status" : 400
    }

    Note: the result is the same with "field": "timestamp" instead of "field": "@timestamp" (both fields exist and have the same type and value)

Expected behavior

The transform job is created and produces a new index with 5-minute buckets.

Plugins

opensearch-cluster-master-0 opensearch-alerting                  1.3.10.0
opensearch-cluster-master-0 opensearch-anomaly-detection         1.3.10.0
opensearch-cluster-master-0 opensearch-asynchronous-search       1.3.10.0
opensearch-cluster-master-0 opensearch-cross-cluster-replication 1.3.10.0
opensearch-cluster-master-0 opensearch-index-management          1.3.10.0
opensearch-cluster-master-0 opensearch-job-scheduler             1.3.10.0
opensearch-cluster-master-0 opensearch-knn                       1.3.10.0
opensearch-cluster-master-0 opensearch-ml                        1.3.10.0
opensearch-cluster-master-0 opensearch-observability             1.3.10.0
opensearch-cluster-master-0 opensearch-performance-analyzer      1.3.10.0
opensearch-cluster-master-0 opensearch-reports-scheduler         1.3.10.0
opensearch-cluster-master-0 opensearch-security                  1.3.10.0
opensearch-cluster-master-0 opensearch-sql                       1.3.10.0

Screenshots NC

Host/Environment:

Additional context

Eventually I would also like to group on more fields, if that is allowed, using the following groups:

    "groups": [
      {
        "date_histogram": {
          "field": "@timestamp",
          "interval": "5m"
        }
      },
      {
        "terms": {
          "source_field": "url.keyword",
          "target_field": "url"
        }
      },
      {
        "terms": {
          "source_field": "clientip",
          "target_field": "ip"
        }
      }
    ],

I can't upgrade much beyond 1.3.10 for now; I'm bound to it by my service provider... :disappointed:

Cobraeti commented 1 year ago

Hello, I also reproduced this behavior with the following chart versions (+ adding the corresponding appVersion for each):

Cobraeti commented 1 year ago

Hello, here are some new elements... I also checked through the transform job creation web UI (OpenSearch Dashboards > Index Management > Transform Jobs) for all the above versions, and here are my results:

I'm really surprised this feature is not supported, as it is supposed to be available according to the documentation of all those versions:

They all state the same:

| Option | Data Type | Description | Required |
| --- | --- | --- | --- |
| groups | Array | Specifies the grouping(s) to use in the transform job. Supported groups are terms, histogram, and date_histogram. For more information, see Bucket Aggregations. | Yes if not using aggregations. |

Cobraeti commented 1 year ago

Hello, it seems the issue lies more in the way the parameters are documented and/or the errors are reported: when all parameters are removed from date_histogram, a more useful error is shown:

PUT _plugins/_transform/agregate_level_1
{
  "transform": {
    "enabled": true,
    "continuous": false,
    "schedule": {
      "interval": {
        "period": 1,
        "unit": "Minutes",
        "start_time": 1602100553
      }
    },
    "description": "Agregate data on 5min buckets",
    "source_index": "opensearch_dashboards_sample_data_logs",
    "target_index": "transform_level_1",
    "data_selection_query": {
      "match_all": {}
    },
    "page_size": 1,
    "groups": [
      {
        "date_histogram": {}
      }
    ],
    "aggregations": {
      "data_transfer": {
        "sum": {
          "field": "bytes"
        }
      }
    }
  }
}

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Source field must not be null"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Source field must not be null"
  },
  "status" : 400
}

There seems to have been an attempt to document this, but because the line sits at the same level as groups and aggregations, it never reads (at least to me) as the expected replacement for the field parameter of terms, histogram, or date_histogram:

| Option | Data Type | Description | Required |
| --- | --- | --- | --- |
| source_field | String | The field(s) to transform. | Yes |

It might be more useful and obvious if it were described as an extra line in the groups description...

The target_field parameter and its behavior (if not specified, it equals source_field, from what I was able to test) should also be documented in the same place...
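
As a rough illustration of the observation above, here is a minimal Python sketch that builds entries for the groups array using source_field rather than field, with target_field falling back to source_field when omitted. The helper name `make_group` is hypothetical, the fallback mirrors behavior observed in testing rather than a documented guarantee, and the snippet only builds the JSON; it does not contact a cluster.

```python
import json


def make_group(kind, source_field, target_field=None, **options):
    """Build one entry for the transform "groups" array.

    Hypothetical helper: it uses "source_field" where the bucket
    aggregation DSL uses "field", as the "Source field must not be null"
    error suggests.
    """
    body = {"source_field": source_field}
    # Mirror the observed default: target_field equals source_field if unset.
    body["target_field"] = target_field if target_field is not None else source_field
    # Any extra keyword arguments are passed through unchanged.
    body.update(options)
    return {kind: body}


# The two terms groups from the "Additional context" section of this issue.
groups = [
    make_group("terms", "url.keyword", target_field="url"),
    make_group("terms", "clientip", target_field="ip"),
]
print(json.dumps(groups, indent=2))
```

Setting target_field explicitly keeps the generated request unambiguous even if the plugin's default changes between versions.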

Cobraeti commented 1 year ago

That said, the date_histogram grouping is still not really behaving as expected, as @timestamp is not allowed as source_field:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "status_exception",
        "reason" : "Cannot find field [@timestamp] that can be grouped as [date_histogram] in [opensearch_dashboards_sample_data_logs]."
      }
    ],
    "type" : "status_exception",
    "reason" : "Cannot find field [@timestamp] that can be grouped as [date_histogram] in [opensearch_dashboards_sample_data_logs]."
  },
  "status" : 400
}

And using the other date fields found in the sample web logs data (timestamp or utc_time) only leads to target fields recognized as numbers, so they are not available as a timestamp field when creating an index pattern for visualization in OpenSearch Dashboards, which is not expected...

The official Elasticsearch implementation is not only far more user-friendly (less alteration is required to build the transform job creation request), but it also produces the expected target_field type, which is obviously date when you aggregate dates... A date range might be another acceptable type, but not a number...
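
For anyone stuck with numeric bucket keys, here is a small Python sketch of how they could be turned back into dates on the client side. It assumes the numbers are epoch milliseconds marking the start of each bucket, which this thread does not confirm; the function names are hypothetical.

```python
from datetime import datetime, timezone

FIVE_MINUTES_MS = 5 * 60 * 1000


def bucket_start_ms(epoch_ms, interval_ms=FIVE_MINUTES_MS):
    """Floor a timestamp to the start of its fixed-width bucket."""
    return epoch_ms - (epoch_ms % interval_ms)


def to_iso(epoch_ms):
    """Render epoch milliseconds (assumed unit) as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).isoformat()


# start_time from the request above, converted from seconds to milliseconds.
sample_ms = 1602100553 * 1000
print(to_iso(bucket_start_ms(sample_ms)))
```

This only works around the display problem; it does not change the mapping of the target index, so the field would still not be selectable as a time field in an index pattern.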

Cobraeti commented 1 year ago

Hello, any news about date_histogram not allowing @timestamp as the source field? To me, this remains a bug in date_histogram when used in transform jobs...

The documentation issue was the first wall I hit when trying to use it, but the key issue here is that date_histogram cannot use a date field as its source... Kind of sad given the name :sweat_smile:

kvitali commented 7 months ago

This is the supported format for date_histogram in a transform job:

"date_histogram": { "source_field": "timestamp", "calendar_interval": "minute" }
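
Applied to the request from the original report, that format could look like the Python sketch below, which only builds and prints the JSON body for PUT _plugins/_transform/<id>. The function name `build_transform_body` is hypothetical. The default uses the calendar_interval form given above; the "fixed_interval": "5m" variant for 5-minute buckets is an assumption based on the date histogram aggregation and was not verified in this thread.

```python
import json


def build_transform_body(interval_key="calendar_interval", interval_value="minute"):
    """Build a transform request body using source_field in the groups."""
    date_group = {
        "date_histogram": {
            "source_field": "timestamp",
            interval_key: interval_value,
        }
    }
    return {
        "transform": {
            "enabled": True,
            "continuous": False,
            "schedule": {
                "interval": {"period": 1, "unit": "Minutes", "start_time": 1602100553}
            },
            "description": "Aggregate web logs by time bucket",
            "source_index": "opensearch_dashboards_sample_data_logs",
            "target_index": "transform_level_1",
            "data_selection_query": {"match_all": {}},
            "page_size": 1,
            "groups": [
                date_group,
                {"terms": {"source_field": "url.keyword", "target_field": "url"}},
                {"terms": {"source_field": "clientip", "target_field": "ip"}},
            ],
            "aggregations": {"data_transfer": {"sum": {"field": "bytes"}}},
        }
    }


# Form given in the comment above.
print(json.dumps(build_transform_body(), indent=2))

# Assumption, not confirmed here: a fixed 5-minute interval.
five_minute_body = build_transform_body("fixed_interval", "5m")
```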

Cobraeti commented 7 months ago

Hello @kvitali, thanks for pointing out the correct format, though I won't be able to confirm that it works, as we have had to move to Elasticsearch since then... I guess this should be either: