opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.
Apache License 2.0

[BUG] Skipping index needs to support Array based table #355

Open YANG-DB opened 4 months ago

YANG-DB commented 4 months ago

What is the bug? Creating a skipping index on a table with an array-based data structure fails:

Data Definition

[
  {
    "@timestamp": "2023-07-17T08:14:05.000Z",
    "event": {
      "result": "ACCEPT",
      "name": "cloud_trail",
      "domain": "cloudtrail"
    },
    "attributes": {
      "data_stream": {
        "dataset": "cloudtrail_log",
        "namespace": "production",
        "type": "cloud_trail_logs"
      }
    },
    "cloud": {
      "provider": "aws",
      "account": {
        "id": "111111111111"
      },
      "region": "ap-southeast-2",
      "resource_id": "vpc-0d4d4e82b7d743527",
      "platform": "aws_vpc"
    },
    "aws": {
   ....

Table Definition

CREATE EXTERNAL TABLE IF NOT EXISTS  {table_name} (
  Records ARRAY<STRUCT<
    eventVersion STRING,
    userIdentity STRUCT<
      type:STRING,
      principalId:STRING,
      arn:STRING,
      ....

Skipping Index Definition

CREATE SKIPPING INDEX ON {table_name} (
    `Records.userIdentity.principalId` BLOOM_FILTER,
    `Records.userIdentity.accountId` BLOOM_FILTER,
    `Records.userIdentity.userName` BLOOM_FILTER,
    `Records.sourceIPAddress` BLOOM_FILTER,
    `Records.eventId` BLOOM_FILTER,
    `Records.userIdentity.type` VALUE_SET,
    `Records.eventName` VALUE_SET,
    `Records.eventType` VALUE_SET,
    `Records.awsRegion` VALUE_SET
) WITH (
  ...
)

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a table as shown above
  2. Create a skipping index as shown above
  3. Spark returns an error

What is the expected behavior? A skipping index should work on top of such array-based tables, possibly by utilizing LATERAL VIEW explode({Array Field}) in some way.
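To illustrate the idea, here is a minimal plain-Python sketch of what "explode, then index" could mean for a VALUE_SET column. It is not Flint's implementation and does not use Spark. The file names, sample records, and the helpers `explode`, `build_value_sets`, and `files_to_scan` are all hypothetical, chosen only to model the behavior: each array element is flattened the way `LATERAL VIEW explode(Records)` would flatten it, and the distinct values of a nested field are collected per source file.

```python
# Hypothetical sample data: each source file holds rows whose "Records"
# column is an array of structs, as in the table definition above.
files = {
    "part-0.json": [
        {"Records": [
            {"eventName": "GetObject", "awsRegion": "us-east-1"},
            {"eventName": "PutObject", "awsRegion": "us-east-1"},
        ]},
    ],
    "part-1.json": [
        {"Records": [
            {"eventName": "DeleteBucket", "awsRegion": "ap-southeast-2"},
        ]},
        {"Records": None},  # rows with a null or empty array yield nothing
    ],
}


def explode(rows, array_field):
    """Yield one element per array entry, similar to LATERAL VIEW explode()."""
    for row in rows:
        for element in row.get(array_field) or []:
            yield element


def build_value_sets(files, array_field, nested_field):
    """Collect the distinct values of a nested field for each source file."""
    index = {}
    for path, rows in files.items():
        index[path] = {
            element[nested_field]
            for element in explode(rows, array_field)
            if nested_field in element
        }
    return index


def files_to_scan(index, value):
    """Return only the files whose value set may contain the given value."""
    return sorted(path for path, values in index.items() if value in values)


index = build_value_sets(files, "Records", "eventName")
print(files_to_scan(index, "PutObject"))
```

In this sketch a query filtering on `eventName = 'PutObject'` would only need to read `part-0.json`. A BLOOM_FILTER column could be modeled the same way by feeding the exploded values into a Bloom filter instead of a set.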

dai-chen commented 4 months ago

I assume this is for Flint skipping index? I modified the title to avoid confusion.