Open dai-chen opened 3 months ago
Create Spark table for CloudTrail logs. Ref: https://github.com/opensearch-project/dashboards-observability/blob/main/server/adaptors/integrations/__data__/repository/aws_cloudtrail/assets/create_table_cloud-trail-1.0.0.sql
```sql
CREATE TABLE cloudtrail_logs (
  eventversion STRING,
  useridentity STRUCT<
    type: STRING,
    principalid: STRING,
    arn: STRING,
    ...
  >,
  requestparameters STRING,
  ...
)
USING json
PARTITIONED BY (region, year, month, day)
LOCATION 's3://DOC-EXAMPLE-BUCKET/AWSLogs/Account_ID/CloudTrail/';
```
Create Flint secondary index:
```sql
CREATE INDEX request_params ON cloudtrail_logs (
  CAST(requestparameters AS STRUCT)  -- index the entire JSON instead of a single TEXT field
)
WHERE eventtime BETWEEN CURRENT_TIMESTAMP - INTERVAL '30' DAY AND CURRENT_TIMESTAMP;  -- partial indexing: last 30 days of logs

-- Use an OpenSearch table behind the scenes to solve capacity and read performance issues
-- [TBD]
CREATE TABLE flint_cloudtrail_logs_request_params (
  requestparameters TYPE  -- a) type is JSON? b) disable _source/docValues, only the inverted index is needed
)
USING opensearch
PARTITIONED BY (region, year, month, day)
LOCATION 'http://...';
```
Query CloudTrail logs:
```sql
SELECT eventname, sourceipaddress, useridentity.arn
FROM cloudtrail_logs
WHERE eventtime BETWEEN CURRENT_TIMESTAMP - INTERVAL '7' DAY AND CURRENT_TIMESTAMP  -- partition pruning
  AND eventname = 'CreatePolicy'
  AND JSON_EXTRACT(requestparameters, '$.policyName') = 'example-policy';  -- Flint secondary index kicks in
```
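Without an index on `requestparameters`, every row that survives partition pruning must be deserialized to evaluate the `JSON_EXTRACT` predicate. A minimal Python sketch of that per-row cost (sample rows and ARNs are made up; `json_extract` is a rough stand-in for the SQL function):

```python
import json

# Hypothetical sample rows; in the real table requestparameters is a raw JSON STRING.
rows = [
    {"eventname": "CreatePolicy", "sourceipaddress": "203.0.113.5",
     "useridentity": {"arn": "arn:aws:iam::123456789012:user/alice"},
     "requestparameters": '{"policyName": "example-policy"}'},
    {"eventname": "DeletePolicy", "sourceipaddress": "203.0.113.9",
     "useridentity": {"arn": "arn:aws:iam::123456789012:user/bob"},
     "requestparameters": '{"policyName": "other-policy"}'},
]

def json_extract(doc, key):
    """Rough stand-in for SQL JSON_EXTRACT(doc, '$.key')."""
    try:
        return json.loads(doc).get(key)
    except (json.JSONDecodeError, AttributeError):
        return None

# Without a secondary index, each candidate row is parsed to evaluate the predicate.
matches = [
    (r["eventname"], r["sourceipaddress"], r["useridentity"]["arn"])
    for r in rows
    if r["eventname"] == "CreatePolicy"
    and json_extract(r["requestparameters"], "policyName") == "example-policy"
]
print(matches)
```

The proposed index would answer the `policyName` predicate from the inverted index instead of parsing each row's JSON string at query time.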
Is your feature request related to a problem?
Currently, Flint only supports a covering index for Spark SQL queries, which requires every column referenced in a query to be present in the covering index. Moreover, covering index rows do not carry a direct reference back to the source table rows. This approach works well for users who need the full search and dashboard capabilities of OpenSearch.
However, for users who primarily want to accelerate their queries, covering indexes can be inefficient: many columns must be ingested into the index, which slows ingestion and consumes excessive space. This is particularly problematic for large datasets, or when users index all columns because they cannot predict future queries.
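To illustrate the space difference (a toy sketch, not Flint internals; row shapes and IDs are invented): a covering index duplicates every queried column per row, while a row-ID secondary index stores only the indexed value and a list of row IDs.

```python
# Source table keyed by a unique row ID; payload columns are wide.
source_rows = {
    f"row-{i}": {"eventname": "CreatePolicy" if i % 2 else "DeletePolicy",
                 "requestparameters": '{"k":"v"}' * 10,  # wide payload column
                 "useridentity": {"arn": f"arn:aws:iam::1234:user/u{i}"}}
    for i in range(4)
}

# Covering index: copies every queried column into the index (no back-reference).
covering_index = [dict(row) for row in source_rows.values()]

# Secondary index: indexed value -> row IDs only; rows are later fetched from the source.
secondary_index = {}
for row_id, row in source_rows.items():
    secondary_index.setdefault(row["eventname"], []).append(row_id)

print(secondary_index["CreatePolicy"])  # ['row-1', 'row-3']
```

The covering index grows with the width of every indexed column, while the secondary index grows only with the cardinality of the indexed value and the row-ID list.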
What solution would you like?
Enhance the covering index to function as a generic secondary index that maintains a unique row ID for rows in the source table. The new behavior of the index should be as follows:
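The intended query flow under this proposal could be sketched as follows (assumed semantics, not Flint's implementation: table contents, `lookup`, and the row-ID scheme are all hypothetical): the engine first probes the index for matching row IDs, then fetches only those rows from the source table.

```python
# Source table keyed by a unique row ID (hypothetical scheme).
source_table = {
    "row-1": {"eventname": "CreatePolicy", "policy": "example-policy"},
    "row-2": {"eventname": "CreatePolicy", "policy": "other-policy"},
    "row-3": {"eventname": "DeletePolicy", "policy": "example-policy"},
}

# Index only the predicate column; values map to row IDs, not full rows.
index = {}
for row_id, row in source_table.items():
    index.setdefault(row["policy"], set()).add(row_id)

def lookup(policy_name):
    # 1) probe the index for candidate row IDs; 2) fetch rows from the source table.
    return [source_table[rid] for rid in sorted(index.get(policy_name, set()))]

print(lookup("example-policy"))
```

Because the index stores only row IDs, any column of the matched rows can be returned, without requiring that column to have been indexed.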
Key Benefits and Impacts
The proposed enhancement to Flint's covering index offers the following key benefits:
What alternatives have you considered?
N/A
Do you have any additional context?
Here is an example of how this enhancement could be used: