opensearch-project / opensearch-spark

Spark Accelerator framework ; It enables secondary indices to remote data stores.
Apache License 2.0
18 stars 28 forks source link

[BUG] Auto-refresh materialized view / covering index doesn't ingest all the data within the specified timeframe #789

Open A-Gray-Cat opened 1 day ago

A-Gray-Cat commented 1 day ago

What is the bug? When I was creating an auto-refresh materialized view, the statement I used included a where clause that looks for all the logs in the past '1' day. However, I noticed that each refresh didn't ingest logs all the way back within the specified timeframe, and only partial logs were ingested.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Create a materialized view / covering index and set auto_refresh=true using a statement similar to this:
CREATE MATERIALIZED VIEW testing_auto_mv AS
    SELECT time_dt,
        actor,
        accountid,
        region,
        src_endpoint,
        api,
        http_request,
        is_mfa,
        class_uid,
        resources
     FROM securitylake.amazon_security_lake_glue_db_us_east_1.amazon_security_lake_table_us_east_1_cloud_trail_mgmt_2_0
     WHERE time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' DAY AND CURRENT_TIMESTAMP
        AND accountid in ('{account id}')
        AND region = 'us-east-1' 
WITH ( auto_refresh = true, refresh_interval = '15 Minute', checkpoint_location = 's3://{bucket name}/AWSLogs/checkpoint_test')
  1. Create a materialized view and set auto_refresh=false using the same statement:

    CREATE MATERIALIZED VIEW testing_manual_mv AS
    SELECT time_dt,
        actor,
        accountid,
        region,
        src_endpoint,
        api,
        http_request,
        is_mfa,
        class_uid,
        resources
     FROM securitylake.amazon_security_lake_glue_db_us_east_1.amazon_security_lake_table_us_east_1_cloud_trail_mgmt_2_0
     WHERE time_dt BETWEEN CURRENT_TIMESTAMP - INTERVAL '1' DAY AND CURRENT_TIMESTAMP
        AND accountid in ('{account id}')
        AND region = 'us-east-1' 
    WITH ( auto_refresh = false)

    And run REFRESH testing_manual_mv

  2. Run the following SQL query against both MVs you just created, and compare the earliest timestamp:

    SELECT time_dt FROM flint_index ORDER BY time_dt LIMIT 1

    You can also check the difference of number of documents that are ingested in both MVs.

What is the expected behavior? The auto-refresh MV should ingest all the logs within the specified timeframe

What is your host/environment?

Do you have any screenshots? If applicable, add screenshots to help explain your problem.

Do you have any additional context? Add any other context about the problem.

dai-chen commented 1 day ago

Could you help quick check the oldest time_dt ingested in MV data? Just wonder if any time zone issue here.

A-Gray-Cat commented 1 day ago

@dai-chen Yes. The MV was created around 12:09PM PDT, and the oldest time_dt value in the MV is 2024-10-17 13:07:31 in UTC I would assume.