opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices to remote data stores.
Apache License 2.0

[FEATURE] Efficient storage of high cardinality data in materialized view #765

Open dai-chen opened 4 days ago

dai-chen commented 4 days ago

Is your feature request related to a problem?

When dealing with high cardinality data, especially columns like source and destination IPs in VPC flow logs, materialized views (MVs) can become excessively large, resulting in both slow query performance and unnecessary storage overhead. I’m looking for solutions to reduce the storage footprint and optimize query performance for these high-cardinality fields, particularly for dashboard visualizations.

What solution would you like?

To efficiently handle high cardinality data in materialized views, the following approach can be implemented:

What alternatives have you considered?

Here are several alternative methods for optimizing storage and performance in materialized views:

Do you have any additional context?

(I) VPC Flow Logs Example

Take the VPC flow logs dataset as an example: high cardinality fields like source and destination IP pairs create significant storage challenges when using materialized views. Consider the following materialized view at the terabyte (TB) scale. After grouping, each 1-minute window can result in hundreds of millions of rows.

CREATE MATERIALIZED VIEW vpc_flow_log_mv
AS
SELECT
  window.start AS startTime,
  activity_name,
  src_endpoint.ip AS src_ip,
  dst_endpoint.ip AS dst_ip,
  COUNT(*) AS total_count,
  SUM(traffic.bytes) AS total_bytes,
  SUM(traffic.packets) AS total_packets
FROM vpc_flow_logs
GROUP BY
  TUMBLE(eventTime, '1 Day'),
  activity_name,
  src_endpoint.ip,
  dst_endpoint.ip

The proposed approaches aim to address this issue by significantly reducing the amount of data stored. For example, storing only the Top 100 per group, either by filtering cells in a cube or by using an approximate Top K function, keeps the storage bounded regardless of the original size after grouping.
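
As a rough illustration of the bounded-storage idea (not the proposed implementation), the exact "Top 100 per group" variant can be written with the Spark DataFrame API. Table and column names follow the MV example above; ranking by total_count is an assumption:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("bounded-topk-sketch").getOrCreate()

// Aggregate as in the MV above: one row per (window, activity, src_ip, dst_ip).
val aggregated = spark.table("vpc_flow_logs")
  .groupBy(
    window(col("eventTime"), "1 day"),
    col("activity_name"),
    col("src_endpoint.ip").as("src_ip"),
    col("dst_endpoint.ip").as("dst_ip"))
  .agg(
    count(lit(1)).as("total_count"),
    sum("traffic.bytes").as("total_bytes"),
    sum("traffic.packets").as("total_packets"))

// Keep only the 100 heaviest IP pairs per (window, activity) group, so the stored
// output is bounded at 100 rows per group regardless of IP cardinality.
val perGroup = Window.partitionBy(col("window"), col("activity_name"))
  .orderBy(col("total_count").desc)

val topPairs = aggregated
  .withColumn("rn", row_number().over(perGroup))
  .filter(col("rn") <= 100)
  .drop("rn")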

(II) Optimizations in OpenSearch Index

While the current focus is on optimizing the materialized view (MV) output size, there are additional optimizations that can be applied within the OpenSearch index to further reduce storage size:

The last item can be configured via the index_settings option, while the first two will become configurable once support for https://github.com/opensearch-project/opensearch-spark/issues/772 is implemented.
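
For reference, a minimal sketch of passing storage-oriented OpenSearch settings through the index_settings option when the MV is created (run from a Spark session with the Flint extensions enabled). The codec choice index.codec: best_compression is an assumed example of such a setting, and the query body is shortened from the example above:

// Sketch only: index_settings takes OpenSearch index settings as a JSON string;
// best_compression trades some indexing CPU for smaller segments on disk.
spark.sql("""
  CREATE MATERIALIZED VIEW vpc_flow_log_mv
  AS
  SELECT
    window.start AS startTime,
    activity_name,
    COUNT(*) AS total_count
  FROM vpc_flow_logs
  GROUP BY
    TUMBLE(eventTime, '1 Day'),
    activity_name
  WITH (
    index_settings = '{"index.codec": "best_compression"}'
  )
""")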

dai-chen commented 3 days ago

Proof of Concept: Approximate Aggregation Approach

Goals

Verify the feasibility of the Approximate Aggregation MV approach and evaluate its impact on storage, performance and cost, specifically including:

Design

Syntax:

CREATE MATERIALIZED VIEW vpc_flow_log_mv
AS
SELECT
  window.start AS startTime,
  activity_name AS activity,
  APPROX_TOP_COUNT(src_endpoint.ip, 100) AS top_k_src_ip_by_count,
  APPROX_TOP_COUNT(dst_endpoint.ip, 100) AS top_k_dst_ip_by_count,
  APPROX_TOP_SUM(src_endpoint.ip, 100) AS top_k_src_ip_by_sum,
  APPROX_TOP_SUM(dst_endpoint.ip, 100) AS top_k_dst_ip_by_sum,
  APPROX_TOP_COUNT(ARRAY(src_endpoint.ip, dst_endpoint.ip), 100) AS top_k_src_dst_ip_by_count,
  COUNT(*) AS total_count,
  SUM(traffic.bytes) AS total_bytes,
  SUM(traffic.packets) AS total_packets
FROM vpc_flow_logs
GROUP BY
  TUMBLE(eventTime, '1 Day'),
  activity_name

Materialized view data:

    "_source": {
          "startTime": "2024-10-01 12:00:00",
          "activity": "Traffic",
          "top_k_src_ip_by_count": [
            {
              "ip": "192.168.0.100",
              "count": 23205
            },
           "top_k_dst_ip_by_count": [
            {
              "ip": "127.0.01",
              "count": 238
            }
            ...
          ]
        },

OpenSearch DSL query:

POST /vpc_flow_log_approx_mv/_search
{
  "size": 0,
  "aggs": {
    "top_ips": {
      "nested": {
        "path": "top_k_src_ip_by_count"
      },
      "aggs": {
        "ip_buckets": {
          "terms": {
            "field": "top_k_src_ip_by_count.ip",
            "size": 100,
            "order": {
              "total_count": "desc"
            }
          },
          "aggs": {
            "total_count": {
              "sum": {
                "field": "top_k_src_ip_by_count.count"
              }
            }
          }
        }
      }
    }
  }
}

Implementation Tasks

  1. Implement approx_top_count function: Create a function to compute approximate top K counts for high cardinality fields (one possible counting technique is sketched after this list).
  2. Implement approx_top_sum function: Develop a similar function for approximate top K sum calculations.
  3. Support nested fields in MV output: Ensure the materialized view (MV) can output nested fields to store approximate aggregation results.
  4. Create a dashboard on MV data: Build a dashboard for visualizing the results from the MV, using approximate aggregation for top K values.
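
For task 1, one possible counting technique is sketched below (a hedged illustration, not the project's chosen design): a Space-Saving style counter that tracks at most a fixed number of keys, so memory and MV output stay bounded no matter how many distinct IPs appear. The class name and capacity parameter are illustrative, and wiring this into a Spark UDAF/Aggregator is omitted.

import scala.collection.mutable

// Bounded frequency counter: never holds more than `capacity` keys.
final class SpaceSavingCounter(capacity: Int) {
  private val counts = mutable.Map.empty[String, Long]

  def add(key: String): Unit = {
    if (counts.contains(key) || counts.size < capacity) {
      counts(key) = counts.getOrElse(key, 0L) + 1L
    } else {
      // Evict the smallest counter and let the new key inherit its count;
      // this is what keeps the estimation error bounded.
      val (minKey, minCount) = counts.minBy { case (_, c) => c }
      counts -= minKey
      counts(key) = minCount + 1L
    }
  }

  // Approximate top K keys by estimated count, descending.
  def topK(k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (_, c) => -c }.take(k)
}

// Tiny usage example with made-up IPs.
object SpaceSavingDemo extends App {
  val counter = new SpaceSavingCounter(capacity = 100)
  Seq("10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.1").foreach(counter.add)
  println(counter.topK(2)) // e.g. List((10.0.0.1,3), (10.0.0.2,1)); ties may appear in either order
}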

Testing Tasks

  1. Storage comparison: Compare the actual storage usage in the OpenSearch index for aggregate MV and approximate aggregate MV.
  2. Performance benchmark: Measure and compare the performance of building and querying aggregate MV and approximate aggregate MV.
  3. EMR-S cost evaluation: Evaluate the overall cost on EMR-S for building aggregate MV versus approximate aggregate MV.