opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Accelerating Data Lake Analytics with Flint #367

Open dai-chen opened 3 months ago

dai-chen commented 3 months ago

Is your feature request related to a problem?

Data lake table formats like Delta, Iceberg, and Hudi leverage Parquet files as the underlying storage and maintain built-in table statistics. While these built-in mechanisms are often sufficient for meeting query performance requirements, there are scenarios where these table formats have functional and performance limitations:

  1. Full Text Search

    • Description: Perform advanced full-text search operations on text-based fields, such as log messages or document content.
    • Example: SELECT * FROM cloudtrail_logs WHERE requestparameters LIKE '%DeleteBucket%'. This query runs a substring search on the requestparameters column to find events related to deleting S3 buckets. A LIKE predicate with a leading wildcard cannot be pruned by partition or min/max statistics, so it forces a scan of the column, which is inefficient for large datasets or richer text-search requirements (see the first sketch after this list).
  2. Complex Filtering Conditions

    • Description: Efficiently filter data based on complex conditions involving nested fields, ranges, or combinations of multiple fields.
    • Example: SELECT * FROM vpc_flow_logs WHERE dstaddr LIKE '10.0.%' AND dstport = 80 AND protocol = 6. This query filters on the destination IP address, destination port, and protocol fields, which may not be efficient if the data is not well-partitioned or indexed.
  3. Advanced Field Types

    • Description: Support for indexing and querying advanced data types, which are not natively supported in data lake table formats.
    • Example: IP addresses, geospatial data, vector data, etc.
  4. Real-time Data Analysis

    • Description: Enable low-latency indexing and querying of data for real-time or near-real-time analytics use cases.
    • Example: SELECT COUNT(*) FROM alb_logs WHERE time > CURRENT_TIMESTAMP() - INTERVAL 1 MINUTE. This query counts the ALB log events from the last minute. However, it may not return up-to-date results if newly ingested data is not yet visible in the table (see the second sketch after this list).
  5. Unstructured Data Analysis

    • Description: Efficiently index and query semi-structured or unstructured data, such as JSON or XML documents, which may not be optimally handled by data lake table formats designed for structured data.
    • Example: SELECT json_extract(requestheaders, '$.User-Agent') AS user_agent FROM waf_logs. This query extracts the User-Agent value from the requestheaders JSON column. However, data lake table formats and Parquet are primarily designed for structured data, so querying semi-structured payloads like this may not be efficient.
  6. Table Statistics are Big Data

    • Description: Handle scenarios where the table statistics or metadata themselves become large, making it inefficient to rely solely on the built-in mechanisms of data lake table formats.
    • Example: Consider a data lake table storing billions of rows of log data. As the data volume grows, the table statistics and metadata themselves can become very large. In such scenarios, Flint can maintain an index over these large table statistics asynchronously and "cache" them in OpenSearch (see the third sketch after this list).
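
To make the full-text-search and complex-filtering scenarios (items 1 and 2) more concrete, here is a minimal sketch of a Flint covering index. The index name `cloudtrail_delete_events` is illustrative only; the general `CREATE INDEX ... ON ... WITH (...)` shape follows the covering-index DDL in the Flint docs, but exact option names and values may differ between versions.

```sql
-- Illustrative sketch (not the proposed design): copy the columns used by the
-- CloudTrail query into an OpenSearch-backed covering index. Once the text field
-- is indexed in OpenSearch, substring/full-text predicates no longer require a
-- full scan of the Parquet files.
CREATE INDEX cloudtrail_delete_events
ON cloudtrail_logs (eventtime, eventname, requestparameters)
WITH (
  auto_refresh = true  -- assumed option: keep the index in sync with the source table
);
```

The same pattern would apply to the VPC Flow Logs filter in item 2: covering dstaddr, dstport, and protocol lets the selective predicate be evaluated against the OpenSearch index rather than the raw table.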
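For the near-real-time case (item 4), a Flint materialized view with auto refresh is one way to keep a pre-aggregated copy of the data in OpenSearch. The view name, the tumbling-window helper, and the option names below are assumptions for illustration, loosely based on the materialized-view DDL described in the Flint docs.

```sql
-- Illustrative sketch: maintain per-minute ALB request counts in OpenSearch so the
-- "last minute" query reads a continuously refreshed index instead of scanning alb_logs.
CREATE MATERIALIZED VIEW alb_requests_per_minute
AS
  SELECT
    window.start AS minute,
    COUNT(*)     AS request_count
  FROM alb_logs
  GROUP BY TUMBLE(time, '1 Minute')
WITH (
  auto_refresh = true  -- assumed option: incrementally refresh as new log rows arrive
);
```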
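For item 6, a Flint skipping index is essentially the "cached table statistics" idea: it stores lightweight per-file metadata (partition values, value sets, min/max ranges) in OpenSearch and refreshes it asynchronously. The column list below is hypothetical; year and month are assumed partition columns of vpc_flow_logs.

```sql
-- Illustrative sketch: offload file-level statistics to OpenSearch so query planning
-- can prune data files without reading the table format's own metadata at scale.
CREATE SKIPPING INDEX ON vpc_flow_logs (
  year    PARTITION,   -- assumed partition column
  month   PARTITION,   -- assumed partition column
  dstport MIN_MAX,     -- numeric range per file
  dstaddr VALUE_SET    -- distinct values per file
)
WITH (
  auto_refresh = true
);
```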

What solution would you like? [TBD]

The following approaches are not mutually exclusive; they can be combined to address the problems listed above:

What alternatives have you considered?

N/A

Do you have any additional context?

Apart from addressing the limitations and challenges faced by data lake table formats, integrating with Flint (OpenSearch) can unlock additional benefits and capabilities: