opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Accelerating Data Lake Analytics with Flint #367

Open dai-chen opened 3 months ago

dai-chen commented 3 months ago

Is your feature request related to a problem?

Data lake table formats like Delta, Iceberg, and Hudi leverage Parquet files as the underlying storage and maintain built-in table statistics. While these built-in mechanisms are often sufficient for meeting query performance requirements, there are scenarios where these table formats have functional and performance limitations:

  1. Full Text Search

    • Description: Perform advanced full-text search operations on text-based fields, such as log messages or document content.
    • Example: SELECT * FROM cloudtrail_logs WHERE requestparameters LIKE '%DeleteBucket%'. This query runs a substring search on the requestparameters column to find events related to deleting S3 buckets. A LIKE predicate with a leading wildcard cannot be pruned by partition or min/max statistics, so it forces a scan of the column, which is inefficient for large datasets or richer text-search requirements (see the first sketch after this list).
  2. Complex Filtering Conditions

    • Description: Efficiently filter data based on complex conditions involving nested fields, ranges, or combinations of multiple fields.
    • Example: SELECT * FROM vpc_flow_logs WHERE dstaddr LIKE '10.0.%' AND dstport = 80 AND protocol = 6. This query filters on the destination IP address, destination port, and protocol fields, which may not be efficient if the data is not well-partitioned or indexed.
  3. Advanced Field Types

    • Description: Support for indexing and querying advanced data types, which are not natively supported in data lake table formats.
    • Example: IP addresses, geospatial data, vector data, etc.
  4. Real-time Data Analysis

    • Description: Enable low-latency indexing and querying of data for real-time or near-real-time analytics use cases.
    • Example: SELECT COUNT(*) FROM alb_logs WHERE time > CURRENT_TIMESTAMP() - INTERVAL 1 MINUTE. This query counts the ALB log events from the last minute. However, it may not return up-to-date results if newly ingested data is not yet visible in the table (see the second sketch after this list).
  5. Unstructured Data Analysis

    • Description: Efficiently index and query semi-structured or unstructured data, such as JSON or XML documents, which may not be optimally handled by data lake table formats designed for structured data.
    • Example: SELECT json_extract(requestheaders, '$.User-Agent') AS user_agent FROM waf_logs. This query extracts the User-Agent value from the requestheaders JSON column. However, data lake table formats and Parquet are primarily designed for structured data, so querying semi-structured payloads like this may not be efficient.
  6. Table Statistics are Big Data

    • Description: Handle scenarios where the table statistics or metadata themselves become large, making it inefficient to rely solely on the built-in mechanisms of data lake table formats.
    • Example: Consider a data lake table storing billions of rows of log data. As the data volume grows, the table statistics and metadata themselves can become very large. In such scenarios, Flint can maintain an index over these large table statistics asynchronously and "cache" them in OpenSearch (see the third sketch after this list).
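
To make the full-text-search and complex-filtering scenarios (items 1 and 2) more concrete, here is a minimal sketch of a Flint covering index. The index name `cloudtrail_delete_events` is illustrative only; the general `CREATE INDEX ... ON ... WITH (...)` shape follows the covering-index DDL in the Flint docs, but exact option names and values may differ between versions.

```sql
-- Illustrative sketch (not the proposed design): copy the columns used by the
-- CloudTrail query into an OpenSearch-backed covering index. Once the text field
-- is indexed in OpenSearch, substring/full-text predicates no longer require a
-- full scan of the Parquet files.
CREATE INDEX cloudtrail_delete_events
ON cloudtrail_logs (eventtime, eventname, requestparameters)
WITH (
  auto_refresh = true  -- assumed option: keep the index in sync with the source table
);
```

The same pattern would apply to the VPC Flow Logs filter in item 2: covering dstaddr, dstport, and protocol lets the selective predicate be evaluated against the OpenSearch index rather than the raw table.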
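For the near-real-time case (item 4), a Flint materialized view with auto refresh is one way to keep a pre-aggregated copy of the data in OpenSearch. The view name, the tumbling-window helper, and the option names below are assumptions for illustration, loosely based on the materialized-view DDL described in the Flint docs.

```sql
-- Illustrative sketch: maintain per-minute ALB request counts in OpenSearch so the
-- "last minute" query reads a continuously refreshed index instead of scanning alb_logs.
CREATE MATERIALIZED VIEW alb_requests_per_minute
AS
  SELECT
    window.start AS minute,
    COUNT(*)     AS request_count
  FROM alb_logs
  GROUP BY TUMBLE(time, '1 Minute')
WITH (
  auto_refresh = true  -- assumed option: incrementally refresh as new log rows arrive
);
```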
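For item 6, a Flint skipping index is essentially the "cached table statistics" idea: it stores lightweight per-file metadata (partition values, value sets, min/max ranges) in OpenSearch and refreshes it asynchronously. The column list below is hypothetical; year and month are assumed partition columns of vpc_flow_logs.

```sql
-- Illustrative sketch: offload file-level statistics to OpenSearch so query planning
-- can prune data files without reading the table format's own metadata at scale.
CREATE SKIPPING INDEX ON vpc_flow_logs (
  year    PARTITION,   -- assumed partition column
  month   PARTITION,   -- assumed partition column
  dstport MIN_MAX,     -- numeric range per file
  dstaddr VALUE_SET    -- distinct values per file
)
WITH (
  auto_refresh = true
);
```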

What solution would you like? [TBD]

The following approaches are not mutually exclusive; they can be combined to address the problems listed above:

What alternatives have you considered?

N/A

Do you have any additional context?

Apart from addressing the limitations and challenges faced by data lake table formats, integrating with Flint (OpenSearch) can unlock additional benefits and capabilities: