This PoC aims to evaluate the integration of Flint's core functionalities, including skipping index, covering index and materialized views, with Apache Iceberg tables.
The key questions to answer:
This query demonstrates finding the top IP address pairs (source -> target) that have had their connections rejected in the past hour:
```sql
-- Identify the top IP address pairs with rejected connections in the last hour
SELECT
  src_endpoint.ip || '->' || dst_endpoint.ip AS ip_pair,
  action,
  COUNT(*) AS count
FROM vpc_flow_logs
WHERE action = 'REJECT'
  AND time_dt > (current_timestamp - interval '1' hour)
GROUP BY 1, 2
ORDER BY count DESC
LIMIT 25;
```
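As a sketch of how a Flint materialized view might pre-aggregate this workload: the view name, window size, and `watermark_delay` below are illustrative assumptions, not part of the PoC. Flint's streaming refresh of aggregations requires a windowed (`TUMBLE`) group-by on the event time column.

```sql
-- Hypothetical materialized view pre-aggregating rejected connections;
-- names and options are illustrative assumptions.
CREATE MATERIALIZED VIEW glue.iceberg.rejected_ip_pairs
AS
SELECT
  window.start AS window_start,
  src_endpoint.ip || '->' || dst_endpoint.ip AS ip_pair,
  action,
  COUNT(*) AS count
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY
  TUMBLE(time_dt, '10 Minutes'),
  src_endpoint.ip,
  dst_endpoint.ip,
  action
WITH (
  auto_refresh = true,
  watermark_delay = '1 Minute'
);
```

Dashboards could then read the pre-computed counts instead of re-scanning the raw flow logs each hour.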
This part of the PoC explores the implementation of advanced search capabilities in Flint, integrated with Iceberg tables. Taking full-text search as an example, below is a demonstration query:
```sql
-- Identify the number of HTTP status occurrences with requests containing 'Chrome' in the past hour
SELECT
  status,
  COUNT(*) AS count
FROM http_logs
WHERE MATCH(request, 'Chrome')
  AND timestamp > (current_timestamp - interval '1' hour)
GROUP BY status;
```
Changes required:
Is skipping index helpful in this case?
Depending on latency requirements, it is possible to build a skipping index such as:
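For example, a skipping index over the filter columns in the query above might look like the following. This is a sketch only; the choice of skip types (`VALUE_SET` for the low-cardinality `action` column, `MIN_MAX` for the timestamp) is an assumption.

```sql
-- Sketch: skipping index on the columns filtered in the query above.
-- VALUE_SET suits the low-cardinality action column; MIN_MAX suits time_dt.
CREATE SKIPPING INDEX ON vpc_flow_logs
(
  action VALUE_SET,
  time_dt MIN_MAX
)
WITH (
  auto_refresh = true
);
```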
Alternative to current enhanced covering index solution?
Covering indexes provide full search and dashboard capabilities, but they require indexing all involved columns. An alternative is a non-covering (filtering) index, which indexes only the columns used in filters, with each index entry pointing back to the row ID in the source table.
To implement this, we need to answer the following questions:
The following SQL examples illustrate how Flint can be leveraged to accelerate queries against Iceberg's metadata, which is essential for schema management and data governance:
-- Example query to fetch historical metadata from an Iceberg table
TODO
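Pending the TODO above, Iceberg itself exposes metadata tables that such queries could target; for instance (reusing the table name from the earlier examples):

```sql
-- Iceberg metadata tables queried through Spark (standard Iceberg feature)
SELECT made_current_at, snapshot_id, is_current_ancestor
FROM glue.iceberg.vpc_flow_logs.history;

SELECT committed_at, snapshot_id, operation, summary
FROM glue.iceberg.vpc_flow_logs.snapshots;
```

Flint could accelerate these by indexing snapshot history the same way it indexes table data.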
- Skipping Index
- Covering Index
- Materialized View
- Index Maintenance
Here we outline the end-to-end user experience, demonstrating the steps involved from initial data exploration through advanced query optimization and table management.
```sql
-- Step 1: Data exploration
SELECT src_endpoint, dst_endpoint, action
FROM glue.iceberg.vpc_flow_logs
LIMIT 10;  -- limit size or sampling

-- Step 2: Zero-ETL by Flint index
CREATE INDEX src_dst_action ON glue.iceberg.vpc_flow_logs (
  src_endpoint,
  dst_endpoint,
  action
)
WHERE timestamp > (current_timestamp - interval '1' hour)  -- partial indexing
WITH (
  auto_refresh = true
);

-- Step 3a: Dashboard / DSL query Flint index directly
--   POST flint_glue_iceberg_vpc_flow_logs_src_dst_action_index
--   { ... }

-- Step 3b: SparkSQL query acceleration
-- Identify the top IP address pairs with rejected connections in the last hour
SELECT
  src_endpoint.ip || '->' || dst_endpoint.ip AS ip_pair,
  action,
  COUNT(*) AS count
FROM glue.iceberg.vpc_flow_logs
WHERE action = 'REJECT'
  AND time_dt > (current_timestamp - interval '1' hour)
GROUP BY 1, 2
ORDER BY count DESC
LIMIT 25;

-- Step 4: Iceberg table management
-- Data compaction on a regular basis triggered manually or by Glue
CALL local.system.rewrite_data_files(
  table => 'glue.iceberg.vpc_flow_logs',
  options => map('rewrite-all', 'true')
);

-- Step 5: Clean up
-- User drops unused covering index after analytics
DROP INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
VACUUM INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
```
Here is the architecture diagram that provides a comprehensive overview of the high-level design and key components:
The following table presents the high-level task breakdown, with a description of each task and its respective components. More detailed task descriptions can be found in the following sections:
| Feature | Component | Priority | Task | Github Issue | Comment |
|---|---|---|---|---|---|
| Data Exploration | Catalog | High | Add Iceberg catalog config in Spark job params | todo | |
| | Data Types | High | Support all Iceberg data types in direct query | todo | |
| Zero-ETL | Covering Index | Med | Map source column to OpenSearch field type | #384 | OpenSearch table design related |
| | | Med | Fix single OS index capacity issue | #339 | |
| | | High | Improve Flint data source reader performance | #334 | |
| | Materialized View | Low | Support event time ordering when cold start | #90 | |
| SparkSQL Query Acceleration | Skipping Index | Med | Disable skipping index create on Iceberg table | todo | |
| | Covering Index | High | Query rewrite with partial covering index | #298 | |
| | | Med | Add more filtering condition pushdown | #148 | OpenSearch table design related |
| | Materialized View | Low | Query rewrite with materialized view | todo | |
| | Index Advisor | Low | Disable skipping index advisor on Iceberg table | todo | |
| Index Maintenance | Index Data Freshness | Low | Index refresh idempotency | #88 | |
| | | Med | Include refresh status in show Flint index statement | #385 | |
| | | Low | Support hybrid scan for covering index | #386 | |
| | Index Management | Low | Support schema change in alter index statement | #387 | |
Users can execute common DDL statements and direct SQL queries on Iceberg tables for ad-hoc data analytics. Flint must support the Iceberg catalog and fully accommodate Iceberg data types, ensuring seamless integration and comprehensive data analysis capabilities.
Configure Spark job parameters to activate the Iceberg catalog, ensuring compatibility with FlintDelegatingSessionCatalog. Ref: https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-iceberg.html
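A sketch of the corresponding Spark job parameters, assuming an Iceberg catalog named `glue` backed by the AWS Glue Data Catalog (the bucket name is a placeholder; exact parameters depend on the EMR release per the linked guide):

```
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
--conf spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
--conf spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
--conf spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
--conf spark.sql.catalog.glue.warehouse=s3://example-bucket/warehouse
```

Defining Iceberg as a separate named catalog (rather than replacing `spark_catalog`) is what allows it to coexist with FlintDelegatingSessionCatalog and matches the `glue.iceberg.*` qualified names used throughout this document.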
Flint must fully support all Iceberg data types, including complex structures like Struct, List, and Map, to ensure comprehensive data handling capabilities. Ref: https://iceberg.apache.org/docs/latest/spark-getting-started/#type-compatibility
| Spark | Iceberg | Notes |
|---|---|---|
| boolean | boolean | |
| short | integer | |
| byte | integer | |
| integer | integer | |
| long | long | |
| float | float | |
| double | double | |
| date | date | |
| timestamp | timestamp with timezone | |
| timestamp_ntz | timestamp without timezone | |
| char | string | |
| varchar | string | |
| string | string | |
| binary | binary | |
| decimal | decimal | |
| struct | struct | |
| array | list | |
| map | map | |
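As an illustration of the complex-type mappings, a hypothetical Iceberg table exercising Struct, Map, and List (table and column names are invented for the example):

```sql
-- Hypothetical table covering the complex Iceberg types
CREATE TABLE glue.iceberg.http_logs_typed (
  status  INT,
  request STRING,
  client  STRUCT<ip: STRING, port: INT>,   -- maps to Iceberg struct
  headers MAP<STRING, STRING>,             -- maps to Iceberg map
  tags    ARRAY<STRING>                    -- maps to Iceberg list
) USING iceberg;

-- Direct query touching each complex type
SELECT client.ip, headers['user-agent'], tags[0]
FROM glue.iceberg.http_logs_typed
WHERE status = 200;
```

Flint must handle all three complex types both in direct queries like the above and when mapping columns to OpenSearch field types.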
Users can load raw or aggregated data directly into OpenSearch via covering indexes and materialized views, enabling full-text search and dashboard capabilities without the need for an Extract, Transform, Load (ETL) process.
Addressing limitations and improving the performance of covering indexes in issues below:
Addressing limitations of materialized views:
Users continue to use the familiar SparkSQL interface and leverage OpenSearch's indexing capabilities to accelerate SparkSQL queries.
- `ANALYZE` index statement for skipping index recommendations.
- Support query rewriting for covering index (full or partial) and materialized view:
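To illustrate the partial covering index rewrite (index name and predicate reused from the end-to-end walkthrough; the rewrite itself is performed by the optimizer and is shown here only conceptually):

```sql
-- A query whose filter falls entirely inside the index's WHERE clause ...
SELECT src_endpoint, dst_endpoint, action
FROM glue.iceberg.vpc_flow_logs
WHERE action = 'REJECT'
  AND timestamp > (current_timestamp - interval '30' minute);

-- ... can be answered by scanning the covering index src_dst_action instead
-- of the source table, because its partial-indexing predicate
-- (timestamp > current_timestamp - interval '1' hour) subsumes the filter.
```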
Provide tools for users to inspect index data freshness and ensure up-to-date query results:
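For instance, Flint's existing index statements can surface this information (statement forms as in the Flint SQL docs; exposing refresh status in the output is the enhancement tracked in #385):

```sql
-- List Flint indexes in a database, including refresh status per #385
SHOW FLINT INDEX IN glue.iceberg;

-- Inspect a single index's metadata
DESC INDEX src_dst_action ON glue.iceberg.vpc_flow_logs;
```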
Functional testing ensures Iceberg support works with all existing components and features, and newly added features perform correctly and meet the specified requirements.
| Category | Priority | Use Case | Test Parameters |
|---|---|---|---|
| Table Management | High | Create Iceberg table | |
| | Med | Create Spark data source table | |
| Direct Query | High | Query Iceberg table | |
| | Med | Query Spark data source table | |
| Skipping Index | High | Build skipping index from Iceberg table | |
| | High | Accelerate Iceberg table query with skipping index | |
| Covering Index | High | Build covering index from Iceberg table | |
| | High | Accelerate Iceberg table query with covering index | |
| Materialized View | High | Build materialized view from Iceberg table | |
| | High | Accelerate Iceberg table query with materialized view | |
| Index Management | Med | Show Flint indexes on Iceberg table | |
| | Med | Describe Flint index on Iceberg table | |
| | Med | Alter Flint index on Iceberg table | |
| | Med | Drop and vacuum Flint index on Iceberg table | |
| | Med | Recover index job for Flint index on Iceberg table | |
| Index Advisor | Low | Recommend skipping index on Iceberg table | |
Benchmarking performance for data exploration queries, zero-ETL ingestion, and SparkSQL query acceleration:
Issues related:
Is your feature request related to a problem?
Apache Iceberg is designed for managing large analytic tables in a scalable and performant way, using features like schema evolution, partitioning, and metadata management to optimize query performance. Despite these robust optimizations, the inherent latency of querying large datasets directly from S3 can be a pain point, especially for real-time analytics and interactive querying scenarios, when running complex or frequently accessed queries on large Iceberg tables.
TODO: the current problem statement is rather technical; more feedback from real Iceberg customers is needed.
What solution would you like?
Integrate current Flint’s query acceleration features with Iceberg to enhance performance:
TODO: evaluate missing features in https://github.com/opensearch-project/opensearch-spark/issues/367
What alternatives have you considered?
N/A
Do you have any additional context?
Known issues related: