opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/
Apache License 2.0
110 stars 129 forks source link

[RFC] Query S3 in OpenSearch Observability #719

Open penghuo opened 1 year ago

penghuo commented 1 year ago

1.Overview

In this document, we will propose a solution in OpenSearch Observability to query log data stored in S3.

1.1.Problem Statements

Currently, OpenSearch Observability is collection of plugins and applications that let you visualize data-driven events by using Piped Processing Language to explore, discover, and query data stored in OpenSearch. The major requirements we heard from customer are

Can build a new solution for OpenSearch observably uses and leverage S3 as storage. The benefits are

With all these benefits, are there any concerns? The ability to query S3 in OpenSearch and query performance are the major concerns. In this doc, we will provide the solution to solve these two major concerns.

2.Terminology

3.Requirements

3.1.Use Cases

Use Case - 1: pre-build and query metrics from log on s3

Create

settings.s3.access_key: xxxxx
settings.s3.secret_key: xxxxx
CREATE EXTERNAL TABLE `s3`.`httplog`(
   @timestamp timestamp,
   clientip string,
   request string,
   state integer,
   size long
   )
ROW FORMAT SERDE
   'json'
   'grok', <timestamp> <className>
PARTITION BY
   ${region}/${year}/${month}/${day}/
LOCATION
   's3://my-raw-httplog/';
CREATE MATERIALIZED VIEW failEventsMetrics (
  cnt    long,
  time   timestamp,
  status string
)
AS source=`s3`.`httpLog` status != 200 | status count() as cnt by span(5mins), status
WITH (
    REFRESH=AUTO
)
DESCRIBE/SHOW MATERIALIZED VIEW failEventsMetrics

# Return
status: INIT | IN_PROGRESS | READY

Query

source=failEventsMetrics time in last 7 days
source=`s3`.`httpLog` status != 200 | status count() as cnt by span(1hour), status

Drop

DROP TABLE `s3`.`httpLog`
DROP MATERIALIZED VIEW failEventsMetrics

Access Control

Use Case - 2: Ad-hoc s3 query in OpenSearch

Create

Query

source=`s3`.`httpLog` status != 200 | status count() as cnt by span(5mins), status

List

LIST VIEW on `s3`.`httpLog`

# return
httplog-tempView-xxxx

Access Control

3.2.Function Requirements

Query S3

View

Query Optimization

Automatic query acceleration

S3 data format

Resource Management

Fault Tolerant

Setting

3.3.Non Function Requirements

Security:

There are three types of privileges that are related to materialized views

materialized view itself

objects (e.g. table) that the materialized view accesses

table

objects (e.g. table) that the table refer

Encryption

Others

4.What is, and what is not

muralikpbhat commented 1 year ago

This is exciting. How do we plan to implement the query on S3 row data without inverted index? Are we thinking of something like what CWLogs Insights or Athena does in AWS?