opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/

[FEATURE] Materialized views (aka virtual indexes) on object stores #1080

Open elfisher opened 1 year ago

elfisher commented 1 year ago

What / Why

What are you proposing?

Users store logs, event data, and other data in object stores, like S3, for analysis with batch-based analytics tooling (e.g., Spark). This pattern, often called a data lake, enables people to cost-effectively and durably store data for analysis. As this relates to OpenSearch, we see people complement their OpenSearch deployments with their data lakes for analysis. For example, some people choose to store real-time data in OpenSearch for real-time analytics and use their data lakes for longer-term historical analytics. However, we have also heard from users that they still want to be able to join both data sets for analysis. Because of this, we propose introducing the ability to create virtual indexes on data stored in object stores so that users can visualize their data lake data alongside their OpenSearch data in OpenSearch Dashboards.

In order to accomplish this, we will introduce the ability to create virtual tables against data in object stores. From there, users will be able to create virtual indexes (aka materialized views) on the virtual tables and define the query used for generating the virtual index. OpenSearch will cache aggregated data into the virtual index and serve that data when users query against the virtual index. Users will be able to query this data as though it is regular OpenSearch data and visualize this data alongside their other OpenSearch data in OpenSearch Dashboards.

This proposal also aims to complement the existing storage vision by adding an option for querying data in object stores that is already stored in formats like Parquet, without requiring the data to be reindexed into the Lucene format.

Which users have asked for this feature?

There have been multiple discussions in GitHub and on the forums about using object stores and features like UltraWarm to drive down costs. This proposal adds another option: querying data in object stores in the formats it is already stored in. Some example discussions include:

What is the developer experience going to be? Does this have a REST API? If so, please describe the API and any impact it may have on existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI, or configuration changes that are planned as part of this feature.

This feature will have both APIs and UX in OpenSearch Dashboards for configuring, querying, and visualizing data from object stores. The high-level MVP workflow is as follows:

  1. Configuring a connection to an object store and bucket, which will auto-create a virtual table.
  2. Creating a materialized view (aka virtual index) of the virtual table to define the aggregations on the data you want cached (see the sketch after this list).
  3. Creating a visualization on the materialized view using any of the existing visualization types.
  4. Adding the visualization to an existing or new dashboard.
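
Purely for illustration, a minimal sketch of what steps 1 and 2 might look like at the API level. The endpoint, setting names, and view-definition syntax below are assumptions made for discussion, not a spec; a more concrete shape is sketched by @penghuo later in this thread.

# Step 1 (hypothetical endpoint): register an object store connection, auto-creating a virtual table
POST _plugins/_query/_datasources
{
  "name": "my_s3_logs",
  "connector": "s3",                              // assumption: connector type name
  "properties": {
    "location": "s3://example-bucket/http_log",   // hypothetical bucket/prefix
    "format": "parquet"
  }
}

# Step 2 (hypothetical syntax): define a materialized view (virtual index) with the aggregation to cache
POST _plugins/_sql
{
  "query": "
     CREATE MATERIALIZED VIEW http_status_counts
     AS SELECT status, count(*) AS cnt
        FROM my_s3_logs
        GROUP BY status
     WITH (REFRESH=AUTO)
  "
}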

Are there any security considerations? What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here.

Answer: Yes. This feature will have multiple security considerations:

  1. Credentials to object stores must be securely stored and used.
  2. This feature should adhere to the existing security capabilities of OpenSearch (e.g., Document-level security, index security, field-level security, and more).
  3. There will need to be permissions for users to create data sources and create virtual indexes.
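
As one illustration of point 2, a hedged sketch of how existing index-level, document-level, and field-level security might be scoped to a virtual index through the Security plugin's roles API. The role name, index pattern, and DLS query are assumptions; whether virtual indexes plug into this mechanism unchanged is part of what needs to be worked out.

PUT _plugins/_security/api/roles/virtual_index_analyst
{
  "index_permissions": [
    {
      "index_patterns": ["http_log_virtual*"],    // hypothetical virtual index name
      "dls": "{\"term\": {\"region\": \"iad\"}}",  // document-level security filter
      "fls": ["~clientip"],                        // field-level security: hide clientip
      "allowed_actions": ["read"]
    }
  ]
}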

Are there any breaking changes to the API? If Yes, what is the path to minimizing impact? (example, add new API and deprecate the old one)

Answer: No. This will all be additive.

Are there breaking changes to the User Experience? Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?

Answer: No. This will all be additive.

What will it take to execute? Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

There is risk around the performance and stability of querying large datasets stored in object stores. To be successful, we will need to benchmark performance and reliability and establish recommended caching configurations.

Any remaining open questions? What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation.

The biggest area of ambiguity is how this will evolve with the multiple data sources effort in OpenSearch Dashboards. The goal is to align the UXs so that these are all managed and interacted with in similar ways. It is imperative that these experiences do not diverge, as divergence would cause friction for users who want to integrate multiple data sources into OpenSearch Dashboards.

Beyond the initial release this feature can be enhanced by:

  1. Integrating more data formats (e.g., JSON, CSV, other JDBC sources).
  2. Improved caching mechanisms like prefetching based on Dashboard configurations and frequently accessed aggregations.
    1. Note there may be optimization learnings from the storage vision projects.
  3. For JDBC, potentially executing queries on the remote systems for higher performance/scale
  4. Integrating the connection configuration into the admin panel effort
  5. Integrating refresh or other virtual index operations into Index Management
  6. Automatically spilling queries to the raw data source to reduce local cache size requirements.

User stories

anirudha commented 1 year ago

Feature demo at a community meeting: https://youtu.be/AiKb2sVexrA?list=PLzgr9zSpws15dkg6979WeKkLg4T-vLNeW&t=884

dblock commented 1 year ago

  • Are we reinventing/duplicating Hive?
  • Am I only going to be able to query these indices via SQL, with the SQL plugin installed?

elfisher commented 1 year ago

This does not aim to be a reinvention of Hive. The goal is to make it easier to query/visualize data from object stores alongside data in OpenSearch. The plan is to have virtual indexes queryable via both DSL and SQL.

dblock commented 1 year ago

I think we should be able to mount all kinds of index formats in OpenSearch data nodes, including distributed ones, then query them. IMO this belongs in OpenSearch core.

elfisher commented 1 year ago

I think we should be able to mount all kinds of index formats in OpenSearch data nodes, including distributed ones, then query them. IMO this belongs in OpenSearch core.

Makes sense.

penghuo commented 1 year ago

I think we can define the virtual index and materialized view as follows.

After creating a virtual index, OpenSearch users operate on the virtual index only on warm data nodes, using the existing index APIs. OpenSearch Dashboards users can create an index pattern from the virtual index and build visualizations on it. Processing queries on a virtual index / warm index can be expensive in terms of system resources and the time it takes to compute the results. Users can materialize a virtual index, a warm index, or even a hot index as a materialized view.


Virtual Index - Query object store

We propose a solution that leverages Lucene to read Parquet-format data on an object store and uses metadata information as a secondary index (a skipping index) to filter Parquet files. Users ingest Parquet datasets into S3 and put the metadata information into a virtual index. The virtual index works the same as an ordinary index but only supports docValues-related operations. The virtual index can only be queried on a warm node. The virtual index is append-only and does not support document update or delete operations. For the implementation, we leverage existing UltraWarm caching, optimization, and prefetch technologies.

Materialized View - Auto

The Auto MV is the component responsible for creating and maintaining materialized views automatically. Internally, an Auto MV works as a streaming query that pulls data from the source, processes it, and finally writes to the sink. The source and sink of an Auto MV can be any index (hot, warm, or virtual).

User Experience - API User

PUT myIndex
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "source": {
      "type": "virtual",
      "location": "s3://maximus.dev.us-west-2/http_log",
      "format": "parquet",
      "compression": "gz"
    }
  },
  "mappings": {
    "_meta": {
      "path": {
        "type": "keyword"
      }
    },
    "properties": {
      "date": {
        "tag": "true",
        "type": "date"
      },
      "region": {
        "tag": "true",
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      },
      "clientip": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "size": {
        "type": "integer"
      },
      "status": {
        "type": "integer"
      }
    }
  }
}
# Register metadata for a newly ingested Parquet file:
# s3://maximus.dev.us-west-2/http_log/2022/12/01/log.1.parquet
POST myIndex/_doc/
{
  "path": "s3://maximus.dev.us-west-2/http_log/2022/12/01/log.1.parquet",
  "date": "2022-12-01",
  "region": "iad"
}
# DSL
GET myIndex/_search
{
  "query": {
    "term": {
      "region": "iad"        // only tag field could be used in filter?
    }
  },
  "aggs": {
    "reqcount": {
      "terms": {
        "field": "host"
      }
    }
  }
}

# SQL
POST _plugins/_sql
{
  "query": "
     SELECT count(*) as reqcount 
     FROM myIndex 
     WHERE region = 'iad' 
     GROUP BY host
  "
}
CREATE MATERIALIZED VIEW reqCnt (
  cnt    long,
  time   timestamp,
  status string
)
AS source=myIndex | stats count() as cnt by span(5mins), host
WITH (
    REFRESH=AUTO
)
POST _plugins/_sql
{
  "query": "SELECT * FROM reqCnt"
}
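
Presumably the materialized view behaves like an ordinary index, as described above, so a sketch of the equivalent DSL query (the range filter is illustrative):

# DSL
GET reqCnt/_search
{
  "query": {
    "range": {
      "time": { "gte": "now-1h" }    // "time" is the timestamp column defined in the view
    }
  }
}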

User Experience

OSD users can create visualizations using virtual indices, just like hot/warm indices. In the Phase 1 release, OSD users can choose to materialize the visualizations to build low-latency dashboards. In the Phase 2 release, the OpenSearch query engine could leverage query stats to create and manage materialized views automatically, without user input.

sathishbaskar commented 1 year ago

Are there any hard limits to be careful about? Object stores suggest the possibility of effectively infinite (or really large) storage at relatively insignificant cost. But what does that mean for the size and handling of cluster state? Would a cluster connected to an object store keep track of all available segments in the object store?

brijos commented 1 year ago

@sathishbaskar we agree with your concerns. Please see https://github.com/opensearch-project/opensearch-spark/issues/4 for how we plan to address them. We would love your feedback.