opensearch-project / sql

Query your data using familiar SQL or intuitive Piped Processing Language (PPL)
https://opensearch.org/docs/latest/search-plugins/sql/index/

[FEATURE] Materialized views (aka virtual indexes) on object stores #1080

Open elfisher opened 1 year ago

elfisher commented 1 year ago

What / Why

What are you proposing?

Users store logs, event data, and other data in object stores, like S3, for analysis with batch-based analytics tooling (e.g., Spark). This pattern, often called a data lake, enables people to cost-effectively and durably store data for analysis. As this relates to OpenSearch, we see people complement their OpenSearch deployments with their data lakes for analysis. For example, some people choose to store real-time data in OpenSearch for real-time analytics and use their data lakes for longer-term historical analytics. However, we have also heard from users that they still want to be able to join both data sets for analysis. Because of this, we propose introducing the ability to create virtual indexes on data stored in object stores so that users can visualize their data lake data alongside their OpenSearch data in OpenSearch Dashboards.

In order to accomplish this, we will introduce the ability to create virtual tables against data in object stores. From there, users will be able to create virtual indexes (aka materialized views) on the virtual tables and define the query used for generating the virtual index. OpenSearch will cache aggregated data into the virtual index and serve that data when users query against the virtual index. Users will be able to query this data as though it is regular OpenSearch data and visualize this data alongside their other OpenSearch data in OpenSearch Dashboards.

This proposal also aims to complement the existing storage vision by adding an option for querying data in object stores that is already stored in formats like Parquet, without requiring the data to be reindexed into the Lucene format.

Which users have asked for this feature?

There have been multiple discussions in GitHub and on the forums about using object stores and features like UltraWarm to drive down costs. This proposal adds another option: querying data in object stores in the formats it is already stored in. Some example discussions include:

What is the developer experience going to be? Does this have a REST API? If so, please describe the API and any impact it may have on existing APIs. In a brief summary (not a spec), highlight what new REST APIs or changes to REST APIs are planned, as well as any other API, CLI, or configuration changes that are planned as part of this feature.

This feature will have both APIs and UX in OpenSearch Dashboards for configuring, querying, and visualizing data from object stores. The high-level MVP workflow is as follows:

  1. Configuring a connection to an object store and bucket, which will auto-create a virtual table.
  2. Creating a materialized view (aka virtual index) of the virtual table to define the aggregations on the data you want cached (see the sketch after this list).
  3. Creating a visualization on the materialized view using any of the existing visualization types.
  4. Adding the visualization to an existing or new dashboard.
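
Purely for illustration, a minimal sketch of what steps 1 and 2 might look like at the API level. The endpoint, setting names, and view-definition syntax below are assumptions made for discussion, not a spec; a more concrete shape is sketched by @penghuo later in this thread.

# Step 1 (hypothetical endpoint): register an object store connection, auto-creating a virtual table
POST _plugins/_query/_datasources
{
  "name": "my_s3_logs",
  "connector": "s3",                              // assumption: connector type name
  "properties": {
    "location": "s3://example-bucket/http_log",   // hypothetical bucket/prefix
    "format": "parquet"
  }
}

# Step 2 (hypothetical syntax): define a materialized view (virtual index) with the aggregation to cache
POST _plugins/_sql
{
  "query": "
     CREATE MATERIALIZED VIEW http_status_counts
     AS SELECT status, count(*) AS cnt
        FROM my_s3_logs
        GROUP BY status
     WITH (REFRESH=AUTO)
  "
}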

Are there any security considerations? What is the security model of the new APIs? Features should be integrated into the OpenSearch security suite and so if they are not, we should highlight the reasons here.

Answer: Yes. This feature will have multiple security considerations:

  1. Credentials to object stores must be securely stored and used.
  2. This feature should adhere to the existing security capabilities of OpenSearch (e.g., Document-level security, index security, field-level security, and more).
  3. There will need to be permissions for users to create data sources and create virtual indexes.
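
As one illustration of point 2, a hedged sketch of how existing index-level, document-level, and field-level security might be scoped to a virtual index through the Security plugin's roles API. The role name, index pattern, and DLS query are assumptions; whether virtual indexes plug into this mechanism unchanged is part of what needs to be worked out.

PUT _plugins/_security/api/roles/virtual_index_analyst
{
  "index_permissions": [
    {
      "index_patterns": ["http_log_virtual*"],    // hypothetical virtual index name
      "dls": "{\"term\": {\"region\": \"iad\"}}",  // document-level security filter
      "fls": ["~clientip"],                        // field-level security: hide clientip
      "allowed_actions": ["read"]
    }
  ]
}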

Are there any breaking changes to the API? If Yes, what is the path to minimizing impact? (example, add new API and deprecate the old one)

Answer: No. This will all be additive.

Are there breaking changes to the User Experience? Will this change the existing user experience? Will this be a breaking change from a user flow or user experience perspective?

Answer: No. This will all be additive.

What will it take to execute? Are there any assumptions you may be making that could limit scope or add limitations? Are there performance, cost, or technical constraints that may impact the user experience? Does this feature depend on other feature work? What additional risks are there?

There is risk around the performance and stability of querying large datasets stored in object stores. To be successful, we will need to benchmark performance and reliability and establish recommended caching configurations.

Any remaining open questions? What are known enhancements to this feature? Any enhancements that may be out of scope but that we will want to track long term? List any other open questions that may need to be answered before proceeding with an implementation.

The biggest area of ambiguity is how this will evolve with the multiple data sources effort in OpenSearch Dashboards. The goal is to align the UXs so that these are all managed and interacted with in similar ways. It is imperative that these experiences do not diverge, as divergence would cause friction for users who want to integrate multiple data sources into OpenSearch Dashboards.

Beyond the initial release this feature can be enhanced by:

  1. Integrating more data formats (e.g., JSON, CSV, other JDBC sources).
  2. Improved caching mechanisms like prefetching based on Dashboard configurations and frequently accessed aggregations.
    1. Note there may be optimization learnings from the storage vision projects.
  3. For JDBC, potentially executing queries on the remote systems for higher performance/scale
  4. Integrating the connection configuration into the admin panel effort
  5. Integrating refresh or other virtual index operations into Index Management
  6. Automatically spilling queries to the raw data source to reduce local cache size requirements.

User stories

anirudha commented 1 year ago

Feature demo at a community meeting: https://youtu.be/AiKb2sVexrA?list=PLzgr9zSpws15dkg6979WeKkLg4T-vLNeW&t=884

dblock commented 1 year ago

  • Are we reinventing/duplicating Hive?
  • Am I only going to be able to query these indices via SQL, with the SQL plugin installed?

elfisher commented 1 year ago

This does not aim to be a reinvention of Hive. The goal is to make it easier to query/visualize data from object stores alongside data in OpenSearch. The plan is to have virtual indexes queryable via both DSL and SQL.

dblock commented 1 year ago

I think we should be able to mount all kinds of index formats in OpenSearch data nodes, including distributed ones, then query them. IMO this belongs in OpenSearch core.

elfisher commented 1 year ago

I think we should be able to mount all kinds of index formats in OpenSearch data nodes, including distributed ones, then query them. IMO this belongs in OpenSearch core.

Makes sense.

penghuo commented 1 year ago

I think we can define the virtual index and materialized view as follows.

After creating a virtual index, OpenSearch users operate on the virtual index only on warm data nodes, using the existing index APIs. OpenSearch Dashboards users can create an index pattern from the virtual index and build visualizations on it. Processing queries on a virtual index / warm index can be expensive in terms of system resources and the time it takes to compute the results. Users can materialize a virtual index, a warm index, or even a hot index as a materialized view.


Virtual Index - Query object store

We propose a solution that leverages Lucene to read Parquet-format data on an object store and uses metadata information as a secondary index (a skipping index) to filter Parquet files. Users ingest Parquet datasets into S3 and put the metadata information into a virtual index. The virtual index works the same as an ordinary index but only supports docValues-related operations. The virtual index can only be queried on a warm node. The virtual index is append-only and does not support document update or delete operations. For the implementation, we leverage existing UltraWarm caching, optimization, and prefetch technologies.

Materialized View - Auto

The Auto MV is the component responsible for creating and maintaining materialized views automatically. Internally, an Auto MV works as a streaming query that pulls data from the source, processes it, and finally writes to the sink. The source and sink of an Auto MV can be any index (hot, warm, or virtual).

User Experience - API User

PUT myIndex
{
  "settings": {
    "index": {
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "source": {
      "type": "virtual",
      "location": "s3://maximus.dev.us-west-2/http_log",
      "format": "parquet",
      "compression": "gz"
    }
  },
  "mappings": {
    "_meta": {
      "path": {
        "type": "keyword"
      }
    },
    "properties": {
      "date": {
        "tag": "true",
        "type": "date"
      },
      "region": {
        "tag": "true",
        "type": "keyword"
      },
      "@timestamp": {
        "type": "date"
      },
      "clientip": {
        "type": "keyword"
      },
      "request": {
        "type": "keyword"
      },
      "size": {
        "type": "integer"
      },
      "status": {
        "type": "integer"
      }
    }
  }
}
# Register metadata for a newly ingested Parquet file:
# s3://maximus.dev.us-west-2/http_log/2022/12/01/log.1.parquet
POST myIndex/_doc/
{
  "path": "s3://maximus.dev.us-west-2/http_log/2022/12/01/log.1.parquet",
  "date": "2022-12-01",
  "region": "iad"
}
# DSL
GET myIndex/_search
{
  "query": {
    "term": {
      "region": "iad"        // only tag field could be used in filter?
    }
  },
  "aggs": {
    "reqcount": {
      "terms": {
        "field": "host"
      }
    }
  }
}

# SQL
POST _plugins/_sql
{
  "query": "
     SELECT count(*) as reqcount 
     FROM myIndex 
     WHERE region = 'iad' 
     GROUP BY host
  "
}
CREATE MATERIALIZED VIEW reqCnt (
  cnt    long,
  time   timestamp,
  status string
)
AS source=myIndex | stats count() as cnt by span(5mins), host
WITH (
    REFRESH=AUTO
)
POST _plugins/_sql
{
  "query": "SELECT * FROM reqCnt"
}
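
Presumably the materialized view behaves like an ordinary index, as described above, so a sketch of the equivalent DSL query (the range filter is illustrative):

# DSL
GET reqCnt/_search
{
  "query": {
    "range": {
      "time": { "gte": "now-1h" }    // "time" is the timestamp column defined in the view
    }
  }
}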

User Experience

OSD users can create visualizations using virtual indices, just like hot/warm indices. In the Phase 1 release, OSD users can choose to materialize the visualizations to build low-latency dashboards. In the Phase 2 release, the OpenSearch query engine could leverage query stats to create and manage materialized views automatically, without user input.

sathishbaskar commented 1 year ago

Are there any hard limits to be careful about? Object stores suggest the possibility of effectively infinite (or really large) storage at relatively insignificant cost. But what does that mean for the size and handling of cluster state? Would a cluster connected to an object store keep track of all available segments in the object store?

brijos commented 1 year ago

@sathishbaskar we agree with your concerns. Please see https://github.com/opensearch-project/opensearch-spark/issues/4 for how we plan to address them. We would love your feedback.