penghuo opened this issue 1 year ago
This enables Spark as a compute connector to OpenSearch data, correct? Can we set this up as a remote compute connection, similar to a data source?
> This enables Spark as a compute connector to OpenSearch data, correct?

Yes, but it is not limited to OpenSearch data.

> Can we set this up as a remote compute connection, similar to a data source?

That is one option, but I feel we should make it more generic. ML could also leverage Spark as a computation engine.
Is the query from the Spark cluster to the index SQL?
Will the OpenSearch SQL engine be responsible for analyzing the query and dispatching all queries to the MPP engine? Or will it be able to execute parts of the query itself (for OpenSearch indices) and delegate other parts to Spark? Will this require adding rules to Catalyst?
Just some thoughts for discussion and PoC later: we need to verify and confirm the role of Spark RDD (with/without Spark SQL) in OpenSearch:
As discussed, Spark SQL and RDD are only for purpose #1 above. Leveraging them to query OpenSearch indices is a totally different story and not our current goal. So the question for introducing Spark SQL is whether we need it for optimizing and planning Spark RDD jobs that query the object store.
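For a concrete reference point, here is a minimal sketch of letting Spark SQL, rather than hand-written RDD code, plan a query against an object store. The table name, schema, and S3 path below are hypothetical:

```sql
-- Hypothetical example: expose raw logs on S3 as an external Spark table,
-- then let Spark SQL (Catalyst) plan and optimize the scan.
CREATE TABLE http_logs (
  status     INT,
  client_ip  STRING,
  request_ts TIMESTAMP
)
USING parquet
LOCATION 's3://my-bucket/http_logs/';

-- Catalyst handles filter and column pruning for this scan; with raw RDDs
-- we would have to implement that optimization and planning ourselves.
SELECT client_ip, COUNT(*) AS errors
FROM http_logs
WHERE status = 500
GROUP BY client_ip;
```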
Implementation options:
Research items:
Amazing stuff!
How will you support filtering (e.g. timestamp ranges and/or keywords) in relation to the S3 path schema?
For example, if using Fluent Bit's S3 output with `s3_key_format /$TAG[2]/$TAG[0]/%Y/%m/%d/%H%M_%S/$UUID.gz`, how will we map a keyword so that we pull only the objects matching the tags in a supplied filter and the desired time range?
@ryn9 Similar to optimizations in other query engines, we can leverage partition pruning and data skipping on your data (path or content). Please see a general example of data skipping in https://github.com/opensearch-project/sql/issues/1379#issuecomment-1459022886. We may look into Fluent Bit later. Thanks!
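To make the idea concrete, here is a sketch of partition pruning, assuming the tag and date components of the S3 key are exposed as Hive-style partition columns (Fluent Bit's raw key layout would have to be rewritten or mapped into this form; all names below are hypothetical):

```sql
-- Hypothetical layout: objects under s3://logs/tag=.../year=.../month=.../day=.../
-- Mapping path components to partition columns lets the engine skip whole
-- prefixes that cannot match the filter.
CREATE TABLE app_logs (
  message STRING,
  tag     STRING,
  year    INT,
  month   INT,
  day     INT
)
USING json
PARTITIONED BY (tag, year, month, day)
LOCATION 's3://logs/';

-- Only objects under tag=webserver/year=2023/month=3/ are listed and read;
-- all other partitions are pruned before any data is fetched.
SELECT message
FROM app_logs
WHERE tag = 'webserver' AND year = 2023 AND month = 3;
```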
Decorators will be available in Fluent Bit / Data Prepper / otel-exporter.
Great initiative. I really like the price-performance trade-off that this solution will bring. A few questions below:
a. What are the types of queries that don't work with SQL today?
b. How will DLS/FLS work?
c. How will document-level alerting/percolator work?
@muralikpbhat Thanks for all the comments! Please find my answers inline below.
> 1. Can we think about and call out the downsides of doing query planning in Spark? Will it restrict some of the existing features of OpenSearch? What are those?
In our demo, we use Spark SQL mostly for building the skipping index and MV into an OpenSearch index. In the end, all queries and dashboards work with the index as before.
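For illustration only, a rough sketch of how a min/max skipping index could be computed with Spark SQL; table names are hypothetical, and the actual demo writes the result into an OpenSearch index rather than Parquet:

```sql
-- Hypothetical sketch: compute per-file min/max statistics (a "skipping
-- index"); input_file_name() identifies the source object for each row.
CREATE TABLE http_logs_skipping_index
USING parquet  -- illustration only; the demo sinks into an OpenSearch index
AS
SELECT file_path,
       MIN(request_ts) AS min_ts,
       MAX(request_ts) AS max_ts
FROM (
  SELECT input_file_name() AS file_path, request_ts
  FROM http_logs
) t
GROUP BY file_path;

-- At query time, only files whose [min_ts, max_ts] range overlaps the
-- query predicate need to be scanned.
```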
As you asked below, I assume we're talking about Spark SQL queries with OpenSearch indices involved. If so, there are limitations:
> a. What are the types of queries that don't work with SQL today?
OpenSearch functions, including full-text search and aggregation: this may be solved by either improving OpenSearch-Hadoop or introducing our OpenSearch SQL plugin into Spark.
> b. How will DLS/FLS work?
I think we need separate AuthN/Z for raw data on S3. If you're talking about the OpenSearch index, the query sent to OpenSearch is still DSL, which may work. We need a deep dive.
> c. How will document-level alerting/percolator work?
I think all OpenSearch features can work with MVs. But for raw data, I'm not sure; we need to understand the use case and workflow.
> 2. How are we thinking about lifecycle management of materialized views? We need the ability to delete old MVs. Assuming the Maximus table and skipping indices don't need that, as they will not be very huge.
Yes, we're considering MV as a second-level, on-demand acceleration strategy. We will provide a standard SQL API for higher-level applications to use, such as SHOW/DROP MV.
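As a sketch only, since the DDL grammar was not finalized at the time of this discussion, the lifecycle API could look something like:

```sql
-- Proposed syntax sketch, not valid Spark SQL today: create an MV that a
-- background job keeps (eventually) in sync with the source table.
CREATE MATERIALIZED VIEW http_logs_mv
AS SELECT status, COUNT(*) AS cnt FROM http_logs GROUP BY status;

-- Lifecycle management: list existing MVs, drop old ones to reclaim storage.
SHOW MATERIALIZED VIEWS;
DROP MATERIALIZED VIEW http_logs_mv;
```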
> 3. Are we using data streams for MV so that we don't need explicit index rotation?
As shown in the demo above, the sink (destination) of the streaming job behind an MV is a regular OpenSearch index. I think we can make it any OpenSearch object as long as the OpenSearch-Hadoop connector supports it.
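A sketch of what registering an OpenSearch index as the sink table might look like; the format name and `resource` option below follow the elasticsearch-hadoop convention that opensearch-hadoop was forked from, so treat them as assumptions and verify against an actual release:

```sql
-- Hypothetical sketch: expose an OpenSearch index as a Spark SQL table.
-- Format name and options are assumed from the elasticsearch-hadoop
-- convention; check the opensearch-hadoop documentation for the real ones.
CREATE TABLE http_logs_mv_sink
USING org.opensearch.spark.sql
OPTIONS (resource 'http_logs_mv');

-- The job behind the MV then writes into this table, which is a regular
-- OpenSearch index that dashboards can query as usual.
INSERT INTO http_logs_mv_sink
SELECT status, COUNT(*) AS cnt
FROM http_logs
GROUP BY status;
```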
> 4. Can we think of on-demand materialized views instead of keeping them up to date (cost reduction)?
Yes, that's what we're doing in the demo. We intentionally ignore strong consistency between the MV and its source.
> 5. In the case of MV, can the query span across the MV and raw data? (The case where one data file is projected completely and the other is not.)
Yes: because the MV itself is a table too, users can use it in any query together with raw data. We didn't do this in the demo because OpenSearch-Hadoop currently doesn't extend the Spark Catalog, so effort is required to register an MV or any OpenSearch index in the Spark catalog.
Meanwhile, I'm not sure what specific use case or query you're referring to. We are also considering this for a future Hybrid Scan capability: a hybrid scan would union the MV data with the latest raw data, which is helpful for customers who want strong consistency.
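A minimal sketch of what such a hybrid scan could look like; the table names and the refresh watermark are hypothetical:

```sql
-- Hypothetical hybrid scan: serve most of the query from the MV, and union
-- in only the raw rows that arrived after the last MV refresh.
SELECT status, cnt
FROM http_logs_mv
UNION ALL
SELECT status, COUNT(*) AS cnt
FROM http_logs
WHERE request_ts > TIMESTAMP '2023-03-01 00:00:00'  -- assumed last refresh time
GROUP BY status;
```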
> 6. Similarly, can the query span across fields in the MV and raw data for the same document? (Not for fields in the skipping index, but for fields in the MV's covered index.)
I'm not quite sure what that query looks like, but I think it's possible as long as there is a primary key field in the MV that correlates to the row in the raw data.
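For example, something along these lines, assuming a hypothetical covering MV that stores a `doc_id` primary key alongside its covered fields:

```sql
-- Hypothetical sketch: filter on a field covered by the MV, then join back
-- to the raw data on the primary key to fetch fields the MV does not cover.
SELECT mv.doc_id, mv.status, raw.full_message
FROM logs_covering_mv mv
JOIN raw_logs raw ON mv.doc_id = raw.doc_id
WHERE mv.status = 500;
```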
Would joins involve pulling data to RDDs?
> Would joins involve pulling data to RDDs?
Could you elaborate more? Do you mean joining an OpenSearch index and S3?
> Could you elaborate more? Do you mean joining an OpenSearch index and S3?
An example would help explain this better. Consider the following datasets:

- `users` [20 billion docs, ~2 TB]: `user_id`, `user_name`, `user_location`
- `pages` [1 trillion docs, ~90 TB]: `page_id`, `website_id`
- `page_views` [10 trillion docs, over 1 PB]: `hour_timestamp`, `user_id`, `page_id`
If I have to prepare a report every day to summarize the page view pattern over the last 7 days (top 100 pages and top 100 locations), the queries would look like:
```sql
SELECT
  DATE(pv.hour_timestamp) AS day, HOUR(pv.hour_timestamp) AS hour,
  pv.page_id, p.website_id, COUNT(*) AS views
FROM
  page_views pv JOIN pages p ON pv.page_id = p.page_id
WHERE
  pv.hour_timestamp >= date_sub(current_date(), 7)
GROUP BY
  day, hour, pv.page_id, p.website_id
ORDER BY views DESC
LIMIT 100
```
```sql
SELECT
  DATE(pv.hour_timestamp) AS day, HOUR(pv.hour_timestamp) AS hour,
  u.user_location, COUNT(*) AS views
FROM
  page_views pv JOIN users u ON pv.user_id = u.user_id
WHERE
  pv.hour_timestamp >= date_sub(current_date(), 7)
GROUP BY
  day, hour, u.user_location
ORDER BY views DESC
LIMIT 100
```
Assuming `users` and `pages` are completely available in OpenSearch storage in a reasonably large cluster, and `page_views` is a materialized view with most data in S3, I'd like to understand how we plan to make the joins work. Would Spark data frames be loaded with data fetched from the OpenSearch index and OpenSearch materialized views, and then processed within the Spark runtime? And do we intend to push down some of the compute to OpenSearch, since that could avoid a good amount of network transfer?
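To illustrate the trade-off in this question, here is one possible shape of the first report query, with hypothetical catalog registration (`s3_catalog` for the S3-backed MV, `os_catalog` for OpenSearch-resident tables): pre-aggregating the heavy `page_views` fact table before the join means only the reduced result crosses the network into Spark's shuffle:

```sql
-- Hypothetical catalogs: s3_catalog.default.page_views is the S3-backed MV,
-- os_catalog.default.pages lives in OpenSearch. Aggregating page_views first
-- shrinks ~1 PB of fact data down to (day, hour, page_id) groups before the
-- join, instead of shipping raw rows into the join.
SELECT pv.day, pv.hour, pv.page_id, p.website_id, pv.views
FROM (
  SELECT DATE(hour_timestamp) AS day,
         HOUR(hour_timestamp) AS hour,
         page_id,
         COUNT(*) AS views
  FROM s3_catalog.default.page_views
  WHERE hour_timestamp >= date_sub(current_date(), 7)
  GROUP BY DATE(hour_timestamp), HOUR(hour_timestamp), page_id
) pv
JOIN os_catalog.default.pages p ON pv.page_id = p.page_id
ORDER BY pv.views DESC
LIMIT 100;
```

Whether the inner aggregation (or the scan of `pages`) can additionally be pushed down into OpenSearch itself, rather than executed by Spark, is exactly the open question raised above.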
## Introduction
We received a feature request for query execution on object stores in OpenSearch.
We have investigated the possibility of building a new solution for OpenSearch users that leverages an object store as storage, which includes:
We found the challenges are
We found this work has already been solved by general-purpose data processing systems, e.g. Presto, Spark, Trino, and building such a platform requires years to mature.
## Idea
The initial idea is
High level diagram:
## User Experience
## Epic