opensearch-project / opensearch-spark

Spark Accelerator framework; it enables secondary indices on remote data stores.
Apache License 2.0

[FEATURE] Table Analysis API #113

Open YANG-DB opened 1 year ago

YANG-DB commented 1 year ago

Is your feature request related to a problem?

In some cases the user can benefit from a built-in table analysis API (query), so that they can get a good estimate of the cost/compute for different operations.

What solution would you like? A Flint call to `/_async_analyze/$tableName` would return the following response:

Based on the statistics provided, here's the summarized information about the table:

- **Table Name**: `otel_traces`
- **Database**: `default`
- **Owner**: `hadoop`
- **Created By**: `Spark 3.3.2-amzn-0`
- **Table Type**: `EXTERNAL`
- **Data Format**: `json`
- **Location**: `s3://flint-data-dp-eu-west-1-beta/oteldemo`

### Statistics

- **Total Size**: 297,027,661,632 bytes (approximately 297 GB)
- **Total Rows**: 5,982,891 rows

### Technical Details

- **Serde Library**: `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`
- **InputFormat**: `org.apache.hadoop.mapred.SequenceFileInputFormat`
- **OutputFormat**: `org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat`
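Spark reports the raw statistics as a single string (e.g. `297027661632 bytes, 5982891 rows` in the `Statistics` row of `DESCRIBE TABLE EXTENDED` output). A minimal sketch, assuming that string format, of how the analysis API could turn it into the human-readable summary above (`parse_statistics` is a hypothetical helper, not part of Flint):

```python
import re

def parse_statistics(stats: str) -> dict:
    """Parse a Spark 'Statistics' value like '297027661632 bytes, 5982891 rows'
    into total size and row count (hypothetical helper, not part of Flint)."""
    m = re.match(r"(\d+)\s+bytes(?:,\s*(\d+)\s+rows)?", stats)
    if not m:
        raise ValueError(f"unrecognized statistics string: {stats!r}")
    size = int(m.group(1))
    rows = int(m.group(2)) if m.group(2) else None
    return {
        "total_size_bytes": size,
        "total_size_human": f"approximately {size / 1e9:.0f} GB",
        "total_rows": rows,
    }

print(parse_statistics("297027661632 bytes, 5982891 rows"))
```

The row count may be absent if `ANALYZE TABLE` has not been run, hence the optional group in the regex.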

Implementation process:

These statistics can also be collected continuously, so that the query engine always has up-to-date statistics for query analysis. The user could be shown these statistics when hovering over the table in the data explorer view.
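A minimal sketch of the continuous-collection idea, assuming a simple in-process cache with a refresh interval (`StatsCache` and the `fetch_stats` callback are hypothetical, not Flint APIs):

```python
import time

class StatsCache:
    """Cache table statistics and refresh them when older than `ttl_seconds`
    (hypothetical sketch; not part of Flint)."""
    def __init__(self, fetch_stats, ttl_seconds=300):
        self._fetch = fetch_stats   # callback that collects fresh statistics
        self._ttl = ttl_seconds
        self._cache = {}            # table name -> (timestamp, stats)

    def get(self, table: str) -> dict:
        now = time.monotonic()
        entry = self._cache.get(table)
        if entry is None or now - entry[0] > self._ttl:
            self._cache[table] = (now, self._fetch(table))
        return self._cache[table][1]

# Usage: the query engine (or the data explorer hover) reads through the cache,
# paying the collection cost at most once per TTL window per table.
cache = StatsCache(lambda t: {"table": t, "rows": 5982891}, ttl_seconds=60)
stats = cache.get("otel_traces")
```

A production version would likely refresh in the background rather than on read, but the TTL-on-read shape shows the trade-off: stale statistics are bounded by the refresh interval.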

*(screenshot attached: 2023-10-27)*

Do you have any additional context?

dai-chen commented 1 month ago

It's not clear what the issue is. Are the existing `ANALYZE` and `DESC` statements in Spark not sufficient?
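For reference, the existing Spark SQL statements being referred to, which already produce the statistics shown above:

```sql
-- Collect table-level statistics (size in bytes, row count)
ANALYZE TABLE otel_traces COMPUTE STATISTICS;

-- Show table metadata, including location, serde, and the Statistics row
DESCRIBE TABLE EXTENDED otel_traces;
```

These run inside a Spark session; the feature request appears to be about exposing an equivalent result through a Flint REST-style endpoint rather than adding new analysis capability.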