Support multiple data source use case in existing data source selector

BionIT commented 5 months ago

Is your feature request related to a problem? Please describe.

In https://github.com/opensearch-project/OpenSearch-Dashboards/issues/5717, we found that the devtools page and the tutorial "add sample data" page both contain duplicated code for a data source picker. In https://github.com/opensearch-project/OpenSearch-Dashboards/issues/5712, we want to use a data source picker as well. To avoid code duplication, we proposed extracting the duplicated logic into one component that satisfies these use cases. After finding that an experimental data selector component was implemented in https://github.com/opensearch-project/OpenSearch-Dashboards/pull/5167, we suggest making it adaptable to our multiple data source use cases.

Describe the solution you'd like

To give full context, the picker design was finalized with UX in https://github.com/opensearch-project/OpenSearch-Dashboards/issues/5712. Please note that in the multiple data source feature, a data source means a connection to a cluster. We want the data source picker to behave as follows:

It should have a prepend label named Data connection.
When the page loads, the default option Local cluster should be selected.
The input field can be cleared, after which it shows Select a data connection as the placeholder.
When the picker loads, it should be able either to use a function provided as input to call the saved object API and get the data sources, or to use pre-fetched data source options. It should also be able to update the selected option in the parent component to inform the parent about the selected data source.
When an error happens, a toast message should show up.
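
The behaviour listed above could be sketched roughly as follows. This is only an illustration in plain TypeScript with no UI framework; `createPickerState`, `DataConnectionOption`, `fetchOptions`, `onSelect` and `notifyError` are hypothetical names and not the existing selector's API, and using an empty id for the local cluster is an assumption.

```typescript
// Hypothetical sketch of the picker behaviour described above; none of these
// names come from the OpenSearch Dashboards code base.
interface DataConnectionOption {
  id: string; // assumption: an empty id stands for the local cluster
  label: string;
}

const LOCAL_CLUSTER: DataConnectionOption = { id: '', label: 'Local cluster' };
const PREPEND_LABEL = 'Data connection';
const PLACEHOLDER = 'Select a data connection';

interface PickerProps {
  // Either pre-fetched options or a function that loads them
  // (for example a wrapper around a saved object API call).
  prefetchedOptions?: DataConnectionOption[];
  fetchOptions?: () => Promise<DataConnectionOption[]>;
  // Informs the parent component about the current selection.
  onSelect: (selected: DataConnectionOption | undefined) => void;
  // Stand-in for a toast notification.
  notifyError: (message: string) => void;
}

function createPickerState(props: PickerProps) {
  let options: DataConnectionOption[] = [LOCAL_CLUSTER];
  // The local cluster is selected by default when the page loads.
  let selected: DataConnectionOption | undefined = LOCAL_CLUSTER;

  return {
    prepend: PREPEND_LABEL,
    async load(): Promise<void> {
      try {
        const remote =
          props.prefetchedOptions ?? (await props.fetchOptions?.()) ?? [];
        options = [LOCAL_CLUSTER, ...remote];
      } catch (e) {
        props.notifyError(`Failed to load data connections: ${String(e)}`);
      }
    },
    select(id: string): void {
      selected = options.find((o) => o.id === id);
      props.onSelect(selected);
    },
    clear(): void {
      selected = undefined;
      props.onSelect(undefined);
    },
    // Text shown in the input: the selected label, or the placeholder when cleared.
    inputText(): string {
      return selected ? selected.label : PLACEHOLDER;
    },
    getOptions(): DataConnectionOption[] {
      return options;
    },
  };
}
```

Keeping the state logic separate from rendering is just one possible design; it lets the devtools and sample data pages share the same behaviour regardless of how each page renders the control.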

Describe alternatives you've considered

Have a separate data source picker for multiple data sources.


BionIT commented 5 months ago

Hi @mengweieric, thanks for sharing your knowledge about the data source selector. Please let me know if you have any concerns or questions regarding this request.

ashwin-pc commented 5 months ago

A few comments on the requirements:

Why would we want to clear a data source? Can't we always have the user select at least one data source?
What happens if the user does not have a local cluster?
The cluster is not a data source. We treat an index pattern, table, or index as a data source; how does the cluster fit into that?

These are honestly questions for @shanilpa and @kgcreative unless these are technical requirements :)

kgcreative commented 5 months ago

There's been some ambiguity in terms. In this case, this refers to a data source connection, but we've been using data source as shorthand. Let's align terminology with @dagney here, as this can quickly lead to awkward nomenclature.

kgcreative commented 5 months ago

For now, Select cluster seems to fix the term confusion

seraphjiang commented 5 months ago

There's been some ambiguity in terms. In this case, this refers to a data source connection, but we've been using data source as shorthand. Let's align terminology with @dagney here, as this can quickly lead to awkward nomenclature.

@kgcreative @dagneyb @BionIT @ashwin-pc @mengweieric

A Connection defines the parameters required to establish a connection. As we move to a multi-datasource world, there could be more data source types that we want to support, each with a different type of connection.

For example, OpenSearch/Elasticsearch is a RESTful service with a cluster-based architecture. In opensearch_dashboards.yml, we can use the following parameters to define which OpenSearch cluster OSD connects to.

opensearch.hosts: ["http://localhost:9200"]
opensearch.username: "opensearch_dashboards_system"
opensearch.password: "pass"

Let's also look at a non-OpenSearch data source type and its connection. For example, a MySQL connection needs host, auth, and protocol in order to connect.

mysql --host=localhost --user=myname --password mydb --protocol={TCP|SOCKET|PIPE|MEMORY}

Now we can see the common part: each connection needs to define at least host, auth, and protocol in order to connect.

However, we may ask: does DataSource equal Connection? Maybe yes, in our current setup. But let's look at a more comprehensive case. A customer sets up their own OpenSearch Dashboards instance with the config below and with multiple data source disabled, which means they use their own index name for the OpenSearch Dashboards index. When the customer chooses to enable multiple data source and migrate to that setup, we will not only ask for the connection information that OSD needs to connect to OpenSearch, but also allow the customer to configure their default index name, so that OSD knows which user-specific index stores the Dashboards saved object metadata and can locate the right index to load data for migration.

opensearchDashboards.index: ".my_opensearch_dashboards"

To summarize:

Each type of DataSource may contain a different type of connection as well as other information.
A Connection defines the parameters needed to connect to a local/remote data source.
A DataSource contains both connection information and other non-connection information.
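
One way to picture this distinction is the following minimal sketch. The `Connection` and `DataSource` types and their field names are hypothetical and are not the actual OpenSearch Dashboards saved object schema; the values are taken from the config examples in this comment.

```typescript
// Hypothetical types illustrating the summary above; these are not the actual
// OpenSearch Dashboards saved object schema.
interface Connection {
  host: string; // e.g. "http://localhost:9200"
  protocol: string; // e.g. "http" for OpenSearch or "TCP" for MySQL
  auth?: { username: string; password: string };
}

interface DataSource {
  type: 'opensearch' | 'mysql'; // assumption: example types only
  connection: Connection;
  // Non-connection information, e.g. the index that stores Dashboards saved objects.
  settings: Record<string, string>;
}

// A data source for the migration scenario: connection parameters plus the
// customer's own Dashboards index name.
const migrated: DataSource = {
  type: 'opensearch',
  connection: {
    host: 'http://localhost:9200',
    protocol: 'http',
    auth: { username: 'opensearch_dashboards_system', password: 'pass' },
  },
  settings: { 'opensearchDashboards.index': '.my_opensearch_dashboards' },
};
```

Under this sketch, a Connection alone would not be enough for migration, because the index name lives outside the connection parameters.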

Additional callout: an Index Pattern or a future DataView is not a data source. Index Pattern, DataView, and DataTable are logical/physical collections of data to be consumed for a specific usage. Data sources map to a Server, Cluster, or Database, which organizes Index Patterns, DataViews, and DataTables in a way that addresses more comprehensive business needs.

It is great to see that we are starting to think about reusable components, e.g. https://github.com/opensearch-project/OpenSearch-Dashboards/pull/5167. However, we need to be sure to design reusability for each specific scenario and scope.

Following the single responsibility principle, I suggest designing two separate picker components to avoid over-engineering: 1) DataSource picker, and 2) DataSet picker (for DataSet, DataView, DataTable, Index Pattern).

In the migration scenario for the multiple data source feature, we will need the first one.

ashwin-pc commented 5 months ago

@seraphjiang I'm with you here. After speaking to @bandinib-amzn, I too agree that we need two different pickers for the two separate use cases. If we can disambiguate the two names, that is ideal. Datasource is overloaded at the moment and we need alignment on it. I am not stuck on any particular name, and your suggestion of "1) DataSource picker, and 2) DataSet picker (for DataSet, DataView, DataTable, Index Pattern)" works for me. @kgcreative @dagneyb what do you think?

kgcreative commented 5 months ago

So there are four layers that we are conflating here, and I think that as we make more data available, this will continue to add confusion:

1. Cluster Connection - This is the connection between OSD and an OpenSearch backend.
   - Before the Multiple Data Source feature, this was the host cluster where OSD is housed. Once we have Multiple Data Sources enabled, this refers to any connection between OSD and any compatible source (presumably in the future, if we have the right abstractions, this could be ODBC connections, for example). Today, the MDS feature refers to this as a "Data Source".
2. Cross-cluster Connection / SQL / Federated Connections
   - We don't really have UI for this today, but cross-cluster connections here are sent from a coordinator cluster to another cluster. An example of this is cross-cluster search. I believe SQL connections work in a similar way as well, where the SQL plugin can connect to an external source, but this is proxied from a coordinator cluster. With MDS enabled, we have some potential confusion here, where you have OSD -> Cluster Connection -> Connection -> Index.
3. Abstraction
   - Today, we have Index Patterns, which are an abstraction for indexes. These also support scripted fields and run-time fields. When you have cross-cluster search, an index pattern can include cross-cluster indexes as part of the pattern definition. If we add additional Cluster & SQL or Spark connections, then we could include additional abstractions here as well, such as Data Views, materialized views, covered/skip indexes, accelerations, etc.
4. Data Source
   - Colloquially, customers think of data sources as: Indexes, Tables, S3 buckets, Prometheus metrics, etc -- this is where the data actually lives. In many plugins (for example, alerting), there is a "Data Source" input where people can choose an index. This is fairly inconsistent across the product.

I think we really need to nail down the terminology and different layers here, or things are going to get really confusing really fast.

dagneyb commented 5 months ago

Proposing names for these to clarify going forward:

"Data connection" - this can be used to represent 1 and 2 in @kgcreative's outline above
"Data set" - this can be used to cover 3 and 4 in @kgcreative's outline above

Thoughts? @brijos @anirudha @ashwin-pc @seraphjiang @BionIT @kamingleung
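
For reference, the proposed grouping can be written down as a small lookup. This is only a sketch of the proposal as stated in this thread, not agreed terminology; the `Layer` identifiers are hypothetical names for the four layers in the outline (cluster connection, cross-cluster/SQL/federated connection, abstraction, data source).

```typescript
// Hypothetical identifiers for the four layers in the outline above.
type Layer =
  | 'clusterConnection'
  | 'crossClusterOrFederatedConnection'
  | 'abstraction'
  | 'dataSource';

type ProposedTerm = 'Data connection' | 'Data set';

// Proposed user-facing term per layer: the two connection layers share
// "Data connection", the remaining two share "Data set".
const proposedTerm: Record<Layer, ProposedTerm> = {
  clusterConnection: 'Data connection',
  crossClusterOrFederatedConnection: 'Data connection',
  abstraction: 'Data set',
  dataSource: 'Data set',
};
```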

kgcreative commented 5 months ago

I agree with "Data connection" as a high-level concept, but I think "Data set" needs some deeper thought.

BionIT commented 5 months ago

Agree that "Data connection" makes sense in sample data and devtools when selecting the cluster. Since we are thinking about making the names less confusing, I think we should review the existing terms used in the dashboard and make sure they are clear in the next release. Right now, we have data source in management and also as a plugin, and the pickers refer to different data source concepts in the dashboard.

ashwin-pc commented 4 months ago

@dagneyb I'm aligned with the names.

opensearch-project / OpenSearch-Dashboards

Support multiple data source use case in existing data source selector #5790