Support Cloud Spanner connector

open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.

https://open-metadata.org

Apache License 2.0

Support Cloud Spanner connector #4221

Open yu-iskw opened 2 years ago

yu-iskw commented 2 years ago

Is your feature request related to a problem? Please describe. Because Cloud Spanner does not let us manage table-level and column-level metadata such as descriptions and labels, we have to manage them outside of Spanner.

Describe the solution you'd like It would be great to support a new connector for Google Cloud Spanner.

Describe alternatives you've considered I see no alternatives within OpenMetadata.

Additional context NA

yu-iskw commented 1 year ago

@OnkarVO7 I would love to take this issue. I don't completely understand the steps to implement a new connector. Can you tell me whether the following steps are correct? From my understanding, there are two steps. My other question is: can we use the metadata ingest command after implementing step two?

1. Implement a schema for Spanner to generate a model class.
- e.g. BigQuery: https://github.com/open-metadata/OpenMetadata/blob/main/openmetadata-spec/src/main/resources/json/schema/entity/services/connections/database/bigQueryConnection.json
2. Implement a source class for Spanner.
- e.g. BigQuery: https://github.com/open-metadata/OpenMetadata/blob/main/ingestion/src/metadata/ingestion/source/database/bigquery/metadata.py
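
To make the two steps concrete, here is a minimal sketch of the shape they might take. It does not use the real OpenMetadata classes: `SpannerConnection`, `SpannerSource`, and every field name below are hypothetical, standing in for the model generated from the JSON schema and for the source class respectively.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpannerConnection:
    """Hypothetical stand-in for the model class generated from a Spanner schema."""

    project_id: str                    # assumed field: GCP project hosting the instances
    instance_id: Optional[str] = None  # assumed field: None means "all instances"
    database_id: Optional[str] = None  # assumed field: None means "all databases"


class SpannerSource:
    """Hypothetical skeleton of a source class; not the OpenMetadata base class."""

    def __init__(self, connection: SpannerConnection) -> None:
        self.connection = connection

    def describe(self) -> str:
        # Report which part of the project this source would scan.
        target = self.connection.instance_id or "<all instances>"
        return f"{self.connection.project_id}/{target}"
```

In the real project the model class is generated from the JSON schema rather than written by hand, so this only illustrates the division of responsibilities between the two files.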

OnkarVO7 commented 1 year ago

@yu-iskw Yes, you are on the right path. You can reference this database connector PR for the other file changes that are required.

yu-iskw commented 1 year ago

@OnkarVO7 Thank you for sharing the demo PR. I am going to look into it.

yu-iskw commented 1 year ago

@OnkarVO7 I have a question about FQNs in OpenMetadata. As far as I know, an FQN in OM is basically composed of database, schema, and table.

Take BigQuery, for instance.

GCP project ID => Database in OM
BigQuery dataset ID => Schema in OM
BigQuery table ID => Table in OM

Meanwhile, a GCP project can have multiple Spanner instances, a Spanner instance can have multiple databases, and a Spanner database can have multiple tables. So how should we compose the FQN of a Spanner table? Specifically, I am wondering how we can keep track of the GCP project IDs for Spanner.

GCP project in which Spanner instances exist => ?
Spanner instance => Database in OM
Spanner database => Schema in OM
Spanner table => Table in OM

OnkarVO7 commented 1 year ago

@yu-iskw For this case, the entities can be ingested as you describe:

Spanner instance => Database in OM
Spanner database => Schema in OM
Spanner table => Table in OM

For the GCP project, you can create a tag in OM with the project_id and attach that tag to the database. Let me know if that makes sense.

cc: @pmbrull @harshach
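
A minimal sketch of the mapping above, assuming the service name is the leading FQN component and that components are joined with dots. `SpannerTableRef` and the `GcpProject.` tag prefix are hypothetical names chosen for illustration, and the sketch ignores quoting of names that themselves contain dots.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SpannerTableRef:
    """Hypothetical holder for the identifiers of one Spanner table."""

    service_name: str  # chosen by the user when registering the service
    instance_id: str   # Spanner instance => Database in OM
    database_id: str   # Spanner database => Schema in OM
    table_name: str    # Spanner table => Table in OM
    project_id: str    # GCP project, carried only as a tag

    def fqn(self) -> str:
        # Join the components in service, database, schema, table order.
        parts = [self.service_name, self.instance_id, self.database_id, self.table_name]
        return ".".join(parts)

    def database_tags(self) -> List[str]:
        # Hypothetical tag naming; the real classification name is up to the user.
        return [f"GcpProject.{self.project_id}"]
```

With this layout the project ID never appears in the FQN, so two projects that reuse the same instance ID would need to be registered as separate services.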

yu-iskw commented 1 year ago

@OnkarVO7 I got it. Let me check one more option. At the moment, GCSCredentials accepts either a single GCP project or multiple GCP projects. If we restrict the attribute to a single GCP project, we can also take advantage of the service name in OM. What do you think?

GCP project in which Spanner instances exist => Service name (multiple projects under GCSCredentials would be prohibited)
Spanner instance => Database in OM
Spanner database => Schema in OM
Spanner table => Table in OM

OnkarVO7 commented 1 year ago

@yu-iskw That might not be recommended here. The main goal of the service_name, or of services in OM in general, is to let users enter any name they choose and add multiple services for the same source, each configured as they please.

Due to this we never edit the service_name on the ingestion side.

If we add logic that maps GCP project ID => Service name, the above goal will be invalidated.

yu-iskw commented 1 year ago

@OnkarVO7 I understand. Thanks!

yu-iskw commented 1 year ago

@OnkarVO7 I am wondering how we can create a SQLAlchemy engine, because python-spanner-sqlalchemy requires a project ID, an instance ID, and a database ID in the URL, as in spanner+spanner:///projects/project-id/instances/instance-id/databases/database-id. So, if we want to ingest metadata from multiple instances and the databases in them, we have to change the SQLAlchemy engine dynamically. Of course, we can collect the instances in a GCP project and the databases in a Spanner instance. However, I don't know a good way to pass that information to get_connection dynamically, because get_connection basically receives only a connection object like BigQueryConnection. We would have to change the instance and database information in the connection object on the fly, which can get a bit hacky. I would like to know how to properly deal with multiple connections in a connector.
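
One way to avoid mutating a shared connection object is to derive a fresh copy per database and build the URL from the copy. The sketch below is only an illustration under assumptions: `SpannerConnection`, `build_url`, and `per_database_connections` are hypothetical names, and the instance-to-databases mapping is passed in by hand rather than discovered from the Spanner API.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class SpannerConnection:
    """Hypothetical connection object; the field names are assumptions."""

    project_id: str
    instance_id: str = ""
    database_id: str = ""


def build_url(conn: SpannerConnection) -> str:
    # URL layout quoted in the comment above for python-spanner-sqlalchemy.
    return (
        f"spanner+spanner:///projects/{conn.project_id}"
        f"/instances/{conn.instance_id}/databases/{conn.database_id}"
    )


def per_database_connections(
    base: SpannerConnection, topology: Dict[str, List[str]]
) -> Iterator[SpannerConnection]:
    # Yield one immutable copy per (instance, database) pair; `base` is untouched.
    for instance_id, database_ids in topology.items():
        for database_id in database_ids:
            yield replace(base, instance_id=instance_id, database_id=database_id)
```

Each resulting URL could then be handed to `sqlalchemy.create_engine`, assuming the Spanner dialect is installed, so that one engine exists per database instead of one engine being reconfigured in place.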

yu-iskw commented 1 year ago

@OnkarVO7 I made a pull request for the feature. I am still wondering how we should implement it. Let's discuss it on the pull request.

https://github.com/open-metadata/OpenMetadata/pull/10195