open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.6k stars 1.06k forks source link

Open Metadata Google Sheets connector #11733

Open gnomolio opened 1 year ago

gnomolio commented 1 year ago

Background

Google Sheets stores a lot of valuable data, including:

  1. Tables that were collected by manual data input or via Google Forms
  2. Tables that were generated to export data from RDBMS for reporting
  3. Tables with reference and master data (e.g. dictionaries)

All these cases relate to data that need to be governed. For instance, data are exported from RDBMS to Google Sheets to be shared with contractors, auditors, regulators and other external stakeholders.

Objectives

Metadata ingestion from Google sheets can help to:

  1. Catch Google Sheets table schemas
  2. Catch lineage to source RDBMS tables
  3. Keep Google Sheets data owners up to date

Metadata mapping

Here is a mapping between Open Metadata and Google Sheets:

Google Sheet attribute Open Metadata attribute Google API endpoint
Drive name (Google Drive file API) Database name drives.get
name Table name files.get
description Table description files.get
Sheet (Google Sheets API) Table name spreadsheets.get
Column (Google Sheets API) Column spreadsheets.get
Owner (Google Drive file API) Data owner files.get
lastModifyingUser Custom attribute files.get
createdTime Custom attribute files.get
modifiedTime Custom attribute files.get
version Custom attribute files.get
webViewLink Custom attribute files.get
size Custom attribute files.get
file id Custom attribute files.get
trashed deleted status files.get
explicitlyTrashed deleted status files.get
trashingUser Custom attribute files.get
trashedTime Custom attribute files.get

Metadata ingestion flow

  sequenceDiagram

    User               ->>        Google Cloud:         1.1. create service account and json key
    User               ->>        Google Drive:         2.1. share folders to service account
    User               ->>        OM connector:         2.2. configure Open Metadata connector

    OM connector       ->>        Google Drive API:     3.1. request drives
    Google Drive API   -->>       OM connector:         3.2. got accessible drives 
    OM connector       ->>        Google Drive API:     3.3. request sheets for drives
    Google Drive API   -->>       OM connector:         3.4. got google sheets

    OM connector       ->>        Google Sheets API:    4.1. check if metadata sheet exists in spreadsheet
    Google Sheets API  -->>       OM connector:         4.2. got sheets with filled metadata sheet

    OM connector       ->>        Google Sheets API:    5.1. request metadata sheet
    OM connector       ->>        OM connector:         5.2. parse metadata
    OM connector       ->>        Open Metadata:        5.3. add metadata to Open Metadata

Metadata template sheet

We propose a metadata template sheet to be copied to spreadsheet and filled.

sheet* column* description owner_email source_table_full_name
{sheet_name} {column_name} {description of the column} {user@email.com} {service}.{schema}.{table_1}, {service}.{schema}.{table_2}

Sheet owners

According to mapping above:

  1. spreadsheet owner -> Open Metadata database schema owner
  2. sheet owner -> Open Metadata table owner

Some issues are related with this:

  1. If there are more than one spreadsheet owner in Google Drive, actual owner can be defined as a user in lastModifyingUser
  2. Sheet owners can be defined in the metadata template sheet (optionally)

Ingestion controls

Some controls should be included to connector:

  1. Custom attributes (include / exclude)
  2. Ingest spreadsheet owner (true / false)
  3. Ingest data lineage (true / false)

Schema changes

Technically Google Sheets are not relational tables, because actually data are stored in grids with cells. It makes tricky to detect schema changes. Thus, once a data structure was changed in a spreadsheet, metadata sheet should be updated by owner. After that, Open Metadata should update version of related table entity, if schema was changed.

Data lineage

To build table data lineage, source tables should be defined in the metadata sheet (see above).

Authorization

To authorize Open Metadata connector in Google Drive and Sheet API, user should previously:

  1. create Google Cloud service account
  2. create json key file for this service account
  3. share spreadsheets or parent folders with the service account
  4. use json key file for Open Metadata connection

Limitations

Google Sheets API has read requests limit (60 requests per minutes per user per project).

Out of scope

Ingestion of sheet recalculations (can be ingested as pipelines later).

haquem1 commented 1 year ago

@gnomolio @pmbrull I'd love to take a look at this issue!

pmbrull commented 1 year ago

thanks @haquem1 , assigned it to you