OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
Background
Google Sheets stores a lot of valuable data, including:
Tables that were collected by manual data input or via Google Forms
Tables that were generated to export data from RDBMS for reporting
Tables with reference and master data (e.g. dictionaries)
All of these cases involve data that need to be governed. For instance, data are exported from an RDBMS to Google Sheets so that they can be shared with contractors, auditors, regulators, and other external stakeholders.
Objectives
Metadata ingestion from Google Sheets can help to:
Capture Google Sheets table schemas
Capture lineage back to the source RDBMS tables
Keep Google Sheets data owners up to date
Metadata mapping
Here is a mapping between OpenMetadata and Google Sheets:
Metadata ingestion flow
sequenceDiagram
User ->> Google Cloud: 1.1. create service account and JSON key
User ->> Google Drive: 2.1. share folders with service account
User ->> OM connector: 2.2. configure OpenMetadata connector
OM connector ->> Google Drive API: 3.1. request drives
Google Drive API -->> OM connector: 3.2. return accessible drives
OM connector ->> Google Drive API: 3.3. request spreadsheets in drives
Google Drive API -->> OM connector: 3.4. return spreadsheets
OM connector ->> Google Sheets API: 4.1. check if metadata sheet exists in spreadsheet
Google Sheets API -->> OM connector: 4.2. return spreadsheets with a filled metadata sheet
OM connector ->> Google Sheets API: 5.1. request metadata sheet
OM connector ->> OM connector: 5.2. parse metadata
OM connector ->> OpenMetadata: 5.3. add metadata to OpenMetadata
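Steps 3.1 to 5.1 of the flow above can be sketched in Python with the Google API calls hidden behind plain callables, so the control flow is visible without network access. The function name, the `METADATA_SHEET_TITLE` value, and the dictionary shapes are assumptions made for illustration; they are not part of the proposal or of any OpenMetadata API.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical title of the metadata template sheet inside each spreadsheet.
METADATA_SHEET_TITLE = "_metadata"


def collect_metadata(
    list_spreadsheet_ids: Callable[[], Iterable[str]],
    list_sheet_titles: Callable[[str], List[str]],
    read_sheet_rows: Callable[[str, str], List[List[str]]],
) -> Dict[str, List[List[str]]]:
    """Return raw metadata rows keyed by spreadsheet ID (steps 3.1 to 5.1)."""
    collected: Dict[str, List[List[str]]] = {}
    for spreadsheet_id in list_spreadsheet_ids():
        # Step 4.1: skip spreadsheets that have no metadata sheet.
        if METADATA_SHEET_TITLE not in list_sheet_titles(spreadsheet_id):
            continue
        # Step 5.1: fetch the metadata sheet so it can be parsed later.
        collected[spreadsheet_id] = read_sheet_rows(spreadsheet_id, METADATA_SHEET_TITLE)
    return collected


# In-memory stand-in for the Drive and Sheets APIs (illustrative data only).
FAKE_DRIVE = {
    "sheet-a": {"_metadata": [["table", "orders"]], "Orders": [["id", "total"]]},
    "sheet-b": {"Notes": [["free text"]]},
}

result = collect_metadata(
    list_spreadsheet_ids=lambda: FAKE_DRIVE.keys(),
    list_sheet_titles=lambda sid: list(FAKE_DRIVE[sid]),
    read_sheet_rows=lambda sid, title: FAKE_DRIVE[sid][title],
)
print(result)
```

Passing the API calls in as callables keeps the sketch testable; a real connector would wrap the Drive and Sheets clients behind the same three functions.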
Metadata template sheet
We propose a metadata template sheet that is copied into each spreadsheet and filled in.
Sheet owners
According to the mapping above:
spreadsheet owner -> OpenMetadata database schema owner
sheet owner -> OpenMetadata table owner
This raises some issues:
If a spreadsheet has more than one owner in Google Drive, the actual owner can be taken from the lastModifyingUser field of the file
Sheet owners can optionally be defined in the metadata template sheet
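The owner rules above could look roughly like this. The sketch assumes the Drive file resource was requested with the `owners` and `lastModifyingUser` fields; the function names and the fallback order are illustrative choices, not settled behaviour.

```python
from typing import Optional


def resolve_spreadsheet_owner(drive_file: dict) -> Optional[str]:
    """Pick one owner e-mail from a Drive file resource (a plain dict here)."""
    owners = [o.get("emailAddress") for o in drive_file.get("owners", [])]
    owners = [email for email in owners if email]
    if len(owners) == 1:
        return owners[0]
    # Zero or several owners: fall back to the user who last modified the file.
    last_user = drive_file.get("lastModifyingUser") or {}
    return last_user.get("emailAddress") or (owners[0] if owners else None)


def resolve_sheet_owner(template_owner: Optional[str], spreadsheet_owner: Optional[str]) -> Optional[str]:
    """Prefer the owner written in the metadata template sheet, if any."""
    template_owner = (template_owner or "").strip()
    return template_owner or spreadsheet_owner
```

For example, a file with two owners and `lastModifyingUser` set to the second one resolves to the second owner, and a sheet with an empty owner cell inherits the spreadsheet owner.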
Ingestion controls
The connector should expose the following controls:
Custom attributes (include / exclude)
Ingest spreadsheet owner (true / false)
Ingest data lineage (true / false)
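One possible shape for these controls is a small options object. The class and field names below are hypothetical and do not come from the OpenMetadata connector schema; they only mirror the three controls listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class IngestionControls:
    """Hypothetical connector options mirroring the three controls."""

    include_custom_attributes: Optional[List[str]] = None  # None means "include all"
    exclude_custom_attributes: List[str] = field(default_factory=list)
    ingest_spreadsheet_owner: bool = True
    ingest_data_lineage: bool = True

    def filter_attributes(self, attributes: Dict[str, str]) -> Dict[str, str]:
        """Apply the include list first, then the exclude list."""
        kept = {}
        for name, value in attributes.items():
            if self.include_custom_attributes is not None and name not in self.include_custom_attributes:
                continue
            if name in self.exclude_custom_attributes:
                continue
            kept[name] = value
        return kept


controls = IngestionControls(exclude_custom_attributes=["internal_note"])
filtered = controls.filter_attributes({"retention": "1y", "internal_note": "draft"})
print(filtered)
```

Exclusion wins over inclusion in this sketch, so an attribute named in both lists is dropped.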
Schema changes
Technically, Google Sheets are not relational tables, because the data are stored in grids of cells. This makes schema changes tricky to detect.
Thus, once the data structure of a spreadsheet has changed, the owner should update the metadata sheet. After that, OpenMetadata should update the version of the related table entity if the schema has changed.
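Because the schema comes from the metadata sheet rather than from the grid, a change can be detected by comparing the column list from the previous ingestion with the current one. This is a minimal sketch; representing a column as a `(name, data type)` pair is an assumption about what the metadata sheet records.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, data type) as declared in the metadata sheet


def diff_schema(old: List[Column], new: List[Column]) -> Dict[str, List[str]]:
    """List columns that were added, removed, or changed type."""
    old_types = dict(old)
    new_types = dict(new)
    return {
        "added": [name for name in new_types if name not in old_types],
        "removed": [name for name in old_types if name not in new_types],
        "retyped": [
            name
            for name in new_types
            if name in old_types and old_types[name] != new_types[name]
        ],
    }


def schema_changed(old: List[Column], new: List[Column]) -> bool:
    """True when any column was added, removed, or changed type."""
    return any(diff_schema(old, new).values())
```

Only when `schema_changed` returns true would the connector need to push an updated table entity.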
Data lineage
To build table-level data lineage, the source tables should be listed in the metadata sheet (see "Metadata template sheet" above).
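As an illustration, lineage edges could be derived from a single metadata-sheet cell that lists the source tables. The cell format (names separated by commas or line breaks) and the example table names are assumptions; the proposal does not fix the template layout.

```python
from typing import List, Tuple


def parse_source_tables(cell_value: str) -> List[str]:
    """Split a metadata-sheet cell into unique source table names, keeping order."""
    names: List[str] = []
    for raw in cell_value.replace("\n", ",").split(","):
        name = raw.strip()
        if name and name not in names:
            names.append(name)
    return names


def lineage_edges(cell_value: str, target_table: str) -> List[Tuple[str, str]]:
    """Return (source, target) pairs, one per source table."""
    return [(source, target_table) for source in parse_source_tables(cell_value)]


edges = lineage_edges(
    "pg.sales.public.orders, pg.sales.public.customers\npg.sales.public.orders",
    "gsheets.reports.q1.Orders",
)
print(edges)
```

Each pair would then be turned into a lineage request against the corresponding table entities, provided the source tables are already catalogued.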
Authorization
To authorize the OpenMetadata connector against the Google Drive and Google Sheets APIs, the user must first:
create a Google Cloud service account
create a JSON key file for this service account
share the spreadsheets or their parent folders with the service account
use the JSON key file for the OpenMetadata connection
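The steps above might translate into the following sketch. It assumes the `google-auth` and `google-api-python-client` packages and read-only scopes; the helper names are illustrative. `service_account_email` reads the address that folders must be shared with, and `build_clients` creates the two API clients.

```python
import json

# Read-only scopes for browsing Drive and reading sheet contents (assumed sufficient).
SCOPES = [
    "https://www.googleapis.com/auth/drive.readonly",
    "https://www.googleapis.com/auth/spreadsheets.readonly",
]


def service_account_email(key_json: str) -> str:
    """Return the address that spreadsheets or folders must be shared with."""
    key = json.loads(key_json)
    if key.get("type") != "service_account":
        raise ValueError("not a service account key")
    return key["client_email"]


def build_clients(key_path: str):
    """Create Drive v3 and Sheets v4 clients from a JSON key file."""
    # Imported lazily so the helper above works without the Google libraries.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    credentials = service_account.Credentials.from_service_account_file(
        key_path, scopes=SCOPES
    )
    drive = build("drive", "v3", credentials=credentials)
    sheets = build("sheets", "v4", credentials=credentials)
    return drive, sheets
```

Requesting read-only scopes keeps the connector from being able to modify shared spreadsheets, which fits a metadata-only use case.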
Limitations
The Google Sheets API limits read requests to 60 per minute per user per project.
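One way to stay under this limit is to throttle calls on the client side. The sliding-window limiter below is a sketch, not part of the proposal; the class name is hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without waiting.

```python
import time
from collections import deque
from typing import Callable, Deque


class ReadRateLimiter:
    """Allow at most max_requests calls within any period_seconds window."""

    def __init__(
        self,
        max_requests: int = 60,
        period_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.max_requests = max_requests
        self.period_seconds = period_seconds
        self._clock = clock
        self._sleep = sleep
        self._sent: Deque[float] = deque()  # timestamps of recent requests

    def acquire(self) -> None:
        """Block until one more request fits into the current window."""
        while True:
            now = self._clock()
            # Forget requests that have left the window.
            while self._sent and now - self._sent[0] >= self.period_seconds:
                self._sent.popleft()
            if len(self._sent) < self.max_requests:
                self._sent.append(now)
                return
            # Wait until the oldest request leaves the window, then re-check.
            self._sleep(self._sent[0] + self.period_seconds - now)
```

Calling `limiter.acquire()` before every Sheets read keeps a single connector process within the quota; several processes sharing one service account would need a shared limiter or retry-on-429 handling instead.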
Out of scope
Ingestion of sheet recalculations (can be ingested as pipelines later).