Onboard ESSD dataset using Open Metadata

MichaelTiemannOSC commented 2 years ago

The ESSD dataset (s3://redhat-osc-physical-landing-647521352890/ESSD/) can be an exemplar for Open Metadata onboarding. The dataset comes with data dictionaries, and there is an Iceberg data pipeline notebook at https://github.com/os-climate/essd-ingest-pipeline/blob/iceberg/notebooks/osc-essd-ingest.ipynb.

As a data pipeline implementor, I want to work from a common template that describes the metadata needed to connect this dataset to a data catalog browser and to understand the various levels of data interoperability that can be achieved/advertised by properly instantiating all the metadata reasonable for this dataset:

[x] Still need to update OpenMetadata to 0.12.2 or later. Given that we are very much in the development stage, it might make sense to install 0.13.0 preview, released yesterday: https://github.com/open-metadata/OpenMetadata/releases/tag/0.13.0-preview
[ ] The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.
[ ] dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore
[ ] Should SQL files generated in the dbt process be preserved in GitHub (as part of data reproducibility) or should they be ignored as purely derived files? What other rules should apply to the other files that dbt generates or uses?
[ ] The sample WRI README.md file (https://github.com/os-climate/wri-gppd-ingestion-pipeline/blob/master/README.md) is still just a project template file and does not describe the full theory of all the steps and components needed to fully implement a proper ingestion pipeline (making it difficult for the ESSD dataset to further exemplify and elaborate what it should be doing).
[ ] We should aim to make ESSD easily comparable with other global CO2 data (such as ClimateTrace) and demonstrate how OM details facilitate both comparability and consequences of data updates.

@HeatherAck for visibility

caldeirav commented 2 years ago

Dependency on issue #202 in order to configure DBT pipeline for metadata ingestion

HeatherAck commented 2 years ago

Issue nearly complete, pending unusual behavior where files are created in the incorrect directory. See c.FileCheckpoints.checkpoint_dir = '' in https://stackoverflow.com/questions/51887758/is-there-a-way-to-disable-saving-to-checkpoints-for-jupyter-notebooks

HeatherAck commented 2 years ago

restart mid November upon MT return

MichaelTiemannOSC commented 1 year ago

I've updated the task list with some substantial items that should be discussed and prioritized.

HeatherAck commented 1 year ago

@MightyNerdEric and @rynofinn - I will create Jira issues for the first two items above; not sure if you can help with the third item. The rest look like they need @caldeirav.

HeatherAck commented 1 year ago

Related LF Jira tickets: (1) upgrade version of OpenMetadata: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24850 (2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851 (3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

HeatherAck commented 1 year ago

in backlog - behind Iris

HeatherAck commented 1 year ago

@MightyNerdEric working on upgrade of open metadata - learning helm (extract manifests directly)

HeatherAck commented 1 year ago

@MightyNerdEric needs access to Quay to update base image, also needs source of container code in operate first; @redmikhail to help with access issues for Eric and @ryanaslett

HeatherAck commented 1 year ago

prefer higher version than 317

HeatherAck commented 1 year ago

have access to operate first, but need access to os-climate; @caldeirav do you have access? @erikerlandson please grant access to rest of LF team and Mikhail

HeatherAck commented 1 year ago

@erikerlandson to provide quay access

HeatherAck commented 1 year ago

creator privileges added

eb-oss commented 1 year ago

The current dbt implementation requires copying sensitive credential information from credentials.env to ~/.dbt/profiles.yml or some such. This is bad and ugly and should be fixed so that dbt can get that information from env variables, like everything else.

dbt creates meaningful files in a target/ subdirectory. The aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore

@MichaelTiemannOSC Regarding these two issues: For the first one, could you give some additional information? Where is credentials.env stored? I'm guessing that ~/.dbt/profiles.yml is a file that's on the OpenMetadata server? We certainly are capable of pulling secrets from Vault into configs on the cluster, but I need a bit more info. I tried looking into dbt setup, but I couldn't find anything in our current configuration that interacts with it.

On the second item, what repo are we talking about here? When does dbt create these files? If we don't want them covered by the .gitignore, I'm guessing these are files that we want to commit to a repo, so I'll need to know when/how they're being generated in order to find a solution.

MichaelTiemannOSC commented 1 year ago

credentials.env is meant to be stored far, far away from GitHub, but within a user's ability to load the file from a home directory. This is the library that our data users are supposed to use to read that file: https://github.com/os-climate/osc-ingest-tools
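
To make the intended flow concrete, here is a minimal sketch of loading a credentials.env file from outside the repository into environment variables. It is a standard-library stand-in, not the osc-ingest-tools API; the function name `load_env_file` and the variable names in the usage note (`TRINO_USER`, `TRINO_PASSWD`) are assumptions for illustration only.

```python
import os
from pathlib import Path


def load_env_file(path, environ=None):
    """Copy simple KEY=VALUE lines from a dotenv-style file into an environment mapping.

    Hypothetical helper: values already present in the mapping are kept,
    so a variable exported in the shell takes precedence over the file.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        environ.setdefault(key, value)
        loaded[key] = environ[key]
    return loaded
```

A caller might run `load_env_file(Path.home() / "credentials.env")` once at the top of a notebook and then read `os.environ["TRINO_USER"]` and `os.environ["TRINO_PASSWD"]`, so that the secrets never need to be written into a file that lives in the repository.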

dbt is part of the new pipeline that Vincent rolled out in August, and which I've been trying to replicate since October (when I was interacting heavily with trino, trino-client, OpenMetadata, and dbt developers). It's listed in the requirements for #234, and #234 is intended to provide the larger context of what we need. This particular issue was filed because of the great surprise (and potential security leakage) due to dbt's default way of handling credentials.

Just this morning, Bryon Baker made a suggestion about putting credentials.env into all .gitignore files for OS-Climate, to reduce the risk of credentials leakage. But it doesn't solve this problem, because dbt wants to read from its own files--a leak waiting to happen.

Vincent's open metadata demo gives the larger context of how dbt fits into our world. This branch (https://github.com/os-climate/essd-ingest-pipeline/tree/iceberg-dbt) of the ESSD pipeline also gives examples of dbt usage. Vincent is just finishing up the delivery of some major training--hopefully his materials spell this out better. I've only been trying to replicate what I see him doing in another context, documenting as I go. Which means that by no means do I have the larger picture of what "should be". But this issue raises "what should not be", and that is a file, necessary for the operation of dbt, that would wind up leaking credentials because nothing about "~/.dbt/profiles.yml" makes it look like it contains secrets.
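
One possible way out, sketched under assumptions: dbt allows `{{ env_var('NAME') }}` inside profiles.yml, so the file can hold only references to environment variables rather than the secrets themselves. The profile name, catalog, schema, and variable names below are hypothetical, and the field names follow my understanding of the dbt-trino adapter's LDAP method; they should be checked against the adapter documentation before use.

```python
from pathlib import Path

# Hypothetical profile: every credential is an env_var() reference, never a literal.
PROFILE_TEMPLATE = """\
essd_ingest:            # hypothetical profile name
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap      # assumed authentication method
      user: "{{ env_var('TRINO_USER') }}"
      password: "{{ env_var('TRINO_PASSWD') }}"
      host: "{{ env_var('TRINO_HOST') }}"
      port: 443
      database: example_catalog   # hypothetical catalog name
      schema: essd                # hypothetical schema name
      threads: 1
"""


def write_profile(profiles_dir):
    """Write the secret-free profiles.yml into profiles_dir and return its path."""
    profiles_dir = Path(profiles_dir)
    profiles_dir.mkdir(parents=True, exist_ok=True)
    target = profiles_dir / "profiles.yml"
    target.write_text(PROFILE_TEMPLATE)
    return target


def contains_secret(text, secrets):
    """Return True if any non-empty secret value appears verbatim in text."""
    return any(s and s in text for s in secrets)
```

With a file like this, profiles.yml could in principle be committed alongside the pipeline and located with dbt's `--profiles-dir` option or the `DBT_PROFILES_DIR` environment variable, while `contains_secret` could serve as a pre-commit style check that no loaded credential value has leaked into it.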

MichaelTiemannOSC commented 1 year ago

Vincent has recently updated the Data Commons documentation. While written at a high level and aimed mostly at developers, the granularity and completeness encourage the platform team to add information relevant to the ops side of the platform: components, recipes, smoke tests, admin / dev / user roles, provisioning, health checks, etc. See https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md

I encourage all who are responsible for keeping these systems running to read this documentation as a way to understand what developers (and ultimately users) are expecting, and to write such documentation that it's easy for existing and new platform maintainers to also find what they need to find and do what they need to do.

HeatherAck commented 1 year ago

blocked pending OpenMetadata 0.13.1 upgrade

MichaelTiemannOSC commented 1 year ago

OM 0.13.1 is available. The evil and insidious profiles.yml file, which requires secrets but should not contain secrets, remains unaddressed. Filing a new issue about that.

MichaelTiemannOSC commented 1 year ago

Now that we are unblocked on OpenMetadata, there are several other questions that need to be answered, e.g., whether to store dbt intermediate files in GitHub, and various .gitignore problems that may or may not be solved by the latest template. Please consider this a bump to addressing those questions.

HeatherAck commented 1 year ago

@MightyNerdEric will work on this issue for week of 13-Feb

HeatherAck commented 1 year ago

@caldeirav can you address: (2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851 (3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852

os-climate / os_c_data_commons

Onboard ESSD dataset using Open Metadata #183