Open MichaelTiemannOSC opened 2 years ago
Dependency on issue #202 in order to configure DBT pipeline for metadata ingestion
Issue nearly complete - pending unusual behavior where files are created in an incorrect directory (c.FileCheckpoints.checkpoint_dir = ''). See https://stackoverflow.com/questions/51887758/is-there-a-way-to-disable-saving-to-checkpoints-for-jupyter-notebooks
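For context, the setting referenced above lives in Jupyter's config file; a minimal sketch of the relevant fragment (the default value shown is Jupyter's documented default, not a recommendation from this thread):

```python
# jupyter_notebook_config.py (config fragment; `c` is supplied by Jupyter)
# checkpoint_dir is resolved relative to each notebook's own directory.
# The default is '.ipynb_checkpoints'; setting it to '' makes checkpoint
# files land next to the notebook itself, which can look like files being
# created in the wrong directory.
c.FileCheckpoints.checkpoint_dir = '.ipynb_checkpoints'
```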
restart mid November upon MT return
I've updated the task list with some substantial items that should be discussed and prioritized.
@MightyNerdEric and @rynofinn - I will create Jira issues for the first two items above; not sure if you can help with the 3rd item - the rest look like they need @caldeirav
Related LF Jira tickets: (1) upgrade version of OpenMetaData: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24850 (2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851 (3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852
in backlog - behind Iris
@MightyNerdEric working on the OpenMetadata upgrade - learning Helm (to extract manifests directly)
@MightyNerdEric needs access to Quay to update the base image, and also needs the source of the container code in Operate First; @redmikhail to help with access issues for Eric and @ryanaslett
prefer a version higher than 317
have access to Operate First, but need access to os-climate; @caldeirav do you have access? @erikerlandson please grant access to the rest of the LF team and to Mikhail
@erikerlandson to provide quay access
creator privileges added
- The current dbt implementation requires copying sensitive credential information from credentials.env into ~/.dbt/profiles.yml or some such. This is bad and ugly, and should be fixed so that dbt reads that information from environment variables, like everything else.
- dbt creates meaningful files in a target/ subdirectory, but the aicoe .gitignore file ignores any and all directories named target/ (due to Pybuilder). We need to delete that noise from the ingestion pipeline template or propagate a better .gitignore.
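One possible fix for the second bullet, assuming the blanket Pybuilder rule can be kept: add a negation pattern that re-includes dbt's output directory. The `ingestion/` path below is illustrative, not the template's actual layout:

```gitignore
# Pybuilder build output, ignored at any depth
target/
# Re-include dbt's generated artifacts in the ingestion pipeline.
# Negating the directory itself (rather than files inside an already
# excluded parent) is permitted by gitignore semantics; adjust the
# path to the real repo layout.
!ingestion/target/
```

In .gitignore, the last matching pattern wins, so the negation must come after the blanket `target/` rule.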
@MichaelTiemannOSC Regarding these two issues: for the first one, could you give some additional information? Where is credentials.env stored? I'm guessing that ~/.dbt/profiles.yml is a file on the OpenMetadata server? We are certainly capable of pulling secrets from Vault into configs on the cluster, but I need a bit more info. I tried looking into dbt setup, but I couldn't find anything in our current configuration that interacts with it.
On the second item, what repo are we talking about here? When does dbt create these files? If we don't want them covered by the .gitignore, I'm guessing these are files that we want to commit to a repo, so I'll need to know when/how they're being generated in order to find a solution.
credentials.env is meant to be stored far, far away from GitHub, but within a user's ability to load the file from a home directory. This is the library that our data users are supposed to use to read that file: https://github.com/os-climate/osc-ingest-tools
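The loading pattern is roughly this (a minimal stdlib sketch of the idea, not osc-ingest-tools' actual API; the function name and the TRINO_* key names are illustrative):

```python
import os
import pathlib


def load_credentials_env(path=None):
    """Load KEY=VALUE lines from credentials.env into os.environ.

    Minimal sketch only; osc-ingest-tools is the supported way to do this.
    Existing environment variables are not overwritten.
    """
    env_path = pathlib.Path(path) if path else pathlib.Path.home() / "credentials.env"
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blank lines, comments, and malformed lines
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

Because the values end up in the process environment, anything downstream (trino clients, notebooks, pipeline steps) can read them without any credential file ever being copied around.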
dbt is part of the new pipeline that Vincent rolled out in August, and which I've been trying to replicate since October (when I was interacting heavily with trino, trino-client, OpenMetadata, and dbt developers). It's listed in the requirements for #234, and #234 is intended to provide the larger context of what we need. This particular issue was filed because of the great surprise (and potential security leakage) due to dbt's default way of handling credentials.
Just this morning, Bryon Baker made a suggestion about putting credentials.env into all .gitignore files for OS-Climate, to reduce the risk of credentials leakage. But it doesn't solve this problem, because dbt wants to read from its own files--a leak waiting to happen.
Vincent's open metadata demo gives the larger context of how dbt fits into our world. This branch (https://github.com/os-climate/essd-ingest-pipeline/tree/iceberg-dbt) of the ESSD pipeline also gives examples of dbt usage. Vincent is just finishing up the delivery of some major training--hopefully his materials spell this out better. I've only been trying to replicate what I see him doing in another context, documenting as I go. Which means that by no means do I have the larger picture of what "should be". But this issue raises "what should not be", and that is a file, necessary for the operation of dbt, that would wind up leaking credentials because nothing about "~/.dbt/profiles.yml" makes it look like it contains secrets.
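dbt does document a way out of this: profiles.yml is rendered with Jinja, so secrets can be pulled from the environment via env_var() rather than written into the file. A hedged sketch against a Trino target follows; the profile name, host, catalog/schema values, and variable names are all illustrative, and the field names follow the dbt-trino adapter's profile format:

```yaml
# ~/.dbt/profiles.yml -- contains no secrets; dbt resolves env_var()
# at runtime from the same environment that credentials.env populates.
osc_trino:
  target: dev
  outputs:
    dev:
      type: trino
      method: ldap
      host: "{{ env_var('TRINO_HOST') }}"
      port: 443
      user: "{{ env_var('TRINO_USER') }}"
      password: "{{ env_var('TRINO_PASSWD') }}"
      database: osc_datacommons_dev
      schema: demo
```

With this shape, profiles.yml can safely live in version control, and only credentials.env (which never goes to GitHub) carries secrets.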
Vincent has recently updated the Data Commons documentation. While written at a high level and aimed mostly at developers, the granularity and completeness encourages the platform team to add information relevant to the ops side of the platform: components, recipes, smoke tests, admin / dev / user roles, provisioning, health checks, etc. See https://github.com/os-climate/os_c_data_commons/blob/main/os-c-data-commons-developer-guide.md
I encourage all who are responsible for keeping these systems running to read this documentation as a way to understand what developers (and ultimately users) are expecting, and to write similar documentation so that existing and new platform maintainers can easily find what they need to find and do what they need to do.
blocked pending OpenMetadata 13.1 upgrade
OM 13.1 is available. The evil and insidious profiles.yml file, which requires secrets but should not contain secrets, remains unaddressed. Filing a new issue about that.
Now that we are unblocked on OpenMetadata, there are several other questions that need to be answered, e.g. whether to store dbt intermediate files in GitHub, and various .gitignore problems that may or may not be solved by the latest template. Please consider this a bump to addressing those questions.
@MightyNerdEric will work on this issue for week of 13-Feb
@caldeirav can you address: (2) improve dbt credential handling: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24851 (3) enable dbt file creation in a target/ subdirectory: https://jira.linuxfoundation.org/plugins/servlet/desk/portal/2/IT-24852
The ESSD dataset (s3://redhat-osc-physical-landing-647521352890/ESSD/) can serve as an exemplar for OpenMetadata onboarding. The dataset comes with data dictionaries, and there is an Iceberg data pipeline notebook at https://github.com/os-climate/essd-ingest-pipeline/blob/iceberg/notebooks/osc-essd-ingest.ipynb.
As a data pipeline implementor, I want to work from a common template that describes the metadata needed to connect this dataset to a data catalog browser and to understand the various levels of data interoperability that can be achieved/advertised by properly instantiating all the metadata reasonable for this dataset:
@HeatherAck for visibility