Open caldeirav opened 2 years ago
I would add (and can do the uploads):
I will check with Hewson on the wisdom of also uploading GHG Emitters data (EPA) and other EPA datasets.
To clarify: the following two buckets will each be mounted via trino catalog:
ocp-odh-data-bucket1-s3
ocp-odh-os-demo-s3
And the physical landing bucket will NOT be mounted to a trino catalog: redhat-osc-physical-landing-647521352890
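For reference, "mounted via trino catalog" typically means a catalog properties file per bucket. A rough sketch of what such a file might contain, assuming the Hive connector; the metastore URI and endpoint below are illustrative, not the actual osc-cl1 config:

```properties
# Hypothetical catalog file, e.g. etc/catalog/osc_datacommons_dev.properties
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
# Point the connector at the bucket's S3 endpoint; credentials come from a secret
hive.s3.endpoint=https://s3.amazonaws.com
hive.s3.path-style-access=true
```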
@erikerlandson this is absolutely correct. @MichaelTiemannOSC it is arguable whether the hand-curated corporate data should be managed in osc-physical-landing since this is actually not a source we can reconcile against. How is this data produced and how do we intend to "track" the management of data and changes?
The hand-curated data was created by Hewson and me to drive a demo of the corp data browser. In the fullness of time the corp data browser will browse actual corporate data uploaded to the Data Commons, but until we have that, we need this hand-curated data, which is managed and updated solely for the browser's benefit.
@MichaelTiemannOSC in this case, for the hand-curated corporate data, my proposal would be to manage it directly in a dedicated Trino catalogue and, if possible, manage the versioning of the data in code. As discussed today, we would manage the source data bucket in append mode, i.e. we only write to add new data sets, with no modification allowed (noting that you have a need to update).
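A minimal sketch of what "append mode managed in code" could look like: a guard that refuses to overwrite an existing object key. The function names here are assumptions, not an existing API; in practice `key_exists` and `put_object` would wrap the S3 client (e.g. boto3 `head_object` / `put_object`):

```python
def append_only_put(key_exists, put_object, key, data):
    """Write `data` at `key` only if the key is absent (append-only semantics).

    key_exists: callable(key) -> bool, e.g. wrapping an S3 head_object call
    put_object: callable(key, data), e.g. wrapping an S3 put_object call
    """
    if key_exists(key):
        # Refuse modification: new data sets must land under new keys
        raise FileExistsError(f"append-only bucket: refusing to overwrite {key!r}")
    put_object(key, data)


# Usage with an in-memory stand-in for the bucket:
bucket = {}
append_only_put(bucket.__contains__, bucket.__setitem__, "SPGI/2021/esg.parquet", b"...")
```

Attempting the same key a second time raises instead of silently replacing the data, which gives us an audit point for the "how do we track changes" question above.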
I am not able to see any schemas or tables in osc_datacommons_dev. I can see that osc_datacommons_dev is the name of a catalog, but the schemas in that catalog are empty, so there are no tables. The trino-* notebooks need to be updated so they can establish the pattern for other notebooks to follow.
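For anyone else checking, the catalog contents can be inspected from any Trino client with something like the following (the schema name is illustrative):

```sql
SHOW SCHEMAS FROM osc_datacommons_dev;
-- then, for a given schema:
SHOW TABLES FROM osc_datacommons_dev.demo;
```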
This should be working with the latest updates to the trino rules.json:
https://github.com/operate-first/apps/blob/master/odh-manifests/osc-cl1/trino/base/trino-config-secret.yaml#L103
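For context, the access control in that secret follows Trino's file-based rules.json shape; a rule granting all users full access to the catalog would look roughly like the sketch below (the actual rules in the linked secret will differ):

```json
{
  "catalogs": [
    {
      "user": ".*",
      "catalog": "osc_datacommons_dev",
      "allow": "all"
    }
  ]
}
```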
We need to revamp our bucket / data structure to have a clearer and more systematic process:
redhat-osc-physical-landing-647521352890 should be used for source data to be loaded, with read-only access to the bucket for the entire development team. The data should be structured by source: SPGI, Urgentem, PUDL
ocp-odh-data-bucket1-s3 should be used for all production data, i.e. parquet files of processed data from source (stored under directories named after the pipeline), plus one Trino catalogue
ocp-odh-os-demo-s3 should be used for all development data, i.e. parquet files of processed data from source (stored under directories named after the pipeline), plus one Trino catalogue
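To keep the "directories with pipeline name" convention consistent between the production and development buckets, a small helper like this sketch could build the object keys (the pipeline and table names below are illustrative, not existing pipelines):

```python
from pathlib import PurePosixPath

def pipeline_output_key(bucket: str, pipeline: str, table: str) -> str:
    """Build the s3a URI for a pipeline's processed parquet output."""
    return f"s3a://{bucket}/{PurePosixPath(pipeline) / (table + '.parquet')}"

# e.g. production output for a hypothetical "pudl-ingest" pipeline:
uri = pipeline_output_key("ocp-odh-data-bucket1-s3", "pudl-ingest", "plants")
# -> "s3a://ocp-odh-data-bucket1-s3/pudl-ingest/plants.parquet"
```

Centralising the key layout in one function means the production and development pipelines cannot drift apart in how they name their output directories.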
Based on the above we require:
@MichaelTiemannOSC any other source data we need to move to redhat-osc-physical-landing-647521352890?
@erikerlandson please review the above proposed structure and confirm