os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation
Apache License 2.0

Cleanup data and structure in various S3 buckets #68

Open caldeirav opened 2 years ago

caldeirav commented 2 years ago

We need to revamp our bucket / data structure to have a clearer and more systematic process:

redhat-osc-physical-landing-647521352890 should be used for source data to be loaded, with read-only access to the bucket for the whole development team. The data should be structured by source: SPGI, Urgentem, PUDL.

ocp-odh-data-bucket1-s3 should be used for all production data, i.e. Parquet files of processed source data (stored under directories named after the pipeline), plus one Trino catalogue.

ocp-odh-os-demo-s3 should be used for all development data, i.e. Parquet files of processed source data (stored under directories named after the pipeline), plus one Trino catalogue.
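As a rough illustration of the layout proposed above, here is a minimal Python sketch that builds object locations for each bucket. The bucket names come from this issue; the helper names, the `prod`/`dev` labels, and the `<pipeline>/<table>/<file>` key layout are assumptions made for the example, not an agreed convention.

```python
# Bucket names are taken from the proposal in this issue.
LANDING_BUCKET = "redhat-osc-physical-landing-647521352890"
PROCESSED_BUCKETS = {
    "prod": "ocp-odh-data-bucket1-s3",  # production data
    "dev": "ocp-odh-os-demo-s3",        # development data
}


def source_uri(source: str, filename: str) -> str:
    """Location of a raw file in the landing bucket, grouped by source."""
    return f"s3://{LANDING_BUCKET}/{source}/{filename}"


def processed_uri(env: str, pipeline: str, table: str, filename: str) -> str:
    """Location of a processed Parquet file, stored under the pipeline name.

    The <pipeline>/<table>/<file> layout is hypothetical.
    """
    if env not in PROCESSED_BUCKETS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"s3://{PROCESSED_BUCKETS[env]}/{pipeline}/{table}/{filename}"
```

For example, `source_uri("PUDL", "plants.csv")` points into the landing bucket under `PUDL/`, while `processed_uri("dev", "pudl_ingest", "plants", "part-0.parquet")` points into the development bucket under the (hypothetical) pipeline name `pudl_ingest`.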

Based on the above we require:

@MichaelTiemannOSC any other source data we need to move to redhat-osc-physical-landing-647521352890?

@erikerlandson please review the above proposed structure and confirm

MichaelTiemannOSC commented 2 years ago

I would add (and can do the uploads):

I will check with Hewson on the wisdom of also uploading GHG Emitters data (EPA) and other EPA datasets.

erikerlandson commented 2 years ago

To clarify: the following two buckets will each be mounted via a Trino catalog: ocp-odh-data-bucket1-s3 and ocp-odh-os-demo-s3

And the physical landing bucket will NOT be mounted to a trino catalog: redhat-osc-physical-landing-647521352890

caldeirav commented 2 years ago

> To clarify: the following two buckets will each be mounted via trino catalog: ocp-odh-data-bucket1-s3 ocp-odh-os-demo-s3
>
> And the physical landing bucket will NOT be mounted to a trino catalog: redhat-osc-physical-landing-647521352890

@erikerlandson this is absolutely correct. @MichaelTiemannOSC it is arguable whether the hand-curated corporate data should be managed in osc-physical-landing since this is actually not a source we can reconcile against. How is this data produced and how do we intend to "track" the management of data and changes?

MichaelTiemannOSC commented 2 years ago

The hand-curated data was created by Hewson and me to drive a demo of the corp data browser. In the fullness of time the corp data browser will be browsing actual corp data uploaded to the Data Commons, but until we have such, we need to use this hand-curated data, which is managed and updated for the browser's benefit solely.

caldeirav commented 2 years ago

@MichaelTiemannOSC in this case, for the hand-curated corporate data, my proposal would be to manage it directly in a dedicated Trino catalogue and, if possible, to manage the versioning of the data using code. As discussed today, we would be managing the source data bucket in append mode, i.e. we only write to add new data sets, with no modification allowed, whereas you need to be able to update this data.
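The append-mode rule described above can be sketched in Python as a write guard. This is only an illustration: `AppendOnlyStore` is a hypothetical in-memory stand-in for the source data bucket, not code from the project, and a real setup would enforce the rule through bucket permissions rather than in client code.

```python
class AppendOnlyStore:
    """In-memory stand-in for a bucket that accepts new objects only."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        """Add a new object; refuse to modify an existing one."""
        if key in self._objects:
            raise FileExistsError(f"{key} already exists; the store is append-only")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        """Return the stored bytes for a key."""
        return self._objects[key]

    def keys(self) -> list[str]:
        """List stored keys in sorted order."""
        return sorted(self._objects)
```

Under this rule, an updated data set would be written under a new key (for example a dated or versioned path) instead of replacing the earlier one, which is why data that needs in-place updates fits poorly in the landing bucket.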

MichaelTiemannOSC commented 2 years ago

I am not able to see any schema or tables in osc_datacommons_dev. I can see that osc_datacommons_dev is the name of a catalog, but the schemas in that catalog are empty and hence there are no tables. The trino-* notebooks need to be updated so they can dictate the pattern to follow for other notebooks.
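For anyone wanting to reproduce this check, here is a minimal Python sketch that lists the schemas and tables visible in a catalog. It assumes a DB-API style cursor (for example one obtained from the `trino` Python client via `trino.dbapi.connect(...).cursor()`); the helper name `list_tables` is hypothetical, and connection details are deliberately left out because they are not given in this thread.

```python
def list_tables(cursor, catalog: str) -> dict[str, list[str]]:
    """Map each visible schema in `catalog` to the tables visible in it.

    `cursor` is any DB-API cursor connected to Trino. Names are inserted
    into the SQL unquoted, so this assumes simple identifiers.
    """
    cursor.execute(f"SHOW SCHEMAS FROM {catalog}")
    schemas = [row[0] for row in cursor.fetchall()]

    tables: dict[str, list[str]] = {}
    for schema in schemas:
        if schema == "information_schema":
            continue  # skip Trino's built-in metadata schema
        cursor.execute(f"SHOW TABLES FROM {catalog}.{schema}")
        tables[schema] = [row[0] for row in cursor.fetchall()]
    return tables
```

Calling `list_tables(cursor, "osc_datacommons_dev")` would return schemas mapped to empty lists in the situation described above.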

erikerlandson commented 2 years ago

This should be working with the latest updates to the Trino rules.json: https://github.com/operate-first/apps/blob/master/odh-manifests/osc-cl1/trino/base/trino-config-secret.yaml#L103
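For context, Trino's file-based access control reads catalog rules from a JSON file, and a catalog that no rule grants to a user is hidden from that user. The sketch below builds a rules document of that general shape in Python. It is not the content of the linked file: the user pattern `.*`, the helper names, and the choice of access level are assumptions for illustration; only the catalog name `osc_datacommons_dev` comes from this thread.

```python
import json

# Access levels accepted by Trino's file-based catalog rules.
ALLOWED_LEVELS = {"all", "read-only", "none"}


def catalog_rule(catalog: str, allow: str = "all", user: str = ".*") -> dict:
    """Build one catalog rule; `user` and `catalog` are regex patterns."""
    if allow not in ALLOWED_LEVELS:
        raise ValueError(f"allow must be one of {sorted(ALLOWED_LEVELS)}")
    return {"user": user, "catalog": catalog, "allow": allow}


def build_rules(rules: list[dict]) -> str:
    """Serialize catalog rules into a rules.json style document."""
    return json.dumps({"catalogs": rules}, indent=2)


if __name__ == "__main__":
    # Hypothetical example: every user gets full access to the dev catalog.
    print(build_rules([catalog_rule("osc_datacommons_dev", allow="all")]))
```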

erikerlandson commented 2 years ago

xref: