leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

add revamped catalog #98

Closed by andersy005 2 months ago

andersy005 commented 2 months ago

this PR introduces a new schema for the catalog. the new schema reuses the metadata Pangeo-Forge already provides in each feedstock's meta.yaml. for datasets ingested outside the Pangeo-Forge workflow, it lets us specify the necessary metadata directly, so they integrate with the Pangeo-Forge generated datasets. Furthermore, this schema supports extending existing metadata: when additional metadata are defined, they are merged with the existing metadata at catalog build time.
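To illustrate the build-time merge described above, here is a minimal sketch. The merge rules are an assumption (nested mappings are merged recursively and the additional metadata wins on conflicts); the PR text does not spell them out. The function name `merge_metadata` and all field names are hypothetical.

```python
from copy import deepcopy


def merge_metadata(base: dict, extra: dict) -> dict:
    """Return a new dict with `extra` merged into `base`.

    Assumed semantics (not confirmed by the PR): nested dicts are merged
    recursively, and values from `extra` override values from `base`.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = merge_metadata(merged[key], value)
        else:
            # Otherwise the additional metadata replaces the existing value.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical metadata as it might be parsed from a feedstock's meta.yaml.
feedstock_meta = {
    "title": "Example dataset",
    "provenance": {"license": "CC-BY-4.0"},
}

# Hypothetical additional metadata defined on the catalog side.
catalog_extra = {
    "tags": ["ocean"],
    "provenance": {"maintainer": "someone"},
}

merged = merge_metadata(feedstock_meta, catalog_extra)
```

Because the function works on a deep copy, the metadata read from the feedstock stays untouched and the merge can be re-run on every catalog build.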

Cc @jbusecke / @norlandrhagen / @katamartin

andersy005 commented 2 months ago

@jbusecke, i've created sub-directories within catalog/ to differentiate leap-ingested and leap-produced datasets.

currently, there are examples in catalog/leap-ingested, and i'm looking to plan out the catalog/leap-produced strategy. my idea is to populate this directory with samples from proto_feedstock. Since we're already creating catalog.yaml files for leap-produced feedstocks in their respective repos, perhaps we can streamline our data management by compiling a single file in data-management/catalog/leap-produced that contains links to the respective catalog.yaml files in repositories like proto-feedstock.

this approach could help us avoid unnecessary duplication of information. i would appreciate your thoughts on this proposal.

jbusecke commented 2 months ago

Thanks for working on this @andersy005! I am actually not sure we need to make the distinction of ingested/produced/curated at the catalog level at all. I am imagining a minimal feedstock repo that perhaps just contains feedstock/catalog.yaml and feedstock/meta.yaml at first (this should be enough to parse things into the catalog?). Later on these could serve to build pyramids from pre-existing stores for example.

That way we would have a clear pattern: Every catalog tile is populated by a feedstock repository, no matter if the 'store path' points to a dataset on the LEAP storage or elsewhere.
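A rough sketch of that pattern, under stated assumptions: each feedstock repo is identified by org and repo name, its files live under feedstock/, and they are fetched from the default branch via GitHub's raw-content URLs. The `Feedstock` class, the `tile_sources` helper, and the `main` branch name are hypothetical and not part of this PR.

```python
from dataclasses import dataclass

RAW_BASE = "https://raw.githubusercontent.com"


@dataclass(frozen=True)
class Feedstock:
    """A feedstock repository that populates one catalog tile (hypothetical)."""

    org: str
    repo: str
    branch: str = "main"  # assumption: files are read from the main branch

    def file_url(self, name: str) -> str:
        # Build the raw-content URL for a file inside the feedstock/ directory.
        return f"{RAW_BASE}/{self.org}/{self.repo}/{self.branch}/feedstock/{name}"


def tile_sources(feedstocks: list[Feedstock]) -> dict[str, dict[str, str]]:
    """Map each feedstock repo to the two files a catalog tile would be built from."""
    return {
        fs.repo: {
            "catalog": fs.file_url("catalog.yaml"),
            "meta": fs.file_url("meta.yaml"),
        }
        for fs in feedstocks
    }


sources = tile_sources([Feedstock("leap-stc", "proto_feedstock")])
```

The store path itself does not appear here; it would live inside the feedstock's own files, so the same lookup works whether the data sits on LEAP storage or elsewhere.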

andersy005 commented 2 months ago

@jbusecke, i've removed the leap-ingested vs leap-produced distinction.

> That way we would have a clear pattern: Every catalog tile is populated by a feedstock repository, no matter if the 'store path' points to a dataset on the LEAP storage or elsewhere.

👍🏽 i like this idea. let's use the proto-feedstock repo as a prototype for refining the schema for the catalog.yaml. i've suggested some changes here: https://github.com/leap-stc/proto_feedstock/pull/19#issuecomment-2057571107

and once we've settled on the final version of the catalog.yaml, we can go ahead and create feedstock repos for the existing datasets on the main branch. how does this sound? @norlandrhagen / @jbusecke

jbusecke commented 2 months ago

That's exactly what I had in mind @andersy005! Let me review the changes on the feedstock repo.

jbusecke commented 2 months ago

Ok proto_feedstock should be ready. Where would I 'register' the repo in here?

jbusecke commented 2 months ago

Got another feedstock set up, but we can add that once this is merged?