Closed andersy005 closed 2 months ago
@jbusecke, i've created sub-directories within catalog/ to differentiate leap-ingested and leap-produced datasets.
currently, there are examples in catalog/leap-ingested, and i'm looking to plan out the catalog/leap-produced strategy. my idea is to populate this directory with samples from proto_feedstock. Since we're already creating catalog.yaml files for leap-produced feedstocks in their respective repos, perhaps we can streamline our data management by compiling a single file in data-management/catalog/leap-produced that contains links to the respective catalog.yaml files in repositories like proto_feedstock.
this approach could help us avoid unnecessary duplication of information. i would appreciate your thoughts on this proposal.
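A minimal sketch of what such a single index of links could look like, assuming each feedstock keeps its file at feedstock/catalog.yaml on a main branch (both are assumptions, as are the `FeedstockRef` and `build_index` names, which are hypothetical and not part of any existing tooling):

```python
from dataclasses import dataclass

# Base URL GitHub uses to serve raw file contents.
RAW_BASE = "https://raw.githubusercontent.com"


@dataclass(frozen=True)
class FeedstockRef:
    """Hypothetical pointer to one feedstock repository."""

    org: str
    repo: str
    branch: str = "main"  # assumption: default branch name
    path: str = "feedstock/catalog.yaml"  # assumption: file location in the repo

    def catalog_url(self) -> str:
        # Build the raw-content link to this feedstock's catalog.yaml.
        return f"{RAW_BASE}/{self.org}/{self.repo}/{self.branch}/{self.path}"


def build_index(refs):
    """Map each repo name to the link of its catalog.yaml."""
    return {ref.repo: ref.catalog_url() for ref in refs}


index = build_index([FeedstockRef("leap-stc", "proto_feedstock")])
```

The index only stores links, so the metadata itself stays in the feedstock repos and is not duplicated here.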
Thanks for working on this @andersy005! I am actually not sure we need to make the distinction of ingested/produced/curated at the catalog level at all. I am imagining a minimal feedstock repo that perhaps just contains feedstock/catalog.yaml and feedstock/meta.yaml at first (this should be enough to parse things into the catalog?). Later on these could serve to build pyramids from pre-existing stores, for example.
That way we would have a clear pattern: Every catalog tile is populated by a feedstock repository, no matter if the 'store path' points to a dataset on the LEAP storage or elsewhere.
@jbusecke, i've removed the leap-ingested vs leap-produced distinction.
> That way we would have a clear pattern: Every catalog tile is populated by a feedstock repository, no matter if the 'store path' points to a dataset on the LEAP storage or elsewhere.
👍🏽 i like this idea. let's use the proto_feedstock repo as a prototype for refining the schema for the catalog.yaml. i've suggested some changes here: https://github.com/leap-stc/proto_feedstock/pull/19#issuecomment-2057571107 and once we've settled on the final version of the catalog.yaml, we can go ahead and create feedstock repos for the existing datasets on the main branch. how does this sound? @norlandrhagen / @jbusecke
That's exactly what I had in mind @andersy005! Let me review the changes on the feedstock repo.
Ok proto_feedstock should be ready. Where would I 'register' the repo in here?
Got another feedstock set up, but we can add that once this is merged?
this PR introduces a new schema for the catalog. the new schema leverages Pangeo-Forge's existing metadata from each feedstock's meta.yaml. for datasets ingested outside the Pangeo-Forge workflow, it lets us specify the necessary metadata directly, so they integrate with the datasets generated by Pangeo-Forge. furthermore, the schema supports extending existing metadata: when additional metadata are defined, they are merged with the existing metadata at catalog build time.
Cc @jbusecke / @norlandrhagen / @katamartin
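The build-time merge described above could be sketched roughly as follows. This is not the PR's implementation: the `merge_metadata` helper and all field names are hypothetical, and the sketch assumes that additional metadata override existing values on conflict while nested mappings are merged key by key.

```python
from copy import deepcopy


def merge_metadata(base: dict, extra: dict) -> dict:
    """Return a new dict with `extra` merged into `base`.

    Assumption: on conflict the value from `extra` wins; nested dicts
    are merged recursively instead of being replaced wholesale.
    """
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical contents of a feedstock's meta.yaml (existing metadata).
meta = {
    "title": "Example dataset",
    "provenance": {"providers": ["example-provider"]},
}

# Hypothetical additional metadata defined in catalog.yaml.
catalog = {
    "tags": ["ocean"],
    "provenance": {"license": "CC-BY-4.0"},
}

entry = merge_metadata(meta, catalog)
```

Both inputs are left untouched, so the same meta.yaml content can be reused across several catalog builds.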