rabernat opened 2 years ago
User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset
Let's say I head over to https://pangeo-forge.org/dashboard/feedstock/3. Querying the API for recipe runs for this feedstock returns a bunch of recipe runs (some successfully completed, others failed). Which criteria are used to filter the datasets that are currently listed on https://pangeo-forge.org/catalog? I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?
$ http -v https://api.pangeo-forge.org/feedstocks/3
GET /feedstocks/3 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: api.pangeo-forge.org
User-Agent: HTTPie/2.6.0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 7805
Content-Type: application/json
Date: Tue, 07 Jun 2022 22:28:04 GMT
Server: uvicorn
Via: 1.1 vegur
{
"id": 3,
"provider": "github",
"recipe_runs": [
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 30,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T21:38:22",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 31,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T21:53:13",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.0",
"id": 23,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T22:54:27",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.1",
"id": 24,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:06:54",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.2",
"id": 25,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:15:01",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.3",
"id": 26,
"is_test": false,
"message": "{\"flow_id\": \"ebe8c22e-979b-41bb-9c25-d84901c680b0\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:28:02",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-15T23:39:56",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/heads/main",
"id": 28,
"is_test": false,
"message": "{\"flow_id\": \"52b91300-1436-4e7b-882e-cf28da6f2335\", \"deployment_id\": 548187443}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-15T23:19:11",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-19T22:13:55",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 32,
"is_test": false,
"message": "{\"flow_id\": \"ee424a76-90f6-4201-a94a-1fdb6b4e9de7\", \"deployment_id\": 550120209}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T22:02:41",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 33,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T00:51:58",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T01:23:03",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 34,
"is_test": false,
"message": "{\"flow_id\": \"0563d546-0ffe-4b73-b038-36b40592680c\", \"deployment_id\": 550180443}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T00:56:31",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T18:10:14",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-35/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "1f3c9a6b6cdca841f0cccf8827005db7be8fa61c",
"id": 35,
"is_test": true,
"message": "{\"flow_id\": \"6b33c556-0770-4c68-accf-69e16ca217a1\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T17:35:38",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T21:07:32",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 36,
"is_test": false,
"message": "{\"flow_id\": \"34beb792-9b70-43f9-b4c5-aa2b9bee7172\", \"deployment_id\": 550680460}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T18:17:29",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T00:13:37",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 38,
"is_test": false,
"message": "{\"flow_id\": \"87f2ee15-e023-49d3-8f91-a2814cdf2f0d\", \"deployment_id\": 550813590}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T22:59:18",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 39,
"is_test": false,
"message": "{\"flow_id\": \"d00d8ad0-c390-4d44-934f-3bdd6af155bd\", \"deployment_id\": 551264615}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T15:56:51",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 40,
"is_test": false,
"message": "{\"flow_id\": \"3c630e92-a288-48a8-8b13-0ca74b435c03\", \"deployment_id\": 551283309}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T16:27:39",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T18:02:15",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 41,
"is_test": false,
"message": "{\"flow_id\": \"61f63bf8-fe66-4ce7-93b1-f54712630544\", \"deployment_id\": 551299430}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T16:58:23",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T21:13:20",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 43,
"is_test": false,
"message": "{\"flow_id\": \"e47bcd0b-02c4-4698-8a10-a681e101df9c\", \"deployment_id\": 551358781}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T18:53:09",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-22T16:41:25",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-47/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "1bc8d19b6299e727ca7a2e49a3dd038b9c4d45e6",
"id": 47,
"is_test": true,
"message": "{\"flow_id\": \"e3c425be-fb6d-4fae-aaf8-4e0a1af22920\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:35:09",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-22T22:54:21",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
"id": 48,
"is_test": false,
"message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:42:52",
"status": "completed",
"version": "0.0"
}
],
"spec": "pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock"
}
Also, looking at the catalog, do these datasets' paths follow a particular pattern? If so, is this documented somewhere?
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-5/pangeo-forge/staged-recipes/noaa-oisst-avhrr-only.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-156/pangeo-forge/cmip6-feedstock/CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.so.gn.v20190429.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-8/pangeo-forge/staged-recipes/riops.zarr
Cc @cisaacstern
I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?
Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.
It is a problem that the version attribute is not populated correctly.
Also, looking at the catalog, do these datasets' paths follow a particular pattern? If so, is this documented somewhere?
Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md
However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.
Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.
Great... I had a quick look at https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/981a2bebdfe907ab4bf11393e0d1e3a27149f639/pangeo_forge_orchestrator/models.py#L96 to see what these different attributes are used for but I couldn't figure out which combination of attributes can be used to find out whether a recipe run is a production run.
{
"bakery_id": 1,
"completed_at": "2022-04-22T22:54:21",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
"id": 48,
"is_test": false,
"message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:42:52",
"status": "completed",
"version": "0.0"
}
which combination of attributes can be used to find out whether a recipe run is a production run.
{
  "is_test": false,
  "status": "completed",
  "conclusion": "success",
  "dataset_public_url": "some valid url" (i.e., not null)
}
Comment on url formatting to follow...
Thank you both for your prompt responses...
Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md
However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.
AFAICT, we do follow this spec for production runs. It doesn't work for the test runs, because we need to be able to create arbitrary numbers of unique urls for test runs of a given recipe. (And the spec doesn't account for any type of "build number".)
So for test runs, such as those excerpted at the bottom of https://github.com/pangeo-forge/user-stories/issues/1#issuecomment-1149240938, we use an ad hoc format I made up, which includes the recipe run number.
But as Ryan said, the fact that the production runs follow this spec is sort of an anachronism: all of the relevant information is in the recipe run JSON object.
This code is in flux, but FWIW here is where these paths are defined as of today: https://github.com/pangeo-forge/registrar/blob/e501d20fd8c8614d39560af39c1957e209769abb/registrar/flow.py#L125-L141
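For what it's worth, the test-run URLs quoted earlier in this thread all appear to share one shape. The sketch below only reproduces a pattern inferred from those examples (the function and parameter names are made up for illustration, not part of any Pangeo Forge package), and per Ryan's point above this layout should not be relied on to carry information:

```javascript
// Hypothetical helper reproducing the apparent test-run layout:
//   {base}/prod/recipe-run-{recipeRunId}/{feedstockSpec}/{datasetName}.zarr
// Inferred from the example URLs in this thread; not an official API.
function buildTestRunUrl(base, recipeRunId, feedstockSpec, datasetName) {
  return `${base}/prod/recipe-run-${recipeRunId}/${feedstockSpec}/${datasetName}.zarr`
}

// Reconstructs the first example URL from earlier in the thread.
console.log(
  buildTestRunUrl(
    'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test',
    5,
    'pangeo-forge/staged-recipes',
    'noaa-oisst-avhrr-only'
  )
)
```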
Rather than having to filter on the front-end, should we add the ability to search and filter the recipe_runs on the back end? We could create an endpoint specifically for that.
Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.
Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.
👍🏽 for filtering on the backend in the future... right now, I am using a simple approach with the assumption that given a feedstock URL, one is able to retrieve the entire list of recipe runs without needing to paginate/issue additional API requests.
// Returns true if `url` parses as a valid URL (null/undefined fail the parse)
export function isValidUrl(url) {
  try {
    new URL(url)
    return true
  } catch (_) {
    return false
  }
}

// A run is cataloged only if it is a successful, completed, non-test run
// with a valid public dataset URL
export function isSuccessfulProductionRun(run) {
  return (
    run.is_test === false &&
    run.status === 'completed' &&
    run.conclusion === 'success' &&
    isValidUrl(run.dataset_public_url)
  )
}

// Map a feedstock's recipe runs to the public URLs of its cataloged datasets
export function getDatasets(runs) {
  return runs
    .filter((run) => isSuccessfulProductionRun(run))
    .map((run) => run.dataset_public_url)
}
Yes, backend filtering would be good. And maybe we even want a Javascript client? xref https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/29
In the meantime, as it seems you've discovered, Anderson, the extended response of the /feedstocks/{int} endpoint includes the list of recipe runs associated with just that feedstock, which should be a relatively manageable number for some time to come. (As opposed to the general /recipe_runs endpoint, which is already starting to be rather long.)
User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset
I'm currently working on this in https://github.com/pangeo-forge/pangeo-forge.org/pull/93, and I have a couple of questions.
The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page
On the data catalog page (e.g. https://pangeo-forge-3drjwzeqq-pangeo-forge.vercel.app/catalog/7), we get a list of datasets.
For this feedstock shown above, we have a list of zarr stores.
- should the instructions mention how to open each dataset one by one via xr.open_dataset()?
- are there any valid assumptions about the list of datasets for a particular feedstock? For instance, are these datasets going to be compatible with each other, i.e. can we combine them via xr.combine_by_coords() (xr.open_mfdataset(...)), etc?
In general, if datasets are compatible with each other, we will have encouraged the recipe contributor to combine them into a single zarr store. So in fact, it should be safe to assume that if a feedstock has multiple zarr stores associated with it, that's because the data within them is incompatible.
should the instructions mention how to open each dataset one by one via xr.open_dataset()?
Something like this could be nice, though that could certainly be a future PR.
Thanks so much @andersy005 for your work on this important issue! 🚀
The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page
Can this say something like "Datasets for this Feedstock", rather than "Data Catalog"? Also, I find the color scheme of the button (green text on black BG) a bit clashy with the rest of the theme
- should the instructions mention how to open each dataset one by one via xr.open_dataset()?
I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?
- For instance are these datasets going to be compatible with each other i.e. can we combine them...
No. Agree with what Charles said here.
In general, I'd like to see some UI improvements on a page like https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/catalog/7
- Name: CMIP6.DAMIP.NCC.NorESM2-LM.hist-aer.r1i1p1f1.Amon.pr.gn.v20190920, or woa18-1deg-monthly. I would like to see this as the identifier for the dataset. The URL itself is only relevant for people who want to actually open the dataset.
- Could we fetch the .zmetadata file and show some information about the dataset contents?

Thank you for the thorough feedback, @rabernat! I was planning to ping you and @cisaacstern to see what features you'd like to see on that page...
I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?
:+1: I'm still looking into some options and will post an update here later today or tomorrow morning
User Profile
As a recipe maintainer
User Action
I want to be able to see where the data produced by my deployed recipe has been deposited
User Goal
so that I can perform data-proximate analysis on the data.
Acceptance Criteria
For a particular feedstock repo (e.g. https://github.com/pangeo-forge/WOA_1degree_monthly-feedstock), after the recipe has been run in production mode, the following should be possible