rabernat opened 2 years ago
User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset
Let's say I head over to https://pangeo-forge.org/dashboard/feedstock/3. Querying the API for recipe runs for this feedstock returns a bunch of recipe runs (some successfully completed, others failed). Which criteria are used to filter the datasets that are currently listed on https://pangeo-forge.org/catalog? I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?
$ http -v https://api.pangeo-forge.org/feedstocks/3
GET /feedstocks/3 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: api.pangeo-forge.org
User-Agent: HTTPie/2.6.0
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 7805
Content-Type: application/json
Date: Tue, 07 Jun 2022 22:28:04 GMT
Server: uvicorn
Via: 1.1 vegur
{
"id": 3,
"provider": "github",
"recipe_runs": [
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 30,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T21:38:22",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 31,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T21:53:13",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.0",
"id": 23,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T22:54:27",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.1",
"id": 24,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:06:54",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.2",
"id": 25,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:15:01",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/tags/1.3",
"id": 26,
"is_test": false,
"message": "{\"flow_id\": \"ebe8c22e-979b-41bb-9c25-d84901c680b0\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-14T23:28:02",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-15T23:39:56",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "refs/heads/main",
"id": 28,
"is_test": false,
"message": "{\"flow_id\": \"52b91300-1436-4e7b-882e-cf28da6f2335\", \"deployment_id\": 548187443}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-15T23:19:11",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-19T22:13:55",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 32,
"is_test": false,
"message": "{\"flow_id\": \"ee424a76-90f6-4201-a94a-1fdb6b4e9de7\", \"deployment_id\": 550120209}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-19T22:02:41",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 33,
"is_test": false,
"message": null,
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T00:51:58",
"status": "queued",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T01:23:03",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
"id": 34,
"is_test": false,
"message": "{\"flow_id\": \"0563d546-0ffe-4b73-b038-36b40592680c\", \"deployment_id\": 550180443}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T00:56:31",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T18:10:14",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-35/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "1f3c9a6b6cdca841f0cccf8827005db7be8fa61c",
"id": 35,
"is_test": true,
"message": "{\"flow_id\": \"6b33c556-0770-4c68-accf-69e16ca217a1\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T17:35:38",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-20T21:07:32",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 36,
"is_test": false,
"message": "{\"flow_id\": \"34beb792-9b70-43f9-b4c5-aa2b9bee7172\", \"deployment_id\": 550680460}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T18:17:29",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T00:13:37",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 38,
"is_test": false,
"message": "{\"flow_id\": \"87f2ee15-e023-49d3-8f91-a2814cdf2f0d\", \"deployment_id\": 550813590}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-20T22:59:18",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 39,
"is_test": false,
"message": "{\"flow_id\": \"d00d8ad0-c390-4d44-934f-3bdd6af155bd\", \"deployment_id\": 551264615}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T15:56:51",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": null,
"conclusion": null,
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 40,
"is_test": false,
"message": "{\"flow_id\": \"3c630e92-a288-48a8-8b13-0ca74b435c03\", \"deployment_id\": 551283309}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T16:27:39",
"status": "in_progress",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T18:02:15",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 41,
"is_test": false,
"message": "{\"flow_id\": \"61f63bf8-fe66-4ce7-93b1-f54712630544\", \"deployment_id\": 551299430}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T16:58:23",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-21T21:13:20",
"conclusion": "failure",
"dataset_public_url": null,
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
"id": 43,
"is_test": false,
"message": "{\"flow_id\": \"e47bcd0b-02c4-4698-8a10-a681e101df9c\", \"deployment_id\": 551358781}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-21T18:53:09",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-22T16:41:25",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-47/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "1bc8d19b6299e727ca7a2e49a3dd038b9c4d45e6",
"id": 47,
"is_test": true,
"message": "{\"flow_id\": \"e3c425be-fb6d-4fae-aaf8-4e0a1af22920\"}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:35:09",
"status": "completed",
"version": "0.0"
},
{
"bakery_id": 1,
"completed_at": "2022-04-22T22:54:21",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
"id": 48,
"is_test": false,
"message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:42:52",
"status": "completed",
"version": "0.0"
}
],
"spec": "pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock"
}
Also, looking at the catalog, do these datasets' paths follow a particular pattern? If so, is this documented somewhere?
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-5/pangeo-forge/staged-recipes/noaa-oisst-avhrr-only.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-156/pangeo-forge/cmip6-feedstock/CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.so.gn.v20190429.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-8/pangeo-forge/staged-recipes/riops.zarr
Cc @cisaacstern
I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?
Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.
It is a problem that the version attribute is not populated correctly.
Also, looking at the catalog, do these datasets' paths follow a particular pattern? If so, is this documented somewhere?
Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md
However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.
Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.
Great... I had a quick look at https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/981a2bebdfe907ab4bf11393e0d1e3a27149f639/pangeo_forge_orchestrator/models.py#L96 to see what these different attributes are used for but I couldn't figure out which combination of attributes can be used to find out whether a recipe run is a production run.
{
"bakery_id": 1,
"completed_at": "2022-04-22T22:54:21",
"conclusion": "success",
"dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
"dataset_type": "zarr",
"feedstock_id": 3,
"head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
"id": 48,
"is_test": false,
"message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
"recipe_id": "noaa-coastwatch-geopolar-sst",
"started_at": "2022-04-22T16:42:52",
"status": "completed",
"version": "0.0"
}
which combination of attributes can be used to find out whether a recipe run is a production run.
{
  "is_test": false,
  "status": "completed",
  "conclusion": "success",
  "dataset_public_url": "some valid url" (i.e., not null)
}
Comment on url formatting to follow...
Thank you both for your prompt responses...
Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md
However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.
AFAICT, we do follow this spec for production runs. It doesn't work for the test runs, because we need to be able to create arbitrary numbers of unique urls for test runs of a given recipe. (And the spec doesn't account for any type of "build number".)
So for test runs, such as those excerpted at the bottom of https://github.com/pangeo-forge/user-stories/issues/1#issuecomment-1149240938, we use an ad hoc format I made up, which includes the recipe run number.
But as Ryan said, the fact that the production runs follow this spec is sort of an anachronism: all of the relevant information is in the recipe run JSON object.
This code is in flux, but FWIW here is where these paths are defined as of today: https://github.com/pangeo-forge/registrar/blob/e501d20fd8c8614d39560af39c1957e209769abb/registrar/flow.py#L125-L141
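For what it's worth, the test-run URLs quoted earlier in this thread all appear to share one shape. The sketch below only reproduces a pattern inferred from those examples (the function and parameter names are made up for illustration, not part of any Pangeo Forge package), and per Ryan's point above this layout should not be relied on to carry information:

```javascript
// Hypothetical helper reproducing the apparent test-run layout:
//   {base}/prod/recipe-run-{recipeRunId}/{feedstockSpec}/{datasetName}.zarr
// Inferred from the example URLs in this thread; not an official API.
function buildTestRunUrl(base, recipeRunId, feedstockSpec, datasetName) {
  return `${base}/prod/recipe-run-${recipeRunId}/${feedstockSpec}/${datasetName}.zarr`
}

// Reconstructs the first example URL from earlier in the thread.
console.log(
  buildTestRunUrl(
    'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test',
    5,
    'pangeo-forge/staged-recipes',
    'noaa-oisst-avhrr-only'
  )
)
```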
Rather than having to filter on the front-end, should we add the ability to search and filter the recipe_runs on the back end? We could create an endpoint specifically for that.
Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.
Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.
👍🏽 for filtering on the backend in the future... right now, I am using a simple approach with the assumption that given a feedstock URL, one is able to retrieve the entire list of recipe runs without needing to paginate/issue additional API requests.
// Returns true if `url` parses as a valid URL (null/undefined fail the parse)
export function isValidUrl(url) {
  try {
    new URL(url)
    return true
  } catch (_) {
    return false
  }
}

// A run is cataloged only if it is a successful, completed, non-test run
// with a valid public dataset URL
export function isSuccessfulProductionRun(run) {
  return (
    run.is_test === false &&
    run.status === 'completed' &&
    run.conclusion === 'success' &&
    isValidUrl(run.dataset_public_url)
  )
}

// Map a feedstock's recipe runs to the public URLs of its cataloged datasets
export function getDatasets(runs) {
  return runs
    .filter((run) => isSuccessfulProductionRun(run))
    .map((run) => run.dataset_public_url)
}
Yes, backend filtering would be good. And maybe we even want a Javascript client? xref https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/29
In the meantime, as it seems you've discovered, Anderson, the extended response of the /feedstocks/{int} endpoint includes the list of recipe runs associated with just that feedstock, which should be a relatively manageable number for some time to come. (As opposed to the general /recipe_runs endpoint, which is already starting to be rather long.)
User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset
I'm currently working on this in https://github.com/pangeo-forge/pangeo-forge.org/pull/93, and I have a couple of questions.
The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page
On the data catalog page (e.g. https://pangeo-forge-3drjwzeqq-pangeo-forge.vercel.app/catalog/7), we get a list of datasets.
For this feedstock shown above, we have a list of zarr stores.
- should the instructions mention how to open each dataset one by one via xr.open_dataset()?
- are there any valid assumptions about the list of datasets for a particular feedstock? For instance, are these datasets going to be compatible with each other, i.e. can we combine them via xr.combine_by_coords() (xr.open_mfdataset(...)), etc?
In general, if datasets are compatible with each other, we will have encouraged the recipe contributor to combine them into a single zarr store. So in fact, it should be safe to assume that if a feedstock has multiple zarr stores associated with it, that's because the data within them is incompatible.
should the instructions mention how to open each dataset one by one via xr.open_dataset()?
Something like this could be nice, though that could certainly be a future PR.
Thanks so much @andersy005 for your work on this important issue! 🚀
The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page
Can this say something like "Datasets for this Feedstock", rather than "Data Catalog"? Also, I find the color scheme of the button (green text on black BG) a bit clashy with the rest of the theme
- should the instructions mention how to open each dataset one by one via xr.open_dataset()?
I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?
- For instance are these datasets going to be compatible with each other i.e. can we combine them...
No. Agree with what Charles said here.
In general, I'd like to see some UI improvements on a page like https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/catalog/7
- Name: CMIP6.DAMIP.NCC.NorESM2-LM.hist-aer.r1i1p1f1.Amon.pr.gn.v20190920, or woa18-1deg-monthly. I would like to see this as the identifier for the dataset. The URL itself is only relevant for people who want to actually open the dataset.
- Could we fetch the .zmetadata file and show some information about the dataset contents?

Thank you for the thorough feedback, @rabernat! I was planning to ping you and @cisaacstern to see what features you'd like to see on that page...
I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?
:+1: I'm still looking into some options and will post an update here later today or tomorrow morning
User Profile
As a recipe maintainer
User Action
I want to be able to see where the data produced by my deployed recipe has been deposited
User Goal
so that I can perform data-proximate analysis on the data.
Acceptance Criteria
For a particular feedstock repo (e.g. https://github.com/pangeo-forge/WOA_1degree_monthly-feedstock), after the recipe has been run in production mode, the following should be possible