pangeo-forge / user-stories

User stories to guide PF development

Link deployed feedstocks to dataset page #1

Open rabernat opened 2 years ago

rabernat commented 2 years ago

User Profile

As a recipe maintainer

User Action

I want to be able to see where the data produced by my deployed recipe has been deposited

User Goal

so that I can perform data-proximate analysis on the data.

Acceptance Criteria

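For a particular feedstock repo (e.g. https://github.com/pangeo-forge/WOA_1degree_monthly-feedstock), after the recipe has been run in production mode, the following should be possible:

User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset.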

Linked Issues

No response

andersy005 commented 2 years ago

User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset

Let's say I head over to https://pangeo-forge.org/dashboard/feedstock/3. Querying the API for recipe runs for this feedstock returns a bunch of recipe runs (some successfully completed, others failed). Which criteria are used to select the datasets that are currently listed on https://pangeo-forge.org/catalog? I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?

$ http -v https://api.pangeo-forge.org/feedstocks/3
GET /feedstocks/3 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: api.pangeo-forge.org
User-Agent: HTTPie/2.6.0

HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 7805
Content-Type: application/json
Date: Tue, 07 Jun 2022 22:28:04 GMT
Server: uvicorn
Via: 1.1 vegur

{
    "id": 3,
    "provider": "github",
    "recipe_runs": [
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
            "id": 30,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-19T21:38:22",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
            "id": 31,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-19T21:53:13",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "refs/tags/1.0",
            "id": 23,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-14T22:54:27",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "refs/tags/1.1",
            "id": 24,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-14T23:06:54",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "refs/tags/1.2",
            "id": 25,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-14T23:15:01",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "refs/tags/1.3",
            "id": 26,
            "is_test": false,
            "message": "{\"flow_id\": \"ebe8c22e-979b-41bb-9c25-d84901c680b0\"}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-14T23:28:02",
            "status": "in_progress",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-15T23:39:56",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "refs/heads/main",
            "id": 28,
            "is_test": false,
            "message": "{\"flow_id\": \"52b91300-1436-4e7b-882e-cf28da6f2335\", \"deployment_id\": 548187443}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-15T23:19:11",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-19T22:13:55",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
            "id": 32,
            "is_test": false,
            "message": "{\"flow_id\": \"ee424a76-90f6-4201-a94a-1fdb6b4e9de7\", \"deployment_id\": 550120209}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-19T22:02:41",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
            "id": 33,
            "is_test": false,
            "message": null,
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-20T00:51:58",
            "status": "queued",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-20T01:23:03",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "a26d7fc5fd6bee2a58b865e748d31c4b95dee60c",
            "id": 34,
            "is_test": false,
            "message": "{\"flow_id\": \"0563d546-0ffe-4b73-b038-36b40592680c\", \"deployment_id\": 550180443}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-20T00:56:31",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-20T18:10:14",
            "conclusion": "success",
            "dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-35/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "1f3c9a6b6cdca841f0cccf8827005db7be8fa61c",
            "id": 35,
            "is_test": true,
            "message": "{\"flow_id\": \"6b33c556-0770-4c68-accf-69e16ca217a1\"}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-20T17:35:38",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-20T21:07:32",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 36,
            "is_test": false,
            "message": "{\"flow_id\": \"34beb792-9b70-43f9-b4c5-aa2b9bee7172\", \"deployment_id\": 550680460}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-20T18:17:29",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-21T00:13:37",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 38,
            "is_test": false,
            "message": "{\"flow_id\": \"87f2ee15-e023-49d3-8f91-a2814cdf2f0d\", \"deployment_id\": 550813590}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-20T22:59:18",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 39,
            "is_test": false,
            "message": "{\"flow_id\": \"d00d8ad0-c390-4d44-934f-3bdd6af155bd\", \"deployment_id\": 551264615}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-21T15:56:51",
            "status": "in_progress",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": null,
            "conclusion": null,
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 40,
            "is_test": false,
            "message": "{\"flow_id\": \"3c630e92-a288-48a8-8b13-0ca74b435c03\", \"deployment_id\": 551283309}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-21T16:27:39",
            "status": "in_progress",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-21T18:02:15",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 41,
            "is_test": false,
            "message": "{\"flow_id\": \"61f63bf8-fe66-4ce7-93b1-f54712630544\", \"deployment_id\": 551299430}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-21T16:58:23",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-21T21:13:20",
            "conclusion": "failure",
            "dataset_public_url": null,
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "8768399762ab6a715b752b749e65a590761a7cd8",
            "id": 43,
            "is_test": false,
            "message": "{\"flow_id\": \"e47bcd0b-02c4-4698-8a10-a681e101df9c\", \"deployment_id\": 551358781}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-21T18:53:09",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-22T16:41:25",
            "conclusion": "success",
            "dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-47/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "1bc8d19b6299e727ca7a2e49a3dd038b9c4d45e6",
            "id": 47,
            "is_test": true,
            "message": "{\"flow_id\": \"e3c425be-fb6d-4fae-aaf8-4e0a1af22920\"}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-22T16:35:09",
            "status": "completed",
            "version": "0.0"
        },
        {
            "bakery_id": 1,
            "completed_at": "2022-04-22T22:54:21",
            "conclusion": "success",
            "dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
            "dataset_type": "zarr",
            "feedstock_id": 3,
            "head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
            "id": 48,
            "is_test": false,
            "message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
            "recipe_id": "noaa-coastwatch-geopolar-sst",
            "started_at": "2022-04-22T16:42:52",
            "status": "completed",
            "version": "0.0"
        }
    ],
    "spec": "pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock"
}

Also, looking at the catalog, do these datasets' paths follow a particular pattern? If so, is this documented somewhere?

https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-5/pangeo-forge/staged-recipes/noaa-oisst-avhrr-only.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-156/pangeo-forge/cmip6-feedstock/CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.so.gn.v20190429.zarr
https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/prod/recipe-run-8/pangeo-forge/staged-recipes/riops.zarr

Cc @cisaacstern

rabernat commented 2 years ago

I presume that some of these datasets are produced during test runs of a recipe but we only want to catalog datasets produced during the production phase, right?

Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.

It is a problem that the version attribute is not populated correctly.

Also, looking at the catalog, do these datasets' paths follow a particular pattern? if so, is this documented somewhere?

Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md

However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.

andersy005 commented 2 years ago

Correct. Furthermore, we only want to catalog SUCCESSFUL production runs.

Great... I had a quick look at https://github.com/pangeo-forge/pangeo-forge-orchestrator/blob/981a2bebdfe907ab4bf11393e0d1e3a27149f639/pangeo_forge_orchestrator/models.py#L96 to see what these different attributes are used for but I couldn't figure out which combination of attributes can be used to find out whether a recipe run is a production run.

{
    "bakery_id": 1,
    "completed_at": "2022-04-22T22:54:21",
    "conclusion": "success",
    "dataset_public_url": "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/noaa-coastwatch-geopolar-sst-feedstock/noaa-coastwatch-geopolar-sst.zarr",
    "dataset_type": "zarr",
    "feedstock_id": 3,
    "head_sha": "32ba8c8f6a639975a1061ece699ac2f053cb8d02",
    "id": 48,
    "is_test": false,
    "message": "{\"flow_id\": \"4083d3c0-679c-4dad-ae18-6a1b96b0076e\", \"deployment_id\": 551919825}",
    "recipe_id": "noaa-coastwatch-geopolar-sst",
    "started_at": "2022-04-22T16:42:52",
    "status": "completed",
    "version": "0.0"
}

cisaacstern commented 2 years ago

which combination of attributes can be used to find out whether a recipe run is a production run.

{
"is_test": false,
"status": "completed",
"conclusion": "success",
"dataset_public_url": "some valid url" (i.e., not null)
}

Comment on url formatting to follow...

andersy005 commented 2 years ago

Thank you both for your prompt responses...

cisaacstern commented 2 years ago

Yes, it is documented here: https://github.com/pangeo-forge/roadmap/blob/master/doc/adr/0003-standardize-storage-target-layout.md

However, as far as I can tell, we are not following our own specification. Charles can hopefully explain why. I think our thinking has evolved since we wrote ADR-03. My view is now that we should not rely on the dataset_public_url path at all to encode any important information.

AFAICT, we do follow this spec for production runs. It doesn't work for the test runs, because we need to be able to create arbitrary numbers of unique urls for test runs of a given recipe. (And the spec doesn't account for any type of "build number".)

So for test runs, such as those excerpted at the bottom of https://github.com/pangeo-forge/user-stories/issues/1#issuecomment-1149240938, we use an ad hoc format I made up, which includes the recipe run number.

But as Ryan said, the fact that the production runs follow this spec is sort of an anachronism: all of the relevant information is in the recipe run JSON object.

This code is in flux, but FWIW here is where these paths are defined as of today: https://github.com/pangeo-forge/registrar/blob/e501d20fd8c8614d39560af39c1957e209769abb/registrar/flow.py#L125-L141
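
To illustrate the point that the relevant information is in the recipe run JSON object rather than in the URL path, here is a minimal Python sketch that builds a catalog entry from the fields of a single run. The function name `catalog_entry` and the shape of the returned dictionary are hypothetical; only the field names come from the API response shown earlier in this thread.

```python
import json
from typing import Any


def catalog_entry(run: dict[str, Any]) -> dict[str, Any]:
    """Build a catalog entry from recipe run fields, never from the URL path.

    Hypothetical helper: the output keys are an assumption for illustration.
    """
    # `message` is either null or a JSON-encoded string such as
    # '{"flow_id": "...", "deployment_id": 123}'.
    details = json.loads(run["message"]) if run.get("message") else {}
    return {
        "recipe_run_id": run["id"],
        "feedstock_id": run["feedstock_id"],
        "recipe_id": run["recipe_id"],
        "dataset_type": run["dataset_type"],
        "head_sha": run["head_sha"],
        "flow_id": details.get("flow_id"),
        "deployment_id": details.get("deployment_id"),
        # The URL is treated as an opaque locator; nothing is parsed out of it.
        "url": run["dataset_public_url"],
    }
```

With this approach the storage layout can change without breaking anything that consumes the catalog.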

rabernat commented 2 years ago

Rather than having to filter on the front-end, should we add the ability to search and filter the recipe_runs on the back end? We could create an endpoint specifically for that.

Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.
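
As a rough idea of what back-end filtering could look like, here is a small, framework-free Python sketch. It assumes a hypothetical endpoint would map optional query parameters onto the keyword arguments below; `filter_recipe_runs` and its parameter names are illustrative and are not part of the existing orchestrator API.

```python
from typing import Any, Iterable, Optional

RecipeRun = dict[str, Any]


def filter_recipe_runs(
    runs: Iterable[RecipeRun],
    *,
    is_test: Optional[bool] = None,
    status: Optional[str] = None,
    conclusion: Optional[str] = None,
    has_dataset_url: Optional[bool] = None,
) -> list[RecipeRun]:
    """Return the runs that match every filter that is not None."""
    selected = []
    for run in runs:
        if is_test is not None and run.get("is_test") != is_test:
            continue
        if status is not None and run.get("status") != status:
            continue
        if conclusion is not None and run.get("conclusion") != conclusion:
            continue
        # A null or empty dataset_public_url counts as "no URL".
        if (
            has_dataset_url is not None
            and bool(run.get("dataset_public_url")) != has_dataset_url
        ):
            continue
        selected.append(run)
    return selected
```

Successful production runs would then be `filter_recipe_runs(runs, is_test=False, status="completed", conclusion="success", has_dataset_url=True)`. In a real deployment the same conditions would more likely be pushed into the database query so that thousands of runs never need to be loaded at once.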

andersy005 commented 2 years ago

Filtering on the front-end may work fine for now, but in the long run, we may have 1000s of recipe runs.

👍🏽 for filtering on the backend in the future... Right now, I am using a simple approach with the assumption that, given a feedstock URL, one is able to retrieve the entire list of recipe runs without needing to paginate or issue additional API requests.

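// Returns true when `url` parses as an absolute URL.
// null, undefined and malformed strings make the URL constructor throw.
export function isValidUrl(url) {
  try {
    new URL(url)
    return true
  } catch (_) {
    return false
  }
}

// A run is cataloged only if it is a completed, successful,
// non-test run that produced a dataset URL.
export function isSuccessfulProductionRun(run) {
  return (
    run.is_test === false &&
    run.status === 'completed' &&
    run.conclusion === 'success' &&
    isValidUrl(run.dataset_public_url)
  )
}

// Returns the dataset URLs of all successful production runs.
export function getDatasets(runs) {
  return runs
    .filter((run) => isSuccessfulProductionRun(run))
    .map((run) => run.dataset_public_url)
}

cisaacstern commented 2 years ago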

Yes, backend filtering would be good. And maybe we even want a Javascript client? xref https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/29

In the meantime, as it seems you've discovered, Anderson, the extended response of the /feedstocks/{int} endpoint includes the list of recipe runs associated with just that feedstock, which should be a relatively manageable number for some time to come. (As opposed to the general /recipe_runs endpoint, which is already starting to be rather long.)
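
For anyone who wants to try the per-feedstock endpoint outside the browser, a minimal Python sketch follows. It assumes the API root seen in the HTTPie output above is still reachable and that the response keeps the `recipe_runs` shape shown there; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# API root taken from the HTTPie example earlier in this thread.
API_ROOT = "https://api.pangeo-forge.org"


def fetch_feedstock(feedstock_id: int) -> dict:
    """GET /feedstocks/{id}; the response embeds that feedstock's recipe runs."""
    with urlopen(f"{API_ROOT}/feedstocks/{feedstock_id}") as response:
        return json.load(response)


def production_dataset_urls(feedstock: dict) -> list[str]:
    """Dataset URLs of completed, successful, non-test runs with a URL."""
    return [
        run["dataset_public_url"]
        for run in feedstock.get("recipe_runs", [])
        if run.get("is_test") is False
        and run.get("status") == "completed"
        and run.get("conclusion") == "success"
        and run.get("dataset_public_url")
    ]
```

Calling `production_dataset_urls(fetch_feedstock(3))` issues a single request, which matches the no-pagination assumption discussed above.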

andersy005 commented 2 years ago

User visits the dashboard page for the feedstock (e.g. pangeo-forge.org/dashboard/feedstock/6) and sees a clear link on this page pointing to a catalog page for the resulting dataset. The catalog page displays a URL and instructions for opening the dataset

I'm currently working on this in https://github.com/pangeo-forge/pangeo-forge.org/pull/93, and I have a couple of questions.

The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page

[Screenshot, 2022-06-10 1:01 PM: feedstock page with the button/link to the data catalog page]

On the data catalog page (e.g. https://pangeo-forge-3drjwzeqq-pangeo-forge.vercel.app/catalog/7), we get a list of datasets.

[Screenshot, 2022-06-10 2:20 PM: data catalog page listing the datasets]

For this feedstock shown above, we have a list of zarr stores.

  1. Should the instructions mention how to open each dataset one by one via xr.open_dataset()?
  2. Are there any valid assumptions about the list of datasets for a particular feedstock? For instance, are these datasets going to be compatible with each other, i.e., can we combine them via xr.combine_by_coords() (xr.open_mfdataset(...)), etc.?

cisaacstern commented 2 years ago

are there any valid assumptions about the list of datasets for a particular feedstock? For instance are these datasets going to be compatible with each other i.e. can we combine them via xr.combine_by_coords() (xr.open_mfdataset(....)), etc?

In general, if datasets are compatible with each other, we will have encouraged the recipe contributor to combine them into a single zarr store. So in fact, it should be safe to assume that if a feedstock has multiple zarr stores associated with it, that's because the data within them is non-compatible.

should the instructions mention how to open each dataset one by one via xr.open_dataset()?

Something like this could be nice, though that could certainly be a future PR.
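
As a starting point for such instructions, here is a minimal Python sketch. It assumes the stores are public Zarr stores served over HTTP and that xarray's Zarr engine (typically with zarr, fsspec and aiohttp installed) is available; `open_each` and `usage_snippet` are hypothetical helper names.

```python
from typing import Iterable

# Keyword arguments assumed for every store: the Zarr engine, with
# chunks={} so variables load lazily as dask arrays.
OPEN_KWARGS = {"engine": "zarr", "chunks": {}}


def open_each(urls: Iterable[str]) -> dict:
    """Open every store separately; stores are assumed NOT to be combinable."""
    import xarray as xr  # deferred so this module imports without xarray

    return {url: xr.open_dataset(url, **OPEN_KWARGS) for url in urls}


def usage_snippet(url: str) -> str:
    """Render the copy-paste instructions a catalog page could show for one store."""
    return (
        "import xarray as xr\n"
        f"ds = xr.open_dataset({url!r}, engine='zarr', chunks={{}})\n"
    )
```

Opening each store into its own entry, rather than calling xr.open_mfdataset() or xr.combine_by_coords(), follows from the point above that multiple stores on one feedstock usually hold incompatible data.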

rabernat commented 2 years ago

Thanks so much @andersy005 for your work on this important issue! 🚀

The feedstock page (e.g. https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/dashboard/feedstock/7) has a button/link to a data catalog page

Can this say something like "Datasets for this Feedstock", rather than "Data Catalog"? Also, I find that the color scheme of the button (green text on a black background) clashes a bit with the rest of the theme.

  1. should the instructions mention how to open each dataset one by one via xr.open_dataset()?

I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?

  2. For instance are these datasets going to be compatible with each other i.e. can we combine them...

No. Agree with what Charles said here.

In general, I'd like to see some UI improvements on a page like https://pangeo-forge-8ec7uuy0a-pangeo-forge.vercel.app/catalog/7

andersy005 commented 2 years ago

Thank you for the thorough feedback, @rabernat! I was planning to ping you and @cisaacstern to see what features you'd like to see on that page...

andersy005 commented 2 years ago

I believe we should try to provide some instructions, yes, but I'm not sure of the best UI for this. Any ideas?

👍 I'm still looking into some options and will post an update here later today or tomorrow morning.