pangeo-forge / roadmap

Pangeo Forge public roadmap
Creative Commons Attribution 4.0 International

Need for common vocabulary/visibility of work related to high-level concepts #9

Open ciaransweet opened 3 years ago

ciaransweet commented 3 years ago

TL;DR:

We need to ensure our vocabulary is consistent throughout projects and that all planned/current work is visible. This is to ensure that development is done towards the right issues and people on-boarding onto any pangeo-forge projects can easily identify the current state of work.

The long bit, time to read

Hey folks, please let me know if this is better placed somewhere else. Just doing a little dump of my thoughts after going through the repos and project boards.

I've gone through the following: pangeo-forge/pangeo-forge, pangeo-forge/roadmap, pangeo-forge/staged-recipes, and https://pangeo-forge.readthedocs.io/en/latest/. Whilst on-boarding and getting my head around pangeo-forge I've noticed a pretty big gap/discrepancy in how we're talking about Recipes as a whole.

This gap (we'll call it that) is mainly down to what a Recipe actually is, I'll try to outline where I see the confusion:

Docs

The documentation says that a Recipe defines how to transform data in one format / location into another format / location - which makes complete sense šŸ’Æ

What currently confuses the picture (for me at least) is going to Recipes and seeing that a Recipe is defined by instantiating a class that inherits from pangeo_forge.recipe.BaseRecipe (which still makes sense), but the documentation in Recipe Tutorials is more 'how to use a Recipe' than how to write one. I think a first 'quick win' could be to rename this to 'Using a Recipe'.

I should note, Recipe Execution covers the 'Using a Recipe', so maybe the contents of the former, would be better in here?

I think the most confusion for me currently comes from Contributing a Recipe, which mentions Pipelines (which aren't currently mentioned on the actual Recipe pages). My understanding from the meetings I've been in and what I've seen of the refactoring etc. leads me to think that the idea of a Pipeline is now not necessarily related to a Recipe; a Recipe should just relate to taking an input, transforming it in some manner, then serialising it somewhere else. I've assumed that a Recipe, once written, shouldn't need knowledge of Prefect etc. for contribution to pangeo-forge.

Whilst I appreciate that Recipes does clearly state The Recipe API is still in flux and may change. Make sure the version of the documentation you are reading matches your installed version of pangeo_forge. - I think this should be on both the pangeo-forge repo and the staged-recipes repo, as I think folks might be implementing the wrong/an old solution?

Repos

Staged Recipes looks like it could do with some TLC. I'm sure this is known, but it should probably reside in an Issue, which I've not seen yet; it might exist, I just didn't notice it on the Project Boards I've looked at. It still mentions making Pipelines (what are those?) and also introduces a meta.yaml file, which is new to me! (If I'm playing devil's advocate as someone who's just going by the documentation.)

Not to beat a dead šŸ“, but example/pipeline.py is a great example of where my confusion comes in. Nowhere does it mention a Recipe or something that looks like it implements pangeo_forge.recipe.BaseRecipe; to me (if I were looking at contributing) this causes a real blocker, as I don't know which Recipe is the source of truth.

The BaseRecipe class is where I see the source of truth for what a Recipe currently is, NetCDFtoZarrSequentialRecipe is a clear example of the implementation of this and I'd assume this is what any contributed Recipe would look like. If that's the case, I think we need to make it immediately obvious that this area is in flux, even if it's a 'Don't try to contribute yet, we're just lining up stuff'.

I noticed (and sorry for the name drop šŸ˜…) that @davidbrochart has started working on a GPM IMERG Recipe - This, to me, is not a Recipe; it's a script/orchestration of an already available Recipe. This seems like it falls into another category of contributions/work we need.

Concluding thoughts

Firstly, a disclaimer - I'm pretty damn new to pangeo-forge so might just be oblivious to something that's key šŸ¤£. This is also by no means a rant/attack on the work done so far! I'm super excited to see where this goes! ā¤ļø

To wrap up, I think that we need to:

Please do interrogate this, tell me I'm wrong etc. Looking forward to what folks think!

cc @pangeo-forge/dev-team

rabernat commented 3 years ago

Thanks for this useful feedback @ciaranevans. I agree things are a bit of a mess right now. I believe that most of the confusion comes from the fact that the recent massive refactoring of this project (https://github.com/pangeo-forge/pangeo-forge/pull/27) has still not propagated throughout all of the related repos and documentation. That refactor changed many of the terms / definitions. I agree that clarifying these should be a top priority.

I think the most confusion for me currently comes from Contributing a Recipe which mentions Pipelines (which aren't currently mentioned on the actual Recipe pages).

This entire page was written before the refactor and is now almost totally obsolete.

that @davidbrochart has started working on a GPM IMERG Recipe. - This to me, is not a Recipe, it's a script/orchestration of an already available Recipe.

In analogy with Conda Forge, we really imagine that the recipe is the specific thing that applies to a specific dataset. The NetCDFtoZarrSequentialRecipe class can be used to instantiate dozens of distinct recipes.

So would it be helpful if we defined:

- recipe - a specification of how to transform a specific dataset in one format / location into another format / location
- Recipe Class - a general template for transforming a type of data in one format / location into another format / location

?

They would be related as follows: all recipes are instances of Recipe Classes.

I am definitely open to different terminology that you think would be less confusing.

rabernat commented 3 years ago

p.s. "pipeline" is an internal concept--a technical detail about how recipes describe themselves. I think it should not really be part of the user-facing documentation, but it is important for developers to understand.

ciaransweet commented 3 years ago

Hey @rabernat

The NetCDFtoZarrSequentialRecipe class can be used to instantiate dozens of distinct recipes.

So would it be helpful if we defined:

recipe - a specification of how to transform a specific dataset in one format / location into another format / location

Recipe Class - a general template for transforming a type of data in one format / location into another format / location

I think this is where we need to look into our terminology, I think Recipe is too wide a catch-all. If we're going with the culinary theme (and forgive me for using Wikipedia šŸ¤£ ) then maybe we make something like NetCDFtoZarrSequentialRecipe actually a 'procedure', see Components of modern recipes. I don't think 'utensil' is quite the right word.

In the example of @davidbrochart's PR, I'd say then that his recipe.py is a Recipe, what it contains is actually the Procedure (maybe we can call them transformers? - not very culinary) which is just a step in the Recipe.

all recipes are instances of Recipe Classes

Doesn't work for things that are 'scripts' that do many things and use one of the Recipe classes.

In an OOP world, your recipe would be the 'script' part; you're then using the NetCDFtoZarrProcedure (which is generic in that it just takes an input, regardless of the dataset, and puts it in the output location you specify). Your NetCDFtoZarrProcedure would be an implementation of your Procedure interface (which is currently pangeo_forge.recipe.BaseRecipe) in my eyes...

I think.. šŸ˜…

rabernat commented 3 years ago

Please do not draw any conclusions from @davidbrochart's recipe. It does not accurately represent our vision for how this will work. It mixes together the recipe itself and its execution. It was written by David without really discussing or consulting with the broader group. Going forward, we need to work together to define how recipes will be specified, and this discussion is useful for that. (See also https://github.com/pangeo-forge/pangeo-forge/pull/71).

Doesn't work for things that are 'scripts' that do many things and use one of the Recipe classes.

We don't want to be maintaining "scripts." We want to be maintaining recipes. The idea is that the instantiated recipe object contains everything needed to produce a dataset inside itself. No extra "script" should be necessary. However, some code may need to be written to instantiate the recipe. I imagine that the contents of a recipe.py file for a specific recipe would look something like this

url_list = ['http://foo.com/bar1.nc', 'http://foo.com/bar2.nc']

def my_preprocessor(ds):
    return ds**2

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=url_list,
    preprocess_chunks=my_preprocessor
)

This is the code that would actually live in a recipe repo (e.g. pangeo-forge/foo-recipe). The bakery would then just import the recipe object from this file, assign the appropriate bakery-specific storage targets, and execute it.

Does that make sense to you?

Regarding the terminology, have you looked into the usage of "recipe" in conda-forge? That is what we are basing things on here: https://conda-forge.org/docs/maintainer/adding_pkgs.html?highlight=recipe

martindurant commented 3 years ago

We don't want to be maintaining "scripts." We want to be maintaining recipes. The idea is that the instantiated recipe object contains everything needed to produce a dataset inside itself.

Indeed, in conda-forge, usually everything is contained in a single YAML file, including metadata about the repo and build instructions. In their case, the recipe and that file are mostly synonymous.

Even the little example above could be more declarative, but I suppose the best description of code (such as the preprocess function) really is code.

rabernat commented 3 years ago

Good points, Martin. Where there is no custom code involved, we could easily implement a yaml syntax and translator that would create the recipe, e.g.

recipe:
  class: NetCDFtoZarrSequentialRecipe
  input_urls:
    - 'http://foo.com/bar1.nc'
    - 'http://foo.com/bar2.nc'

I had imagined eventually supporting this, for people who don't want to write python.
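Such a translator could be quite small. Here is a minimal sketch, assuming the YAML has already been parsed into a dict (e.g. with PyYAML) and that recipe classes are looked up in a simple name registry; the registry and the stub recipe class below are illustrative, not the real pangeo-forge API:

```python
# Sketch of a YAML-to-recipe translator. The registry and the stub
# recipe class are hypothetical stand-ins for the pangeo_forge API.

class NetCDFtoZarrSequentialRecipe:
    """Stub standing in for pangeo_forge.recipe.NetCDFtoZarrSequentialRecipe."""
    def __init__(self, input_urls):
        self.input_urls = input_urls

# Map the `class:` name in the YAML to an actual recipe class.
RECIPE_CLASSES = {"NetCDFtoZarrSequentialRecipe": NetCDFtoZarrSequentialRecipe}

def recipe_from_spec(spec):
    """Instantiate a recipe from a parsed YAML dict like
    {"recipe": {"class": "...", "input_urls": [...]}}."""
    kwargs = dict(spec["recipe"])
    cls = RECIPE_CLASSES[kwargs.pop("class")]
    return cls(**kwargs)

spec = {
    "recipe": {
        "class": "NetCDFtoZarrSequentialRecipe",
        "input_urls": ["http://foo.com/bar1.nc", "http://foo.com/bar2.nc"],
    }
}
recipe = recipe_from_spec(spec)
```

All remaining keys under `recipe:` simply become constructor keyword arguments, so the YAML schema stays in lockstep with the recipe class signatures.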

ciaransweet commented 3 years ago

This is the code that would actually live in a recipe repe (e.g. pangeo-forge/foo-recipe). The bakery would then just import the recipe object from this file, assign the appropriate bakery-specific storage targets, and execute it.

Does that make sense to you?

Yeah, that makes more sense to me in how they're used (I guess another ticket to raise, to point this intention out)

The statement that the recipe.py must have a recipe object of a type based on BaseRecipe declared globally is much more succinct and obvious.

I also like the above yaml syntax, if it's a straight in/out job

Thanks for the inputs @rabernat @martindurant

rabernat commented 3 years ago

Really helpful to have your feedback. This is all super useful to talk through.

The next step is to get this all written down in https://github.com/pangeo-forge/pangeo-forge/pull/71

ciaransweet commented 3 years ago

Hopefully I've not just missed the train overall and gone on a rant šŸ˜…

ciaransweet commented 3 years ago

@rabernat just so I can šŸ¤“ -up on the project some more, is there a good place to look into how those recipe.py files will then be used to actually get the executions running?

From a pangeo-forge sense, rather than locally invoking it

rabernat commented 3 years ago

is there a good place to look into how those recipe.py files will then be used to actually get the executions running?

Nope! We still have to sort that part out, and it is quite a big task. However, I imagine it would look roughly like this. This code would run inside a github workflow.

from recipe_file import recipe

# some parts of the recipe need to be customized by the specific bakery where it will run
# this part needs to be figured out more
bakery = determine_bakery(recipe)
recipe.target = bakery.target
recipe.input_cache = bakery.input_cache

# convert recipe to a prefect flow
from pangeo_forge.executors import PrefectPipelineExecutor
executor = PrefectPipelineExecutor()  # will need more config
plan = executor.pipelines_to_plan(recipe.to_pipelines())

# we now have a prefect flow (plan)
# next step is to submit it to prefect cloud and make sure it gets routed to the correct bakery

Some of that could go inside a CLI (see https://github.com/pangeo-forge/pangeo-forge/issues/43).

ciaransweet commented 3 years ago

Cool thanks, makes sense to me

ciaransweet commented 3 years ago

(Sorry for question upon question) - Do we have a test suite/expectation of tests for folks' recipes?

rabernat commented 3 years ago

Do we have a test suite/expectation of tests for folks' recipes?

Could you elaborate what you mean here? The recipe classes are tested in pangeo-forge: https://github.com/pangeo-forge/pangeo-forge/blob/master/tests/test_recipe.py

Are you talking about validation of contributed recipes?

rabernat commented 3 years ago

I'd also like to note that the recipe repo may need an additional metadata file, besides the recipe itself. This would specify things like the maintainer(s), data license, target bakery, etc.

ciaransweet commented 3 years ago

I am, yes. I.e. if someone wants to contribute a Recipe, what is the expectation on that contribution in terms of testing?

Do we expect it to come with unit tests?

rabernat commented 3 years ago

No, I don't think we should require users to write customized tests for their recipe. Because the recipes all follow a fixed pattern, it should be possible to write a generic validator that is run as part of the recipe contribution process.
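A generic validator for the "recipe.py must export a `recipe` object" convention could be very small. A sketch, with a stubbed BaseRecipe and a module built in-memory for illustration (in the real contribution workflow the validator would import the contributor's recipe.py):

```python
import types

class BaseRecipe:
    """Stub standing in for pangeo_forge.recipe.BaseRecipe."""

def validate_recipe_module(module):
    """Generic check run during recipe contribution: the module must
    define a global `recipe` object whose type derives from BaseRecipe."""
    recipe = getattr(module, "recipe", None)
    if recipe is None:
        return False, "recipe.py must define a global `recipe` object"
    if not isinstance(recipe, BaseRecipe):
        return False, "`recipe` must be an instance of a BaseRecipe subclass"
    return True, "ok"

# Simulate a contributed recipe.py that follows the convention.
good = types.ModuleType("recipe_file")
good.recipe = type("MyRecipe", (BaseRecipe,), {})()
ok, msg = validate_recipe_module(good)
```

Because every recipe follows the same fixed pattern, this one check can run in CI against any contribution with no per-recipe test code.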

ciaransweet commented 3 years ago

Interesting.

I feel like we want to ensure the data out is what's expected... especially for custom recipes? Is that what your generic validator would do?

For example, if I had dataset X and wanted to generate COGs (with specified parameters) from it, how do we ensure that we have a valid dataset output before we potentially burn a lot of šŸ’° on a recipe that isn't correct?

ciaransweet commented 3 years ago

It might 'work' functionally, but it might just be making junk (which is definitely a possibility)

rabernat commented 3 years ago

I feel like we want to ensure the data out is what's expected... especially for custom recipes? Is that what your generic validator would do?

Yeah that's what I imagined. For example, the validator could do the following:

At this point, the recipe "works" as in does not error. But how do you verify the data are "right"? I see two options (these are not mutually exclusive):

  1. Write a formal test (your suggestion).
  2. Manually inspect the output data.

I have a hard time imagining what 1 would actually look like, but I'm definitely open to the idea. What might a formal verification test look like for the COGs you generate?

For 2, assuming this is all happening as part of a PR workflow, we could have a bot post to the PR with a link to the validation output data. Then the user could manually inspect the data.

ciaransweet commented 3 years ago

Well, I'd assume that someone writing a Recipe to take X and make it into Y would have some knowledge of both, and be able to provide at least a unit test verifying their dataset does indeed look how they want.

I think manual inspection puts a huge onus on either pangeo-forge to make sure the Recipes it accepts are 'good' or on the contributor to have to intervene and go 'yes my contribution is OK'.

Anything manual is a bit of a smell in that sense. There are libraries (rio-cogeo, for example) that can validate a COG; you can also open the output with rasterio and assert that things like NODATA values and CRSs are correct.
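As a concrete illustration of that kind of automated check: in practice you would read the fields with rasterio (e.g. `src.crs`, `src.nodata` from `rasterio.open`) and run rio-cogeo's `cog_validate` on the file; in this self-contained sketch the metadata is stubbed as a plain dict, and the expected values are hypothetical:

```python
def check_output_metadata(meta, expected_crs="EPSG:4326", expected_nodata=-9999):
    """Assert basic correctness of a generated raster's metadata.
    In a real test `meta` would come from rasterio, e.g.:
        with rasterio.open(path) as src:
            meta = {"crs": str(src.crs), "nodata": src.nodata}
    Returns a list of error strings; empty means the output looks right."""
    errors = []
    if meta.get("crs") != expected_crs:
        errors.append(f"bad CRS: {meta.get('crs')!r}")
    if meta.get("nodata") != expected_nodata:
        errors.append(f"bad nodata: {meta.get('nodata')!r}")
    return errors

# Stubbed metadata for a correct output
errors = check_output_metadata({"crs": "EPSG:4326", "nodata": -9999})
```

A check like this catches the "functionally works but produces junk" case without any human in the loop.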

Personally, I believe a contributor should provide a unit test that confirms their recipe outputs data correctly, and doesn't just not error. If I were hosting a bakery, I'd rather not be spending lots of money hosting dud datasets. In the grand scheme of things, I think providing unit tests is a normal expectation for contributions to OSS projects?

ciaransweet commented 3 years ago

Otherwise, Bakeries will need to manually check each recipe before they say 'yes, you can run your recipe in our bakery'

rabernat commented 3 years ago

This is a useful discussion and I really appreciate your view.

I would distinguish between a general OSS contributor and a Pangeo Forge recipe contributor. Our goal here is to really democratize the production of ARCO data in the cloud. I imagine we will have contributions from data managers and individual scientists who don't really know python at all (thus the potential yaml syntax). We want to make it easy for them. Requiring writing a custom python unit test for all recipes will significantly raise the bar for contribution.

Anything manual is a bit of a smell in that sense. For example, there are libraries (rio-cogeo) for example that can validate a COG, you can also open them with rasterio and assert stuff like NODATA values, CRS' are correct.

I agree that it's important to have validation for stuff like this. But I wonder if this always needs to be done at the individual recipe level. Assuming you are using a standard recipe class, and that class itself is well tested (accurately propagates data and metadata), can we not trust that it will work as expected?

I see this sort of individual-recipe testing as especially important when the recipe has lots of custom processing code.

I am not opposed to the idea of recipe tests. But I do wonder whether they are always needed, especially in simple cases.

andersy005 commented 3 years ago

I am not opposed to the idea of recipe tests. But I do wonder whether they are always needed, especially in simple cases.

šŸ‘šŸ½ I'm also in favor of making the recipe tests optional and maybe providing some user guides (in form of examples) of how one could go about writing custom tests for their recipes if need be.

ciaransweet commented 3 years ago

I think anything that is run should have some form of coverage. I am pretty 'adamant' about that.

Standard pangeo-forge provided recipes probably just need a smoke test with your dataset to make sure it doesn't fall over with the provided input/outputs.

But contributing a new Recipe class, that needs unit tests at the class implementation level, no ifs/buts.

I guess this is another case of the jump from recipe to Recipe Class. A recipe that just imports a provided class will need a lot less testing.

But if i'm contributing a recipe.py and a CiaransMagicalRecipe class, I'd want that unit tested to ensure that CiaransMagicalRecipe isn't just making giant COGs with no data in them šŸ˜…

In the culinary vocab, you should be able to prove your dataset bakes correctly, before handing it to the bakery. That should reside with the recipe, not the bakery.

rabernat commented 3 years ago

But contributing a new Recipe class, that needs unit tests at the class implementation level, no ifs/buts.

Absolutely! I would never disagree with that. Our goal is to have 100% test coverage in pangeo-forge. But a new Recipe class would be submitted to the pangeo-forge repo itself. That is different from just contributing a new recipe.

But some ambiguity creeps in if we allow recipes to define custom recipe classes e.g.

class CiaransMagicalRecipe(NetCDFtoZarrSequentialRecipe):
    ...

recipe = CiaransMagicalRecipe()

We should either disallow this completely or else definitely require complete test coverage.
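If the "disallow completely" route were taken, the contribution validator could detect recipe classes defined inside the contributed module itself, as opposed to imported from pangeo_forge. A sketch using the stdlib inspect module, with a stubbed BaseRecipe and an in-memory module (both hypothetical):

```python
import inspect
import types

class BaseRecipe:
    """Stub standing in for pangeo_forge.recipe.BaseRecipe."""

def find_custom_recipe_classes(module):
    """Return names of BaseRecipe subclasses defined *in* the module
    itself (not imported), which a strict policy would reject, or
    which would otherwise require complete test coverage."""
    return [
        name
        for name, obj in inspect.getmembers(module, inspect.isclass)
        if issubclass(obj, BaseRecipe)
        and obj is not BaseRecipe
        and obj.__module__ == module.__name__
    ]

# Simulate a recipe.py that defines its own custom subclass.
mod = types.ModuleType("recipe_file")
cls = type("CiaransMagicalRecipe", (BaseRecipe,), {})
cls.__module__ = "recipe_file"
mod.CiaransMagicalRecipe = cls
custom = find_custom_recipe_classes(mod)
```

The `__module__` comparison is what distinguishes "defined here" from "imported from the library", so classes merely imported from pangeo_forge would not be flagged.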

wildintellect commented 3 years ago

I get what @ciaranevans is saying, in the context of End Users submitting recipes.

  1. Write a Recipe, include some config on how to verify the outputs (using methods available in the code)
  2. PR the Recipe
  3. CI/CD Validate the config
  4. Run a sample of the data specified in the above config and verify its structure and values.
  5. Once step 4 passes, the whole dataset can be processed (this can be triggered automatically or manually)
  6. Run a verification on sample from the output (also using built in methods)

This is all separate from code contributions of functions and classes that become the methods used. All of these should have unit tests that CI/CD runs on commits/PRs to the code repos.
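The steps above can be sketched as a single CI entry point. Everything here, the helper names, the config shape, and the fake recipe, is hypothetical, just to show how the sample run gates the full run:

```python
def run_contribution_checks(recipe, config, sample_size=2):
    """Hypothetical CI driver for a contributed recipe:
    validate the verification config, run the recipe on a small
    sample of its inputs, then verify the sample output before
    the full dataset run is approved.
    `recipe` is any object with `input_urls` and a `process(urls)`
    method; `config` carries contributor-supplied output checks."""
    if "checks" not in config:                 # step 3: validate the config
        return False, "config must declare output checks"
    sample = recipe.input_urls[:sample_size]
    output = recipe.process(sample)            # step 4: sample run
    for check in config["checks"]:             # step 6: verify sample output
        if not check(output):
            return False, f"check {check.__name__} failed"
    return True, "sample validated; full run may proceed"

# Minimal fake recipe and check, for illustration only.
class FakeRecipe:
    input_urls = ["a.nc", "b.nc", "c.nc"]
    def process(self, urls):
        return {"n_inputs": len(urls)}

def has_inputs(output):
    return output["n_inputs"] > 0

ok, msg = run_contribution_checks(FakeRecipe(), {"checks": [has_inputs]})
```

Only once the sample passes would the full dataset be processed (step 5), which keeps the "burn a lot of šŸ’° on a recipe that isn't correct" risk confined to the small sample.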

ciaransweet commented 3 years ago

@rabernat with all the above discussions, I think that maybe in the next co-ordination meeting we should try and do a whiteboard session (we've used Miro here at DevSeed before, maybe we could do guest invites then export the resultant whiteboard for documentation) to outline a 'steel thread' through pangeo-forge, the recipes and bakeries.

An example that touches on all our high-level use-cases, which we can then use both as a way to test and work out our assumptions and as a 'golden' example of what's expected.

rabernat commented 3 years ago

An example that touches on all our high-level use-cases, which we can then use both as a way to test and work out our assumptions and as a 'golden' example of what's expected.

I have a certain example in my head that I have been using. I just submitted it as a draft PR in https://github.com/pangeo-forge/staged-recipes/pull/20.

rabernat commented 3 years ago

In 5aa5aca I pushed an update to the roadmap based on this discussion. Sorry for not PR-ing, but I wanted to get an update out to coincide with my talk tomorrow at the ECMWF Virtual workshop on Weather and climate in the cloud. We can continue to iterate on this after the next coordination meeting.