pangeo-forge / pangeo-forge-recipes

Python library for building Pangeo Forge recipes.
https://pangeo-forge.readthedocs.io/
Apache License 2.0

How reusable are our sequential functions (e.g., in Flyte, Bytewax, etc.)? #621

Open cisaacstern opened 9 months ago

cisaacstern commented 9 months ago

Recent discussion with @ljstrnadiii got me wondering how reusable our sequential functions are outside the Beam context. In general, we aim to follow this Beam programming guide best practice:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/c2927775cf0f992be4b93153028c345ca4f7c14c/pangeo_forge_recipes/transforms.py#L37-L43

In theory, this means those parts of our code could be wrapped in some other, non-Beam, parallelization framework, such as Flyte (a task orchestrator, which Len has experience with), or possibly Bytewax (another dataflow model, which has come up in our Coordination meetings). In practice, I'm not sure how difficult this would be.
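As a rough illustration of the idea (not code from this repository), here is a minimal sketch of a sequential function written as plain Python and then wrapped by a non-Beam executor. The names `describe_item` and `run_with_threads` and the example URLs are hypothetical, and `concurrent.futures` merely stands in for a framework like Flyte or Bytewax, whose own wrapping APIs are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Tuple

# An (index, url) pair; a hypothetical stand-in for the keyed inputs a recipe processes.
Indexed = Tuple[int, str]


def describe_item(item: Indexed) -> Indexed:
    """Hypothetical sequential function: plain Python with no Beam imports.

    It maps an (index, url) pair to (index, filename).
    """
    index, url = item
    return index, url.rsplit("/", 1)[-1]


def run_with_threads(
    func: Callable[[Indexed], Indexed],
    items: Iterable[Indexed],
    max_workers: int = 4,
) -> List[Indexed]:
    """Apply `func` to every item using a thread pool instead of a Beam runner."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order.
        return list(pool.map(func, items))


# In Beam, the same function would be applied with `beam.Map(describe_item)`.
urls = [
    (0, "https://example.com/data/a.nc"),
    (1, "https://example.com/data/b.nc"),
]
results = run_with_threads(describe_item, urls)
print(results)
```

Because `describe_item` knows nothing about the framework that calls it, swapping the thread pool for another executor should only change the wrapper, not the function. How well that holds for our real functions is the open question.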

I'm opening this issue for further discussion, particularly as a place to continue the conversation with Len re: Flyte, but also on this subject more generally. The maximalist version of this question is what it would take to actually support multiple data-parallel interfaces in Pangeo Forge. Having just come off the major Beam refactor effort, I think it's fair to say we don't have the appetite for that just yet, though in the big picture it's not entirely off the table. For the near term, I'm thinking more along the lines of helping others do this wrapping themselves.

alxmrs commented 8 months ago

To add my 2¢: as with Dask, I think the best abstraction would be to contribute Flyte or Bytewax runners to the Beam project.