pangeo-forge / pangeo-forge-runner

Run pangeo-forge recipes on Apache Beam
https://pangeo-forge-runner.readthedocs.io
Apache License 2.0

Split this project into two #27

Open yuvipanda opened 2 years ago

yuvipanda commented 2 years ago

Based on https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/115#issuecomment-1247479613 and https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/115#issuecomment-1246057157, I think we need to split this project into two.

Part 1

This should be responsible for:

  1. Fetching the appropriate feedstock from wherever (GitHub, Zenodo, etc.) onto the local filesystem
  2. Creating an appropriate environment for the recipe to be parsed and run in. This can be via conda or via docker, and must be pluggable

Most importantly, there should be no arbitrary code execution here. So it can read meta.yaml (carefully hehe) but not exec any .py files. This is what the orchestrator will call.

It will also not be tied to the version of pangeo-forge-recipes that a given feedstock needs.
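
A minimal sketch of what "pluggable" could mean for environment creation in Part 1. All names here (`EnvSpec`, `EnvBuilder`, `get_builder`, the `.env` directory, the `example/recipe-env:latest` image) are hypothetical and not part of pangeo-forge-runner. Each backend only describes the command prefix that would run something inside the environment; nothing from the feedstock is imported or executed.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class EnvSpec:
    """Hypothetical description of a prepared environment."""

    backend: str
    # Prepended to any command that should run inside the environment.
    command_prefix: tuple[str, ...]


class EnvBuilder(ABC):
    """Interface every environment backend implements."""

    name: str

    @abstractmethod
    def plan(self, feedstock_dir: Path) -> EnvSpec:
        """Describe how to run commands in the environment, without running any."""


class CondaBuilder(EnvBuilder):
    name = "conda"

    def plan(self, feedstock_dir: Path) -> EnvSpec:
        env_dir = feedstock_dir / ".env"  # assumed location, for illustration only
        return EnvSpec(self.name, ("conda", "run", "--prefix", str(env_dir)))


class DockerBuilder(EnvBuilder):
    name = "docker"

    def __init__(self, image: str = "example/recipe-env:latest"):  # hypothetical image
        self.image = image

    def plan(self, feedstock_dir: Path) -> EnvSpec:
        mount = f"{feedstock_dir}:/feedstock"
        return EnvSpec(
            self.name,
            ("docker", "run", "--rm", "--volume", mount, self.image),
        )


# Registry of available backends; adding a backend means adding one class here.
BUILDERS = {cls.name: cls for cls in (CondaBuilder, DockerBuilder)}


def get_builder(backend: str) -> EnvBuilder:
    """Look up a backend by name, failing loudly on unknown names."""
    try:
        return BUILDERS[backend]()
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; choose from {sorted(BUILDERS)}"
        ) from None
```

The orchestrator would then only need the backend name, e.g. `get_builder("conda").plan(Path("/tmp/feedstock"))`, and could hand the resulting prefix to whatever launches Part 2.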

Part 2

This should be responsible for actually executing arbitrary user code (in the recipe.py file). This will be run in the environment created by part 1, and can be tied to a specific version of pangeo-forge-recipes. This part will be a separate Python package, and should be installed in the environment created for it by part 1.
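
A rough sketch of the isolation boundary, assuming the caller holds a command prefix that starts an interpreter inside the prepared environment. The function name `run_in_environment` and its parameters are made up for illustration; the real entry point of the Part 2 package is not decided in this issue. The point is that the calling process never imports recipe.py, it only launches a child process and reads its output.

```python
import subprocess
import sys
from pathlib import Path


def run_in_environment(command_prefix, recipe_path, timeout=60):
    """Run recipe_path in a child process; the caller never imports it.

    command_prefix is a list of arguments that starts an interpreter in the
    target environment, e.g. [sys.executable] for the current interpreter.
    """
    cmd = [*command_prefix, str(Path(recipe_path))]
    # check=False so a failing recipe is reported via returncode, not an exception.
    return subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )


if __name__ == "__main__":
    # Assumes a recipe.py exists in the working directory.
    result = run_in_environment([sys.executable], "recipe.py")
    print(result.returncode, result.stdout, result.stderr)
```

With a conda environment the prefix might instead look like `["conda", "run", "--prefix", "<env dir>", "python"]`, so the pangeo-forge-recipes version seen by the recipe is whatever is installed in that environment rather than in the caller's.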

Open questions

sharkinsspatial commented 2 years ago

@cisaacstern As requested, I'm referencing our recent experience trying to incorporate an arbitrary third-party library (https://github.com/nsidc/earthdata) while creating recipes for NASA datasets. These datasets require Earthdata Login (EDL) authentication with sessions rather than simple basic authentication, because of the HTTP redirects the endpoint performs when the recipe runs in us-west-2.

This is a good example of some of the use cases discussed in https://github.com/pangeo-forge/pangeo-forge-orchestrator/issues/115#issuecomment-1246057157.