Open jhamman opened 4 years ago
I think that using Prefect's Parameters is probably best. They make things a bit more complicated to debug outside of a Flow
context, but I don't think people will be doing that anyway. We'll need good documentation & examples on debugging within a flow context.
I have an example at https://github.com/TomAugspurger/noaa-oisst-avhrr-feedstock/blob/5aa7b9007d1055c4e03306b87358ac916d559e59/recipe/pipeline.py. A few things to note:
The Parameters are defined as attributes on the Pipeline. There's nothing special about them being there though (indeed, the Pipeline isn't really providing anything in that example).

    # Flow parameters
    days = Parameter(
        "days", default=pd.date_range("1981-09-01", "1981-09-10", freq="D")
    )
    variables = Parameter("variables", default=["anom", "err", "ice", "sst"])
    cache_location = Parameter(
        "cache_location", default=f"gs://pangeo-forge-scratch/cache/{name}.zarr"
    )
    target_location = Parameter(
        "target_location", default=f"gs://pangeo-forge-scratch/{name}.zarr"
    )

    @property
    def sources(self):
        source_url_pattern = (
            "https://www.ncei.noaa.gov/data/"
            "sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"
            "{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
        )
        source_urls = [
            source_url_pattern.format(
                yyyymm=day.strftime("%Y%m"), yyyymmdd=day.strftime("%Y%m%d")
            )
            for day in self.days
        ]
        return source_urls

    @property
    def flow(self):
        ...
        nc_sources = [
            download(x, cache_location=self.cache_location)
            for x in self.sources  # a regular python list
        ]

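Prefect Parameters can't be iterated over when the flow is being defined (a Parameter is itself a task whose value is only known at run time), so we need a "prefect-native" way of doing it. In this case, the source_url definition was moved to a task and we map it over the input parameter.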

    with Flow(self.name) as _flow:
        sources = source_url.map(self.days)

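For illustration, the per-day URL logic that moves into the mapped task might look roughly like the sketch below. The Prefect `@task` decorator is left out so the function runs as plain Python. The name `source_url` follows the snippet above; taking the day as an ISO date string, the `SOURCE_URL_PATTERN` constant, and the sample `days` list are assumptions made here, not part of the linked pipeline.

```python
from datetime import date

# URL pattern copied from the `sources` property shown earlier.
SOURCE_URL_PATTERN = (
    "https://www.ncei.noaa.gov/data/"
    "sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/"
    "{yyyymm}/oisst-avhrr-v02r01.{yyyymmdd}.nc"
)


def source_url(day: str) -> str:
    """Build the download URL for one day given as an ISO string (assumed input type)."""
    parsed = date.fromisoformat(day)
    return SOURCE_URL_PATTERN.format(
        yyyymm=parsed.strftime("%Y%m"),
        yyyymmdd=parsed.strftime("%Y%m%d"),
    )


# Plain-Python stand-in for `source_url.map(days)` inside a Flow:
# the function is applied once per element of the input.
days = ["1981-09-01", "1981-09-02"]
urls = [source_url(d) for d in days]
```

Inside a flow, the list comprehension would be replaced by the `.map` call, since the value of `days` is not available while the flow is being built.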
One potential downside of Prefect Parameters: they must be JSON serializable (which I just ran into, since datetime.datetime objects aren't JSON serializable).
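A small standard-library sketch of the serialization problem and one possible workaround: pass the days as ISO-formatted strings and parse them back where they are used. The variable names (`days`, `iso_days`, `restored`) are illustrative only, and whether string dates are the right choice for the real pipeline is left open.

```python
import json
from datetime import datetime, timedelta

# Ten consecutive days, mirroring the default range used for `days` above.
start = datetime(1981, 9, 1)
days = [start + timedelta(days=i) for i in range(10)]

# datetime objects are rejected by the default JSON encoder.
try:
    json.dumps(days)
    serializable = True
except TypeError:
    serializable = False

# Workaround: use ISO date strings as the parameter value...
iso_days = [d.strftime("%Y-%m-%d") for d in days]
payload = json.dumps(iso_days)

# ...and convert back to datetime objects where they are consumed.
restored = [datetime.strptime(s, "%Y-%m-%d") for s in json.loads(payload)]
```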
Now that we've started creating actual pangeo forge datasets, we're starting to notice the need for flow parameterization. I'll provide a few examples of where we may want to use a parameterized variable in our flows:
target_url
without needing to run the flow twice.
Prefect supports parameterizing flows (https://docs.prefect.io/core/examples/parameterized_flow.html). The question is whether we want to use the Prefect functionality or move toward a pangeo-forge API for this sort of thing.