ploomber / soopervisor

☁️ Export Ploomber pipelines to Kubernetes (Argo), Airflow, AWS Batch, SLURM, and Kubeflow.
https://soopervisor.readthedocs.io
Apache License 2.0
45 stars 18 forks source link

Placeholders break soopervisor #116

Closed edublancas closed 1 year ago

edublancas commented 2 years ago

In Ploomber, we have placeholders that users can put in their pipeline.yaml. For example, to index experiments by {{git_hash}} or timestamp with {{now}}. However, the value of such placeholders is computed when loading the pipeline.yaml.

This works fine in Ploomber since the pipelines execute in a single machine; however, when using soopervisor, this breaks. For example, the {{git_hash}} isn't available because we don't copy the git repository, only the existing files. And {{now}} breaks because it's evaluated for each task. Hence, each one sees a different timestamp.

We can solve this by resolving placeholders when executing soopervisor export and putting a "resolved" pipeline.yaml in the Docker container. Ploomber already has a feature that allows defining placeholders in a separate file (env.yaml), we can use this feature to facilitate the implementation.

This is a high level description of what I think it's the best way to do this:

docker.build is the function that builds the docker image, and it already receives the path to the pipeline.yaml in the entry_point argument. We can use this value and pass it to this function (note that this is defined in ploomber), which will return is what env.yaml file the entry_point will use, let's call this value path_to_env.

We can use this function to get the values of the placeholders (like {{now}} and {{git_hash}}).

from ploomber.env.envdict import EnvDict

env=EnvDict({})

env._data

Then, we apply some logic depending on the value of path_to_env, right before compressing the source code.

if path_to_env is None, we create an env.yaml in the same directory as the pipeline.yaml, and write the contents of the placeholders, if path_to_env is not None, then we load its contents and append the values of the placeholders. However, we should not override keys (only add the ones that do not exist).

edublancas commented 2 years ago

note that this will also close https://github.com/ploomber/soopervisor/issues/114

edublancas commented 2 years ago

hey, @feribg. thoughts? wanna tackle this?

feribg commented 2 years ago

Yeah sounds fair, how about append or override null keys in env.yaml re the last step ? @edublancas

feribg commented 2 years ago

Do you mean here we should actually add the values function For example placeholders['git_hash'] = '{{git_hash}}' should be come placeholders['git_hash'] = repo.get_current_hash() or along those lines.

Should we not have code to remove those from evaluating in the runtime in the docker container then ? I guess it will still look for {{git_hash}} to replace at runtime.

Also I think we can use this generic mechanism and extend the scope here so that the env var substitution can be done here as well. IE referring env vars as placeholders in the pipeline config and then injecting them into this new .env file. This might require some convention for which im not sure env[] is the ideal syntax due to the common usage of those as special chars, but either way is fine.

edublancas commented 2 years ago

For the env, I think this is where we can pass os.environ: https://github.com/ploomber/ploomber/blob/856369d600ddd6a11f2ae06f415ea819191bc353/src/ploomber/env/expand.py#L170 - but I'm not 100% sure

feribg commented 1 year ago

@edublancas Here are the preliminary PRs https://github.com/ploomber/soopervisor/pull/120 https://github.com/ploomber/ploomber/pull/1042

I still have one question re {{now}} I dont think we can inject that in the env.yaml as the other vars as this is idempotent per run but not per build. IE i want all my steps to use the now as of starting the pipeline but not as of building it. How do you think we should tackle this, its the biggest blocker for now i think the rest is resolved there.

edublancas commented 1 year ago

can you clarify what you mean by run and build?

I think {{now}} should have the same value across all tasks, so it should be computed once when submitting to AWS Batch, and then tasks should use that value. I think that's desirable since it'll allow users to index their runs (e.g., today's run, yesterday's run, etc.).

However, I believe this is not what you want in your use case, but you want the {{now}} timestamp to be computed whenever a task starts, correct?

If so, I think the simplest way to solve it is by adding a new placeholder automatically, e.g., {{start}} - which can behave as I described (one per run), and we can leave {{now}} untouched.

feribg commented 1 year ago

That is indeed correct I want {{now}} to be evaluated when the whole pipeline starts, but is not the case with our current solution, be cause currently now would be inserted in the env.yaml upon building the docker image. So if i run this image 10 times at different times, now will be the same (as of building the image).

Currently now is being evaluated when a task starts (this is why we have a problem with dependencies because we put {{now}} in the path for products and downstream tasks run later hence cant find the deps of upstream tasks.

In a sense we almost need {{now_as_of_pipeline_run}} and {{now_as_of_task_run}}, the latter being the current behavior.