Closed edublancas closed 1 year ago
note that this will also close https://github.com/ploomber/soopervisor/issues/114
hey, @feribg. thoughts? wanna tackle this?
Yeah sounds fair, how about append or override null keys in env.yaml re the last step ? @edublancas
Do you mean here we should actually add the values function
For example placeholders['git_hash'] = '{{git_hash}}'
should be come placeholders['git_hash'] = repo.get_current_hash()
or along those lines.
Should we not have code to remove those from evaluating in the runtime in the docker container then ? I guess it will still look for {{git_hash}} to replace at runtime.
Also I think we can use this generic mechanism and extend the scope here so that the env var substitution can be done here as well. IE referring env vars as placeholders in the pipeline config and then injecting them into this new .env file. This might require some convention for which im not sure env[]
is the ideal syntax due to the common usage of those as special chars, but either way is fine.
For the env, I think this is where we can pass os.environ
: https://github.com/ploomber/ploomber/blob/856369d600ddd6a11f2ae06f415ea819191bc353/src/ploomber/env/expand.py#L170 - but I'm not 100% sure
@edublancas Here are the preliminary PRs https://github.com/ploomber/soopervisor/pull/120 https://github.com/ploomber/ploomber/pull/1042
I still have one question re {{now}}
I dont think we can inject that in the env.yaml as the other vars as this is idempotent per run but not per build. IE i want all my steps to use the now
as of starting the pipeline but not as of building it. How do you think we should tackle this, its the biggest blocker for now i think the rest is resolved there.
can you clarify what you mean by run and build?
I think {{now}}
should have the same value across all tasks, so it should be computed once when submitting to AWS Batch, and then tasks should use that value. I think that's desirable since it'll allow users to index their runs (e.g., today's run, yesterday's run, etc.).
However, I believe this is not what you want in your use case, but you want the {{now}}
timestamp to be computed whenever a task starts, correct?
If so, I think the simplest way to solve it is by adding a new placeholder automatically, e.g., {{start}}
- which can behave as I described (one per run), and we can leave {{now}}
untouched.
That is indeed correct I want {{now}}
to be evaluated when the whole pipeline starts, but is not the case with our current solution, be cause currently now would be inserted in the env.yaml
upon building the docker image. So if i run this image 10 times at different times, now will be the same (as of building the image).
Currently now is being evaluated when a task starts (this is why we have a problem with dependencies because we put {{now}}
in the path for products and downstream tasks run later hence cant find the deps of upstream tasks.
In a sense we almost need {{now_as_of_pipeline_run}}
and {{now_as_of_task_run}}
, the latter being the current behavior.
In Ploomber, we have placeholders that users can put in their
pipeline.yaml
. For example, to index experiments by{{git_hash}}
or timestamp with{{now}}
. However, the value of such placeholders is computed when loading thepipeline.yaml
.This works fine in Ploomber since the pipelines execute in a single machine; however, when using soopervisor, this breaks. For example, the
{{git_hash}}
isn't available because we don't copy the git repository, only the existing files. And{{now}}
breaks because it's evaluated for each task. Hence, each one sees a different timestamp.We can solve this by resolving placeholders when executing
soopervisor export
and putting a "resolved"pipeline.yaml
in the Docker container. Ploomber already has a feature that allows defining placeholders in a separate file (env.yaml
), we can use this feature to facilitate the implementation.This is a high level description of what I think it's the best way to do this:
docker.build
is the function that builds the docker image, and it already receives the path to thepipeline.yaml
in theentry_point
argument. We can use this value and pass it to this function (note that this is defined in ploomber), which will return is whatenv.yaml
file theentry_point
will use, let's call this valuepath_to_env
.We can use this function to get the values of the placeholders (like {{now}} and {{git_hash}}).
Then, we apply some logic depending on the value of
path_to_env
, right before compressing the source code.if
path_to_env
isNone
, we create anenv.yaml
in the same directory as thepipeline.yaml
, and write the contents of the placeholders, ifpath_to_env
is not None, then we load its contents and append the values of the placeholders. However, we should not override keys (only add the ones that do not exist).