When loading a pipeline that contains functions as tasks, we need to import them. This means the current Python environment must have every package those functions import. For example:
```python
# we need sklearn and pandas to successfully load this file!
import sklearn
import pandas as pd


def my_task(product, upstream):
    pass
```
However, in some cases it's desirable to load a pipeline without all these dependencies. For example, if a CD worker is pushing a pipeline to Kubernetes or AWS via soopervisor, we shouldn't require all those dependencies in the CD worker: it doesn't need them anyway (since it's only pushing the pipeline, not executing it), and installing packages slows down the build.

To fix that, we added a `--lazy` option, which loads a Ploomber pipeline in lazy mode and doesn't import the functions. However, this option isn't working when the pipeline has a client configured (e.g., `S3Client`).

The loading process happens here (lines 96-98):

https://github.com/ploomber/soopervisor/blob/4e837a000f70214bed16c7e24d6d9fe9d6a2327e/src/soopervisor/commons/dag.py#L95

Then, we check if the DAG has a client:

https://github.com/ploomber/soopervisor/blob/4e837a000f70214bed16c7e24d6d9fe9d6a2327e/src/soopervisor/commons/dag.py#L107

The problem is that since the pipeline was loaded in lazy mode, we don't have the actual client but a dotted path string (e.g., `clients.load_client`). Calling `dag.clients.get(File)` triggers an import of the dotted path to get the actual client; however, by default, Python doesn't add the current working directory to the path, so the import fails.
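To make the failure mode concrete, resolving a dotted path boils down to something like the sketch below (`load_dotted_path` is a hypothetical helper for illustration, not ploomber's actual implementation):

```python
import importlib


def load_dotted_path(dotted_path):
    """Resolve a string like "clients.load_client" to the actual object."""
    # split into module name ("clients") and attribute ("load_client")
    module_name, _, attr_name = dotted_path.rpartition('.')
    # this raises ModuleNotFoundError when clients.py lives in the current
    # working directory but the cwd is not on sys.path
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)
```

So `load_dotted_path('clients.load_client')` fails with `ModuleNotFoundError` whenever `clients.py` only exists in the working directory and the working directory isn't on `sys.path`.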
I think the best solution is to momentarily add the current working directory to the path; we can use this context manager:

https://github.com/ploomber/ploomber/blob/0fe1d97df24cb4161f741479933f605fd0c645fa/src/ploomber/util/util.py#L277
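A minimal sketch of such a context manager (names assumed for illustration; the linked ploomber utility may differ in its details):

```python
import sys
from contextlib import contextmanager


@contextmanager
def add_to_sys_path(path):
    """Temporarily prepend `path` to sys.path so that modules located
    there (e.g., clients.py in the cwd) can be imported."""
    path = str(path)
    sys.path.insert(0, path)
    try:
        yield
    finally:
        # restore sys.path even if the import inside the block raises
        sys.path.remove(path)
```

Wrapping the `dag.clients.get(File)` call in `with add_to_sys_path(os.getcwd()):` would let the dotted path resolve without permanently mutating `sys.path`.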