ploomber / soopervisor

☁️ Export Ploomber pipelines to Kubernetes (Argo), Airflow, AWS Batch, SLURM, and Kubeflow.
https://soopervisor.readthedocs.io
Apache License 2.0

`--lazy` flag does not work when the pipeline has a client #105

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

When loading a pipeline that contains functions as tasks, we need to import them. This means the current Python environment must have every package that those functions' modules import. For example:

# we need sklearn and pandas to successfully load this file!
import sklearn
import pandas as pd

def my_task(product, upstream):
    pass

However, in some cases, it's desirable to load a pipeline without all these dependencies. For example, if a CD worker is pushing a pipeline to Kubernetes or AWS via soopervisor, we shouldn't require all those dependencies in the CD worker: it doesn't need them anyway (since it's only pushing the pipeline and not executing it) and installing packages slows down the build.

To fix that, we added a --lazy option, which loads a Ploomber pipeline in lazy mode without importing the functions. However, this option isn't working when the pipeline has a client configured (e.g. S3Client).

The loading process happens here (lines 96-98):

https://github.com/ploomber/soopervisor/blob/4e837a000f70214bed16c7e24d6d9fe9d6a2327e/src/soopervisor/commons/dag.py#L95

Then, we check if the dag has a client:

https://github.com/ploomber/soopervisor/blob/4e837a000f70214bed16c7e24d6d9fe9d6a2327e/src/soopervisor/commons/dag.py#L107

The problem is that, since this pipeline was loaded in lazy mode, we don't have the actual client but a dotted path string (e.g. clients.load_client). Calling dag.clients.get(File) triggers importing the dotted path to get the actual client; however, by default, Python doesn't add the current working directory to the import path (sys.path), so the import fails.
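To illustrate why the lookup depends on the import path, here is a minimal sketch of how a dotted path string can be resolved into an object. The helper name `load_dotted_path` is hypothetical and is not claimed to match Ploomber's internal implementation; it only shows that resolution goes through a regular module import, which searches `sys.path`.

```python
import importlib


def load_dotted_path(dotted_path):
    """Resolve a string such as 'clients.load_client' to the object it names.

    Hypothetical helper for illustration only, not Ploomber's actual code.
    """
    # split "package.module.attribute" into the module part and the attribute
    module_name, _, attribute = dotted_path.rpartition(".")
    # a regular import: it fails with ModuleNotFoundError if the directory
    # containing the module is not listed in sys.path
    module = importlib.import_module(module_name)
    return getattr(module, attribute)
```

With a module such as clients.py sitting in the project root, a call like `load_dotted_path("clients.load_client")` would only succeed if that directory is on `sys.path` at the time of the call.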

I think the best solution is to temporarily add the current working directory to the path. We can use this context manager:

https://github.com/ploomber/ploomber/blob/0fe1d97df24cb4161f741479933f605fd0c645fa/src/ploomber/util/util.py#L277
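As a rough sketch of the idea (not the linked Ploomber utility itself), a context manager along these lines would put the working directory on `sys.path` only for the duration of the client lookup. The name `cwd_on_sys_path` is an assumption made for this example.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def cwd_on_sys_path():
    """Temporarily place the current working directory first on sys.path.

    Hypothetical name; a sketch of the approach, not Ploomber's utility.
    """
    cwd = os.getcwd()
    sys.path.insert(0, cwd)
    try:
        yield cwd
    finally:
        # runs even if the body raised; removes the first matching entry,
        # which is the one inserted above unless the body changed sys.path
        try:
            sys.path.remove(cwd)
        except ValueError:
            pass
```

In soopervisor, the call that resolves the client (the dag.clients.get(File) lookup described above) would be wrapped in `with cwd_on_sys_path():`, so the dotted path can be imported without permanently changing the import path.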