ploomber / soopervisor

☁️ Export Ploomber pipelines to Kubernetes (Argo), Airflow, AWS Batch, SLURM, and Kubeflow.
https://soopervisor.readthedocs.io
Apache License 2.0

bundling local libraries when building Docker image #85

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

Users may want to install packages that are not available publicly on PyPI. A typical use case is an internal library stored in some git repository or a private registry.

For example, assuming the local environment is properly configured to authenticate with a private git repository, someone can easily install that package with a requirements.txt:

git+https://path.to/repo

However, in some cases, soopervisor export might not run locally but on a CI/CD server; hence, installing from private repositories won't work unless the CI/CD server is properly authenticated to access the repository (https://path.to/repo).

This also applies to Ploomber Cloud users since the Docker image is built remotely. In Ploomber Cloud's case, this solution is even more desirable since users will prefer not to disclose their credentials.

The solution is for the Docker image to be able to pick up packages from a local directory. For example, a user might have the following layout:

lib/
  package_a/
  package_b/

tasks/
  load.py
  clean.py
  fit.py

pipeline.yaml

The Dockerfile should configure the PYTHONPATH so that it includes the lib/ directory; hence when starting Python, import package_a and import package_b will work.
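For illustration, a minimal sketch of what the relevant Dockerfile lines could look like (the /project paths mirror soopervisor's existing Dockerfiles; the exact layout here is an assumption):

    # copy the user's local packages into the image
    COPY lib/ /project/lib/
    # make everything under lib/ importable, so "import package_a" resolves
    ENV PYTHONPATH="${PYTHONPATH}:/project/lib"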

To get package_a under lib/, users can use pip install, but we should do some research to find out how; these two look promising (a third option is sketched after the listing):

  --root <dir>                Install everything relative to this alternate root
                              directory.
  --prefix <dir>              Installation prefix where lib, bin and other top-level
                              folders are placed
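As a third option (an untested sketch, not mentioned in the help output above): pip's --target flag installs a package, plus its dependencies, directly into a given directory, which maps nicely onto the lib/ layout:

    # drop the private package's code straight into lib/
    pip install --target lib/ "git+https://path.to/repo"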

The final thing to take into account is that for packages that are not pure-Python (e.g., the ones that use C extensions), if the local OS is different from the OS where the pipeline will be executed, we might run into issues; but I think we can ignore that for now.

cc @idomic @Wxl19980214

Wxl19980214 commented 2 years ago

I think I am getting a sense of this issue. But how do we know which packages are not on PyPI? Maybe look into requirements.txt and see if a line starts with git+? Or can we just download all of the user's packages under their lib/ dir?

Wxl19980214 commented 2 years ago

So here is my understanding: the command that creates a Docker image is soopervisor add, and the resulting Dockerfile looks like this:

FROM condaforge/mambaforge:4.10.1-0

COPY requirements.lock.txt project/requirements.lock.txt

RUN pip install --requirement project/requirements.lock.txt && rm -rf /root/.cache/pip/

COPY dist/* project/
WORKDIR /project/

RUN tar --strip-components=1 -zxvf *.tar.gz
RUN cp -r /project/ploomber/ /root/.ploomber/ || echo 'ploomber home does not exist'

So instead of just running "RUN pip install --requirement project/requirements.lock.txt && rm -rf /root/.cache/pip/", we should add something like "RUN pip install --root <dir>" to get packages that are potentially not on PyPI? I am not really sure what you mean by configuring PYTHONPATH in the Dockerfile.

idomic commented 2 years ago

Not pip install.

We should first check if a dependency lib exists; if so, we should import it into the image:

lib/
  package_a/
  package_b/

Then we should export the path of the lib into PYTHONPATH so it's available for use within Python.

edublancas commented 2 years ago

Yeah, so downloading the packages is up to the user (maybe they just copy-paste the files into lib/). The assumption is that the user knows how to get the package's source code into lib/ - we don't have to figure out whether they're on PyPI or not.

The solution is to ensure that whatever is in lib/ is importable from Python; to do that, we need to ensure the folder is on PYTHONPATH.

Note that lib/ and requirements.txt can be used at the same time, they are not mutually exclusive.

Wxl19980214 commented 2 years ago

So to check if a dependency lib exists, I did something like this: python -m site --user-site, which prints /Users/xilinwang/Library/Python/2.7/lib/python/site-packages, i.e., the directory holding all of Python's user packages. Is this the lib/ you have been talking about? So the rest is to add something like this to the Dockerfile: ENV PYTHONPATH "${PYTHONPATH}:/Users/xilinwang/Library/Python/2.7/lib/python/site-packages"?

idomic commented 2 years ago

On the second part yes; on the first, no. This is a user's custom lib that's not part of the Python lib path.

Wxl19980214 commented 2 years ago

Sorry, I am not familiar with Python packaging. Where is a user's custom package usually located? Is it in their $PYTHONPATH? For example, if I download a package from git, where exactly will it be located?

idomic commented 2 years ago

No worries, this thread should give you the context you need on importing custom modules. In short, you have to have the module present, and you need to point the interpreter at its path via the PYTHONPATH environment variable.
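For a quick local check, assuming the layout from the issue description, something like this demonstrates the mechanism:

    # package_a becomes importable without being installed into
    # site-packages, because lib/ is now on the interpreter's search path
    PYTHONPATH=lib python -c "import package_a"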

idomic commented 2 years ago

@Wxl19980214 with Path('lib').exists() you can check if the library exists via Python, then use a template parameter to add it to the Docker image in a similar manner to this:

{%- set name = 'environment.lock.yml' if conda else 'requirements.lock.txt' %}

COPY {{name}} project/{{name}}

reference: https://github.com/ploomber/soopervisor/blob/7fb1e64d1abaaad7cb4d6432b9511d5feda9c0eb/src/soopervisor/assets/kubeflow/Dockerfile
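A rough sketch of what the template change could look like (lib_exists is a hypothetical variable the exporter would set based on the Path('lib').exists() check):

    {%- if lib_exists %}
    # bundle the user's local packages and make them importable
    COPY lib/ project/lib/
    ENV PYTHONPATH="${PYTHONPATH}:/project/lib"
    {%- endif %}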

Wxl19980214 commented 2 years ago

Looks like Django lol. I will take a look. But should I change all of the Dockerfiles? We have a Dockerfile under each backend platform?

idomic commented 2 years ago

Yes, start with one, see that it works, then replicate to the rest of the backends.