ploomber / soopervisor

☁️ Export Ploomber pipelines to Kubernetes (Argo), Airflow, AWS Batch, SLURM, and Kubeflow.
https://soopervisor.readthedocs.io
Apache License 2.0

support for multiple requirements.txt #77

Closed edublancas closed 2 years ago

edublancas commented 2 years ago

(Note: this only applies to Docker-based exporters like AWS Batch and Airflow.)

Currently, when generating the Docker image to run tasks, users need to list the dependencies they want to include. For example, a user may have the following:

# content of requirements.txt
scikit-learn
pandas

However, in some cases, users may have a long list of dependencies. This is often the case when using many ML packages that implement different models (to find which model works best). Unfortunately, long environments are sometimes impossible to set up if two or more dependencies have conflicts, e.g., scikit-learn requires numpy==1.2 but xgboost requires numpy>1.2. If that happens, users will be unable to create a single Docker image.

To fix this, we should allow them to declare multiple requirements.txt files and then generate multiple Docker images; note that this does not require multiple Dockerfiles, since we can re-use the same one with a different requirements.txt and tag the results differently so they effectively become multiple images.

To keep things simple, we can have a naming convention. For example, if a DAG has the tasks load, clean, fit-0, fit-1, and fit-2, we could do something like this:

If a user has requirements.txt and nothing else, then generate a single image.

If a user has requirements.txt and any other files that match the format requirements.*.txt: match filenames to task names, see below.

Pattern matching

(we can use fnmatch for this)

Say the user has requirements.txt and requirements.fit-*.txt:

Use requirements.fit-*.txt for tasks fit-0, fit-1, and fit-2; use requirements.txt for all the other tasks.
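
For illustration, a minimal sketch of that matching, assuming fnmatch/glob and a flat project directory (the helper name is hypothetical, not soopervisor's actual API):

    from fnmatch import fnmatch
    from glob import glob


    def requirements_for_task(task_name, default='requirements.txt'):
        """Return the requirements file to use for task_name (hypothetical helper)."""
        for path in sorted(glob('requirements.*.txt')):
            if path.endswith('.lock.txt'):
                continue  # lock files are resolved separately

            # 'requirements.fit-*.txt' -> pattern 'fit-*'
            pattern = path[len('requirements.'):-len('.txt')]

            if fnmatch(task_name, pattern):
                return path

        return default

    # requirements_for_task('fit-0') -> 'requirements.fit-*.txt'
    # requirements_for_task('load')  -> 'requirements.txt'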

Open question: is the * problematic? I gave it a try and there don't seem to be any issues on UNIX if a filename has * in it. The other thing is to ensure that Docker allows * in image names; if not, we should use another wildcard (perhaps a double underscore, __).

Changes should be applied here

The docker.build function currently returns a list of image names; with this change, we should extend it to return all the generated images.

Modifying the AWS Batch exporter

Once images are pushed, the AWS Batch exporter needs to pattern-match the task names to the right images.
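
As a sketch of that lookup (the function name and the dict shape are assumptions, not the current docker.build return value), the exporter side could mirror the same pattern matching:

    from fnmatch import fnmatch


    def image_for_task(task_name, images, default_image):
        """images maps a requirements pattern (e.g. 'fit-*') to an image name (hypothetical)."""
        for pattern, image in images.items():
            if fnmatch(task_name, pattern):
                return image

        return default_image

    images = {'fit-*': 'repo/project:latest-fit'}
    # image_for_task('fit-1', images, 'repo/project:latest') -> 'repo/project:latest-fit'
    # image_for_task('load', images, 'repo/project:latest')  -> 'repo/project:latest'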

Testing

We should test that given some combinations of requirements.txt files, the docker CLI is called with the right arguments and that the tasks are scheduled in AWS with the right arguments as well.

neelasha23 commented 2 years ago

To clarify: the requirements.txt referred to here is the one inside the projects folder that the user has created? e.g., for the templates/ml-intermediate project it would be https://github.com/ploomber/projects/blob/master/templates/ml-intermediate/requirements.txt?

I'm trying out the tutorial here https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html to understand the soopervisor workflow and have the following issues:

  1. I see that in https://github.com/ploomber/soopervisor/blob/master/tutorials/airflow/Dockerfile we are already downloading the ml-intermediate example and installing the requirements.txt, and this step as well: COPY soopervisor-airflow.yaml soopervisor-airflow.yaml. But the same steps are repeated in https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#get-sample-ploomber-pipeline and https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#configure-target-platform. Why is that so?

  2. I'm stuck at the pip install -r requirements.txt step in https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#get-sample-ploomber-pipeline. There seem to be a lot of dependency incompatibility issues:

    flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.1.3 which is incompatible.
    flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.0.3 which is incompatible.
    Successfully installed click-8.1.3 jinja2-3.0.3

If I manually install the latest Flask version I get this error:

nbconvert 6.5.0 requires jinja2>=3.0, but you have jinja2 2.11.3 which is incompatible.
black 22.3.0 requires click>=8.0.0, but you have click 7.1.2 which is incompatible.
edublancas commented 2 years ago

To clarify: the requirements.txt referred to here is the one inside the projects folder

yes, the typical structure is:

project-name/
    pipeline.yaml
    requirements.txt
    ...

I'm trying out the tutorial here https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html for

Use the argo exporter; airflow is a bit harder to get working because it has too many dependencies and tends to break the environment (as you saw). You don't need to have Kubernetes running. You can follow these quick instructions: https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html

but if you want to see the full end to end workflow working (you probably need this once you start making code changes), check this one out: https://soopervisor.readthedocs.io/en/latest/tutorials/kubernetes.html

let me know if you have questions about the tutorial or the code in general

neelasha23 commented 2 years ago
  1. Are we looking for requirements.*.txt files or requirements.*.lock.txt files? Is the user expected to provide all the necessary requirements.*.lock.txt files or environment.*.lock.yml files (assuming the environment.*.lock.yml naming convention will be similar to that of the requirements.*.lock.txt files)?

  2. Is it possible to have no requirements.txt file and only requirements.*.txt files, e.g., requirements.load.txt, requirements.clean.txt, requirements.plot.txt?

edublancas commented 2 years ago

good question. Currently, we require requirements.lock.txt, so let's stick with requirements.*.lock.txt for now.

yes, the convention will be environment.*.lock.yml

  1. Is the user expected to provide all the necessary requirements.*.lock.txt files or environment.*.lock.yml files

no, they should be optional. For example, if the user has a task "fit" in their pipeline, but there is no requirements.fit.lock.txt, then we should use requirements.lock.txt.

2. Is it possible to have no requirements.txt file and only requirements.*.txt files, e.g., requirements.load.txt, requirements.clean.txt, requirements.plot.txt ?

this is related to the previous question. the only mandatory one is requirements.lock.txt

neelasha23 commented 2 years ago

On following the example here: https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html, these are my observations and high-level design. @edublancas please provide your inputs.

  1. The change will affect the following exporters: argo, airflow, aws/batch, kubeflow.

  2. Once the backend is added, the _add method of the respective exporter is called, in which the Dockerfile is created. In the case of Argo it would be https://github.com/neelasha23/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/argo/export.py#L32. No code changes are required here in any of the exporters since we reuse the same Dockerfile.

  3. Changes for step 1 (generating images) to be done in https://github.com/ploomber/soopervisor/blob/feee82d168f0527b198db746acc1411243f0f3f6/src/soopervisor/commons/docker.py#L30

    a. Here we need to add checks for requirements.*.lock.txt / environment.*.lock.yml files, e.g., if the pattern-matched files are requirements.txt, requirements.fit-0.txt and requirements.fit-1.txt, then there should be corresponding requirements.lock.txt, requirements.fit-0.lock.txt and requirements.fit-1.lock.txt. Similar validation for environment.*.lock.yml files.

    b. https://github.com/ploomber/soopervisor/blob/feee82d168f0527b198db746acc1411243f0f3f6/src/soopervisor/commons/docker.py#L30

    Pseudocode:

    all_images = []
    # assuming we got this list via fnmatch / glob:
    for file in ['requirements.lock.txt', 'requirements.fit-0.lock.txt']:  # or the environment lock files
         # create a fresh folder for each requirements file
         e.rm('dist')
         e.cp(file)
         ...
         # task would be fit-0 / fit-1
         image_local = f'{pkg_name}-{task}:{version}'
         ...
         all_images.append(image_target)
    ...
    return pkg_name, all_images

    Some doubts here:

  4. The code here https://github.com/neelasha23/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/commons/docker.py#L87 is copying all the files in the project folder. We need to copy only the relevant requirements.*.lock.txt file (exclude the remaining requirements files using the exclude parameter?). What about the other files? Do we need to copy only the relevant task files, or will every image contain all the project files?

  1. I have tried with repository = null in soopervisor.yaml file. Can this be replaced by a personal private repository in DockerHub?

  2. Do you think I need to explore Docker-compose for this task?

  3. Let's say task1's requirements.txt is in image1, and task2 (whose upstream task is task1) is in image2 with a different requirements file. What if task1 generates an output file that is required by task2? Will having different images not cause an issue here when the export happens?

idomic commented 2 years ago
  1. We're also copying the stats local dir of the user if I'm not wrong; as long as the files aren't too big, I think we can keep it as is.
  2. It does, I was testing it with my docker hub repo.
  3. Why do you think we need docker-compose here?
  4. Good point! Currently in the cloud we are saving the artifacts in S3; this could also apply here, and the user can consume them via the clients functionality. If we need to pass files between containers/images we're coupling them together, which is the opposite of what we'd like.
neelasha23 commented 2 years ago

So I have made some changes in docker.py here (looping over all the requirements.*.lock.txt files): https://github.com/neelasha23/soopervisor/blob/issue_77/src/soopervisor/commons/docker.py

Now the issue is: it runs fine for the first requirements file, but for the subsequent one it throws the error FAILED tests/test_commons.py::test_docker_build_multiple_requirement - FileNotFoundError: [Errno 2] No such file or directory: 'dist'

On debugging I realized that at this point here: https://github.com/ploomber/ploomber/blob/4bf4d937e1dee8d3defe9a3e89e83720860819d2/src/ploomber/io/_commander.py#L260

on the first run the dst folder (which looks like /private/var/folders/ff/jydfr9....assets/multiple_requirements_project/some-env/dist) does not exist, hence doesn't get deleted. But on the second run it exists and gets deleted, and in turn the dist folder also gets removed from the project inside this function: https://github.com/ploomber/ploomber/blob/4bf4d937e1dee8d3defe9a3e89e83720860819d2/src/ploomber/io/_commander.py#L24

So I added an explicit delete method in ploomber/io/_commander.py like so:

    def force_delete(self):
        self.rm(*self._to_delete)

and called this method here: https://github.com/neelasha23/soopervisor/blob/2bcd0099300fc67e5f5508dc9b1eb1846da0879d/src/soopervisor/commons/docker.py#L187

But this doesn't seem to solve the problem. Although the dst folder gets deleted with this change, it somehow comes back before the second run.

Any suggestions on what exactly I'm missing here?

edublancas commented 2 years ago

some feedback:

task would be fit-0/ fit-1

  image_local = f'{pkg_name}-{task}:{version}'

We want to generate one image per pattern, not one image per task. Think of a scenario where a pipeline has 100 tasks but only a requirements.a.txt and a requirements.b.txt: we want to generate two images, not 100.

is copying all the files in the project folder.

yeah that's fine. we can copy the entire project in all images

I have tried with repository = null in soopervisor.yaml file. Can this be replaced by a personal private repository in DockerHub?

Yes

Do you think I need to explore Docker-compose for this task?

What would the benefit be?

Let's say task1's requirements.txt is in image1, and task2 (whose upstream task is task1) is in image2 with a different requirements file. What if task1 generates an output file that is required by task2? Will having different images not cause an issue here when the export happens?

That's fine. Tasks can use different images, and there would be no problems when passing input/output data. But note something: after we generate the docker image (I say "the" because we only support one right now), we pass the name and tag so the next step uses it to schedule the jobs.

The implementation of that next step depends on the backend. For example, when using Argo, we need to pass the name and tag to a specific field in the argo.yml file; when using AWS Batch, we need to pass the name and tag to the boto3 API call that schedules the job. Given that our current priority is the AWS Batch export, let's first implement it there.
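
For the AWS Batch side, a hedged sketch of where the resolved image name/tag would end up (the boto3 calls are real, but the job definition values, queue name, command, and registry are placeholders, not soopervisor's actual code):

    import boto3

    client = boto3.client('batch')

    # image resolved for this task via the pattern matching described above (placeholder value)
    image = '123456789012.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-fit'

    jd = client.register_job_definition(
        jobDefinitionName='fit-0',
        type='container',
        containerProperties={
            'image': image,
            'command': ['ploomber', 'task', 'fit-0'],
            'resourceRequirements': [
                {'type': 'VCPU', 'value': '1'},
                {'type': 'MEMORY', 'value': '2048'},
            ],
        },
    )

    client.submit_job(
        jobName='fit-0',
        jobQueue='my-job-queue',
        jobDefinition=jd['jobDefinitionArn'],
        dependsOn=[{'jobId': 'id-of-upstream-job'}],
    )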

But this doesnt seem to solve the problem. Although the dst folder gets deleted with this change it somehow comes back before the second run. Any suggestions on what exactly I'm missing here?

I went through the code to try to find what the error is but I couldn't. My suggestion is to use pdb and go line by line during the execution of docker.build to see when dist gets created, when it gets deleted and so on. once you have more info, post it here so we can provide more guidance

neelasha23 commented 2 years ago

We want to generate one image per pattern, not one image per task. Think of an scenario of a pipeline that has 100 tasks, but only a requirements.a.txt and requirements.b.txt, we want to generate two images, not 100.

Yes, I realised this once I started making the code changes after commenting, and it's currently as per the requirement above.

Do you think I need to explore Docker-compose for this task?

What would the benefit be?

When I started looking into Docker Compose, I realised it's useful for spinning up multiple containers corresponding to different types of applications, so I don't think it's needed for this use case.

I went through the code to try to find what the error is but I couldn't. My suggestion is to use pdb and go line by line during the execution of [docker.build]

I debugged again and realised I should change the current directory back to the parent of env at the end of the loop; that solved the issue.

However, it now seems Docker is not able to create images with * in the tag name. For now I have replaced * with all just to get it running. The images look like this on my local machine, where main corresponds to the generic requirements (I will drop it if not required):

REPOSITORY                 TAG            IMAGE ID       CREATED              SIZE
eda-clean-all              latest         b6a61d0cef04   About a minute ago   1.38GB
eda-main                   latest         7e2622bf9eee   3 minutes ago        1.38GB

Contents of eda project before running soopervisor export argo --skip-tests --ignore-git:

eda % ls
README.ipynb            argo                environment.yml         requirements.clean-*.txt    scripts
README.md           argo.yaml           pipeline.yaml           requirements.lock.txt       soopervisor.yaml
_source.md          dist                requirements.clean-*.lock.txt   requirements.txt

Contents of each image:

 docker run -it eda-main sh                                     
# ls
README.ipynb  README.md  _source.md  argo  argo.yaml  eda.tar.gz  environment.yml  pipeline.yaml  requirements.lock.txt  scripts  soopervisor.yaml
# cat requirements.lock.txt
pandas
pandas-profiling
ploomber# 
docker run -it eda-clean-all sh                                
# ls
README.ipynb  README.md  _source.md  argo  argo.yaml  eda.tar.gz  environment.yml  pipeline.yaml  requirements.lock.txt  scripts  soopervisor.yaml
# cat requirements.lock.txt
plotly
seaborn
# 

Can we drop the * altogether and have requirements files named requirements.txt, requirements.fit.txt, and similar for the Docker image tags? In the downstream exporter we can check for all tasks that have fit in their name and use requirements.fit.txt, otherwise requirements.txt?

edublancas commented 2 years ago

Can we drop the * altogether and have requirements files named requirements.txt, requirements.fit.txt, and similar for the Docker image tags? In the downstream exporter we can check for all tasks that have fit in their name and use requirements.fit.txt, otherwise requirements.txt?

This makes sense, but we already have a feature to match resources (e.g., requesting a GPU) using wildcards (aka the * character), and I think it's better to be consistent. Also, I think the * is well established as a placeholder character. Perhaps we can substitute it when generating the docker image? Maybe with a double underscore (__)?

neelasha23 commented 2 years ago

Okay. So the requirements files will still be named requirements.fit-*.txt etc., but when generating the image the name would be eda-fit-__ (just an example). And in the exporter we will match tasks (e.g., fit-0, fit-1) to images which have fit-__ in their tag?

edublancas commented 2 years ago

correct!

neelasha23 commented 2 years ago

I get the same error with __ as well: `invalid argument "eda-clean-:latest" for "-t, --tag" flag: invalid reference format`

Issues when executing soopervisor export training --skip-tests --ignore-git from this tutorial https://soopervisor.readthedocs.io/en/latest/tutorials/aws-batch.html:

  1. No environment to activate with conda activate ml-online after running ploomber install.

  2. failed to compute cache key: "/environment.lock.yml" not found. The issue got resolved when I manually placed the environment.lock.yml file in the training folder. Not sure if this is an issue or whether I missed an intermediate step.

  3. The statement RUN mamba env update --name base --file project/environment.lock.yml && conda clean --all --force-pkgs-dir --yes in the Dockerfile is failing with multiple dependency-resolution issues. I'm trying to remove specific versions or add explicit conda install <package> statements in the Dockerfile wherever possible. The issues seem to be happening because my environment already had too many dependencies beforehand, since I was trying out other examples. Users might face similar issues?

  4. This particular package contains a setup.py file, which means it goes into this block here:

https://github.com/ploomber/soopervisor/blob/3ee5dec738c576652b0840704d6f1e37a6678fb2/src/soopervisor/commons/docker.py#L76-L81

I'm not sure how this piece will change for multiple requirements/environment files.

  1. Can I use this project for testing AWS Batch: https://github.com/ploomber/projects/blob/master/templates/ml-basic/pipeline.yaml, or can you suggest one that would be a good fit?
neelasha23 commented 2 years ago

Update: I have successfully been able to submit this example https://github.com/ploomber/projects/blob/master/templates/ml-basic/pipeline.yaml to AWS Batch with a single requirements.lock.txt file.

idomic commented 2 years ago

@neelasha23 so I assume all of the above were solved? Now what we're missing is having multiples?

neelasha23 commented 2 years ago

With respect to code changes, point 4 is still an open question. I haven't made any changes in this if block as of yet.

The other points are an issue for the ml-online project, but I'm working with ml-basic right now for testing end-to-end.

This one is also open (image creation fails with __):

I get the same error with __ as well: invalid argument "eda-clean-:latest" for "-t, --tag" flag: invalid reference format

idomic commented 2 years ago

For the first one, the setup is similar to the reqs.txt in essence; I don't see a point of having/supporting both. For the __, maybe we can assume we're getting a list of reqs names and we can tag the images based on those names, or even number them like eda-clean-1, 2, 3... (even though it's not really informative). It's alright to test with ml-basic for now, but once done we should figure out what's wrong with the online template (@edublancas thoughts?)

neelasha23 commented 2 years ago

Yes, I will try out the online template with a fresh environment. But since it has a setup.py file, control doesn't enter the else block where I have made the changes for generating multiple images, so I have dropped it for now.

neelasha23 commented 2 years ago

In any case, the image name gets converted to this format: 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest since we are specifying repository in soopervisor.yaml. Currently I'm maintaining a mapping like

{
 'fit-*' : <image_name>,
 'default' : <image_name>
} 

and matching the keys with the task names using regex downstream.

edublancas commented 2 years ago

hi, sorry for the delay. the notification got lost in my inbox.

Re the __ problem: Docker's documentation says this:

Name components may contain lowercase letters, digits and separators. A separator is defined as a period, one or two underscores, or one or more dashes. A name component may not start or end with a separator.

https://docs.docker.com/engine/reference/commandline/tag/#extended-description

so maybe the solution is to just append some fixed string at the end?

eda-clean-__-ploomber:latest

(I'm unsure what the best suffix is, but let's use ploomber for now)

Re setup.py: let's raise an error if we detect the user has multiple requirements.{something}.txt files and we go into that branch of the conditional. The setup.py thing is rarely used, so let's not worry about it for now and just raise a NotImplementedError.
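
A minimal sketch of that guard (how the extra requirements files are discovered, and where the check would live, are assumptions):

    from glob import glob


    def check_no_per_task_requirements():
        # per-task lock files look like requirements.fit-0.lock.txt;
        # the pattern below does not match the plain requirements.lock.txt
        extra = glob('requirements.*.lock.txt')

        if extra:
            raise NotImplementedError(
                'Multiple requirements files are not supported for '
                f'projects with a setup.py (found: {extra})')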

neelasha23 commented 2 years ago

Updates:

  1. Replaced * with ploomber in the image tags.
  2. Created a requirements.fit-*.lock.txt file in the ml-basic project for the task corresponding to fit.py and named the task fit-0 in pipeline.yaml.
  3. On submitting the pipeline I see the following jobs in the AWS Console: three of the jobs use the image with tag latest-default and the fit-0 task uses the image with tag latest-fit-ploomber.
    
    Name : get
    Status reason : Essential container in task exited
    Status : FAILED
    Job id : b8d2c987-05d3-4240-bbbb-e1b7d455270c
    Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default

    Name : features
    Status reason : Dependent Job failed
    Status : FAILED
    Job id : c1e5c217-e963-40e5-9e2d-7d33aec1e8df
    Depends on : b8d2c987-05d3-4240-bbbb-e1b7d455270c
    Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default

    Name : join
    Status reason : Dependent Job failed
    Status : FAILED
    Job id : 9eec6bd9-2f23-4609-8ec9-20ed2867485c
    Depends on : c1e5c217-e963-40e5-9e2d-7d33aec1e8df, b8d2c987-05d3-4240-bbbb-e1b7d455270c
    Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default

    Name : fit-0
    Status reason : Dependent Job failed
    Status : FAILED
    Job id : 016252a4-6b63-4e43-a04b-985eae64e497
    Depends on : 9eec6bd9-2f23-4609-8ec9-20ed2867485c
    Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-fit-ploomber


I see the following log for the first job:

2022-06-09T06:15:42.213Z standard_init_linux.go:228: exec user process caused: exec format error @ingestionTime 1654755342270 @log 888374473732:/aws/batch/job @logStream ml-basic/default/93606e056f6d48cca7ddbd443cc411b9 @message standard_init_linux.go:228: exec user process caused: exec format error @timestamp 1654755342213



This is probably happening because the images were built on Apple M1. I'm trying the solutions and will post an update once it works.