Closed edublancas closed 2 years ago
To clarify: the requirements.txt referred to here is the one inside the projects folder that the user has created? E.g., for the templates/ml-intermediate project it would be https://github.com/ploomber/projects/blob/master/templates/ml-intermediate/requirements.txt ?
I'm trying out the tutorial here https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html for understanding the soopervisor workflow and have the below issues:
I see that in https://github.com/ploomber/soopervisor/blob/master/tutorials/airflow/Dockerfile we are already downloading the ml-intermediate example and installing the requirements.txt, as well as this step: COPY soopervisor-airflow.yaml soopervisor-airflow.yaml. But the same steps are repeated in https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#get-sample-ploomber-pipeline and https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#configure-target-platform. Why is that so?
I'm stuck at the pip install -r requirements.txt step in https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html#get-sample-ploomber-pipeline. There seem to be a lot of dependency incompatibility issues:
flask 1.1.4 requires click<8.0,>=5.1, but you have click 8.1.3 which is incompatible.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.0.3 which is incompatible.
Successfully installed click-8.1.3 jinja2-3.0.3
If I manually install the latest Flask version I get this error:
nbconvert 6.5.0 requires jinja2>=3.0, but you have jinja2 2.11.3 which is incompatible.
black 22.3.0 requires click>=8.0.0, but you have click 7.1.2 which is incompatible.
To clarify: the requirements.txt referred to here is the one inside the projects folder
yes, the typical structure is:
project-name/
pipeline.yaml
requirements.txt
...
I'm trying out the tutorial here https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html for
Use the argo exporter; airflow is a bit harder to get working because airflow has too many dependencies and it tends to break the environment (as you saw). You don't need to have kubernetes running. You can follow these quick instructions: https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html
but if you want to see the full end to end workflow working (you probably need this once you start making code changes), check this one out: https://soopervisor.readthedocs.io/en/latest/tutorials/kubernetes.html
let me know if you have questions about the tutorial or the code in general
Are we looking for requirements.*.txt files or requirements.*.lock.txt files? Is the user expected to provide all the necessary requirements.*.lock.txt files or environment.*.lock.yml files (assuming that the environment.*.lock.yml naming convention will be similar to requirements.*.lock.txt)?
Is it possible to have no requirements.txt file and only requirements.*.txt files, e.g., requirements.load.txt, requirements.clean.txt, requirements.plot.txt?
good question. Currently, we require requirements.lock.txt, so let's stick with requirements.*.lock.txt for now.
yes, the convention will be environment.*.lock.yml
- Is the user expected to provide all the necessary requirements.*.lock.txt files or environment.*.lock.yml files
no, they should be optional. For example, if the user has a task "fit" in their pipeline but there is no requirements.fit.lock.txt, then we should use requirements.lock.txt.
2. Is it possible to have no requirements.txt file and only requirements.*.txt files, e.g., requirements.load.txt, requirements.clean.txt, requirements.plot.txt ?
this is related to the previous question. the only mandatory one is requirements.lock.txt
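A minimal sketch of this fallback rule (the function name resolve_requirements is made up for illustration; this is not soopervisor's actual API):

```python
from fnmatch import fnmatch

def resolve_requirements(task_name, lock_files):
    """Pick the lock file for a task: a matching requirements.<pattern>.lock.txt
    if one exists, otherwise fall back to the mandatory requirements.lock.txt."""
    for f in lock_files:
        if f == 'requirements.lock.txt':
            continue
        # extract the pattern between 'requirements.' and '.lock.txt'
        pattern = f[len('requirements.'):-len('.lock.txt')]
        if fnmatch(task_name, pattern):
            return f
    return 'requirements.lock.txt'
```

So a "fit" task with no requirements.fit.lock.txt present would resolve to the default requirements.lock.txt.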
On following the example here : https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html , these are my observations and high-level design. @edublancas Please provide your inputs
The change will affect the following exporters: argo, airflow, aws/batch, kubeflow.
Once the backend is added, the _add method of the respective exporter is called, in which the Dockerfile is created. In the case of Argo it would be https://github.com/neelasha23/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/argo/export.py#L32 No code changes are required here in any of the exporters since we need to reuse the same Dockerfile.
Changes for step 1 (generating images) to be done in https://github.com/ploomber/soopervisor/blob/feee82d168f0527b198db746acc1411243f0f3f6/src/soopervisor/commons/docker.py#L30 a. Here we need to add checks for requirements.*.lock.txt / environment.*.lock.yml files, e.g., if the pattern-matched files are requirements.txt, requirements.fit-0.txt and requirements.fit-1.txt, then there should be corresponding requirements.lock.txt, requirements.fit-0.lock.txt and requirements.fit-1.lock.txt. Similar validation for environment.*.lock.yml files.
Pseudocode:
all_images = []
# assuming we got this list via fnmatch/glob:
for file in ['requirements.lock.txt', 'requirements.fit-0.lock.txt']:  # or environment lock files
    # create a fresh dist folder for each requirements file
    e.rm('dist')
    e.cp(file)
    ...
    # task would be fit-0 / fit-1
    image_local = f'{pkg_name}-{task}:{version}'
    ...
    all_images.append(image_target)
    ...
return pkg_name, all_images
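The check in (a) might look something like this (a sketch under the naming convention above; validate_lock_files is a hypothetical helper, not existing code):

```python
def validate_lock_files(files):
    """Ensure every requirements.<pattern>.txt has a matching
    requirements.<pattern>.lock.txt (raise otherwise)."""
    reqs = {f for f in files
            if f.startswith('requirements.') and f.endswith('.txt')
            and not f.endswith('.lock.txt')}
    missing = []
    for f in reqs:
        expected = f[:-len('.txt')] + '.lock.txt'
        if expected not in files:
            missing.append(expected)
    if missing:
        raise FileNotFoundError(f'Missing lock files: {sorted(missing)}')
```

The same shape would apply for environment.*.yml / environment.*.lock.yml.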
Some doubts here:
The code here https://github.com/neelasha23/soopervisor/blob/b230a5bb415dd4e9b05e1448b21b17c2150dc75c/src/soopervisor/commons/docker.py#L87 is copying all the files in the project folder. We need to copy only the relevant requirements.*.lock.txt file (excluding the remaining requirements files via the exclude parameter?). What about the other files? Do we need to copy only the relevant task files, or will every image contain all the project files?
I have tried with repository = null in soopervisor.yaml file. Can this be replaced by a personal private repository in DockerHub?
Do you think I need to explore Docker-compose for this task?
Let's say task1's requirements.txt is in image1, and task2 (whose upstream task is task1) is in image2 with a different requirements file. What if task1 generates an output file that is required by task2? Will having different images not cause an issue here when the export happens?
So I have made some changes in docker.py here (looping over all the requirements.*.lock.txt files): https://github.com/neelasha23/soopervisor/blob/issue_77/src/soopervisor/commons/docker.py
Now the issue is: it runs fine for the first requirements file, but for the subsequent ones it throws an error
FAILED tests/test_commons.py::test_docker_build_multiple_requirement - FileNotFoundError: [Errno 2] No such file or directory: 'dist'
On debugging I realized that at this point here: https://github.com/ploomber/ploomber/blob/4bf4d937e1dee8d3defe9a3e89e83720860819d2/src/ploomber/io/_commander.py#L260 for the first run the dst folder (which looks like /private/var/folders/ff/jydfr9....assets/multiple_requirements_project/some-env/dist) does not exist, hence doesn't get deleted. But for the second run it exists and gets deleted, and in turn the dist folder also gets removed from the project inside this function: https://github.com/ploomber/ploomber/blob/4bf4d937e1dee8d3defe9a3e89e83720860819d2/src/ploomber/io/_commander.py#L24
So I added an explicit delete method in ploomber/io/_commander.py like so:
def force_delete(self):
    self.rm(*self._to_delete)
and called this method here : https://github.com/neelasha23/soopervisor/blob/2bcd0099300fc67e5f5508dc9b1eb1846da0879d/src/soopervisor/commons/docker.py#L187
But this doesn't seem to solve the problem. Although the dst folder gets deleted with this change, it somehow comes back before the second run.
Any suggestions on what exactly I'm missing here?
some feedback:
task would be fit-0/ fit-1
image_local = f'{pkg_name}-{task}:{version}'
We want to generate one image per pattern, not one image per task. Think of a scenario where a pipeline has 100 tasks but only a requirements.a.txt and a requirements.b.txt; we want to generate two images, not 100.
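A toy sketch of that per-pattern rule (images_for is a hypothetical name and the tag scheme is illustrative only):

```python
def images_for(pkg_name, version, lock_files):
    """Build one image name per requirements lock file,
    regardless of how many tasks the pipeline has."""
    images = []
    for f in lock_files:
        if f == 'requirements.lock.txt':
            suffix = ''  # the default image
        else:
            # requirements.<pattern>.lock.txt -> '-<pattern>'
            suffix = '-' + f[len('requirements.'):-len('.lock.txt')]
        images.append(f'{pkg_name}{suffix}:{version}')
    return images
```

With 100 tasks but only a default, an "a", and a "b" lock file, this yields three images, not 100.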
is copying all the files in the project folder. We ne
yeah that's fine. we can copy the entire project in all images
I have tried with repository = null in soopervisor.yaml file. Can this be replaced by a personal private repository in DockerHub?
Yes
Do you think I need to explore Docker-compose for this task?
What would the benefit be?
Let's say task1's requirements.txt is in image1, and task2 (whose upstream task is task1) is in image2 with a different requirements file. What if task1 generates an output file that is required by task2? Will having different images not cause an issue here when the export happens?
that's fine. tasks can use different images, and there would be no problems when passing input/output data. But note something: after we generate the docker image (I say *the* because we only support one right now), we pass the name and tag so the next step uses it to schedule the jobs.
the implementation of that next step depends on the backend. For example, when using Argo, we need to pass the name and tag to a specific field in the argo.yml file; when using aws batch, we need to pass the name and tag to the boto3 API call that schedules the job. Given that our current priority is the AWS Batch exporter, let's first implement it there.
But this doesnt seem to solve the problem. Although the dst folder gets deleted with this change it somehow comes back before the second run. Any suggestions on what exactly I'm missing here?
I went through the code to try to find what the error is but I couldn't. My suggestion is to use pdb and go line by line during the execution of docker.build to see when dist gets created, when it gets deleted, and so on. Once you have more info, post it here so we can provide more guidance.
We want to generate one image per pattern, not one image per task. Think of a scenario where a pipeline has 100 tasks but only a requirements.a.txt and a requirements.b.txt; we want to generate two images, not 100.
Yes, I realised this once I started making the code changes after commenting, and it's currently as per the requirement above.
Do you think I need to explore Docker-compose for this task?
What would the benefit be?
When I started understanding Docker compose I realised it's useful for spinning up multiple containers corresponding to different types of applications. So I don't think it's needed for this use case.
I went through the code to try to find what the error is but I couldn't. My suggestion is to use pdb and go line by line during the execution of docker.build
I did a debug again and realised I should change the current directory back to the parent of env at the end of the loop; that solved the issue.
However, now it seems docker is not able to create images with * in the tag name. As of now I have replaced * with all just to get it running. The images look like this on my local machine, where main corresponds to the generic requirements (I will drop it if not required):
REPOSITORY TAG IMAGE ID CREATED SIZE
eda-clean-all latest b6a61d0cef04 About a minute ago 1.38GB
eda-main latest 7e2622bf9eee 3 minutes ago 1.38GB
Contents of the eda project before running soopervisor export argo --skip-tests --ignore-git:
eda % ls
README.ipynb argo environment.yml requirements.clean-*.txt scripts
README.md argo.yaml pipeline.yaml requirements.lock.txt soopervisor.yaml
_source.md dist requirements.clean-*.lock.txt requirements.txt
Contents of each image:
docker run -it eda-main sh
# ls
README.ipynb README.md _source.md argo argo.yaml eda.tar.gz environment.yml pipeline.yaml requirements.lock.txt scripts soopervisor.yaml
# cat requirements.lock.txt
pandas
pandas-profiling
ploomber
#
docker run -it eda-clean-all sh
# ls
README.ipynb README.md _source.md argo argo.yaml eda.tar.gz environment.yml pipeline.yaml requirements.lock.txt scripts soopervisor.yaml
# cat requirements.lock.txt
plotly
seaborn
#
Can we drop the * altogether and have requirements files named requirements.txt, requirements.fit.txt, and similar for the docker image tags? In the downstream exporter we can check: for all tasks which have fit in their name, use requirements.fit.txt, otherwise requirements.txt.
Can we drop the * altogether and have requirements files named as requirements.txt, requirements.fit.txt and similar for docker image tags. In the downstream exporter we can check for all tasks which have fit in their name use requirements.fit.txt otherwise requirements.txt ?
This makes sense, but we already have a wildcard feature to match resources (e.g. requesting a GPU) using wildcards (aka the * character) and I think it's better to be consistent. Also, I think the * is well established as a placeholder character. Perhaps we can substitute it when generating the docker image? Maybe with a double underscore (__).
Okay. So the requirements files will still be named requirements.fit-*.txt etc., but while generating the image it would be eda-fit-__ (just an example). And in the exporter we will match tasks (e.g., fit-0, fit-1) to images which have fit-__ in their tag?
correct!
I get the same error with __ as well: `invalid argument "eda-clean-:latest" for "-t, --tag" flag: invalid reference format`
Issues when executing soopervisor export training --skip-tests --ignore-git from this tutorial https://soopervisor.readthedocs.io/en/latest/tutorials/aws-batch.html:
There is no conda environment to activate (conda activate ml-online) after running ploomber install.
failed to compute cache key: "/environment.lock.yml" not found. The issue got resolved when I manually placed the environment.lock.yml file in the training folder. Not sure if this is an issue or whether I missed an intermediate step.
RUN mamba env update --name base --file project/environment.lock.yml && conda clean --all --force-pkgs-dir --yes
This statement in the Dockerfile is failing with multiple dependency resolution issues. I'm trying to remove specific versions or add explicit conda install <package> statements in the Dockerfile wherever possible. The issues seem to be happening because my environment already had too many dependencies beforehand, since I was trying out other examples. Might users face similar issues?
This particular package contains a setup.py file, which means it goes into this block here:
I'm not sure how this piece will change for multiple requirement/environment files.
Update: I have successfully submitted this example https://github.com/ploomber/projects/blob/master/templates/ml-basic/pipeline.yaml to AWS Batch with a single requirements.lock.txt file.
@neelasha23 so I assume all of the above were solved? Now what we're missing is having multiples?
With respect to code changes, point 4 is still an open question. I haven't made any changes in this if block as of yet.
The other points are an issue for the ml-online project, but I'm working with ml-basic right now for testing end-to-end.
This one is also open (image creation fails with __ ):
I get the same error with as well: invalid argument "eda-clean-:latest" for "-t, --tag" flag: invalid reference format
For the first one, the setup is similar to the reqs.txt in essence; I don't see a point in having/supporting both. For the __, maybe we can assume we're getting a list of reqs names and we can tag the images based on those names, or even number them like eda-clean-1, 2, 3... (even though it's not really informative). It's alright to test with ml-basic now, but once done we should figure out what's wrong with the online template (@edublancas thoughts?)
Yes, I will try out the online template with a fresh environment. But since it has a setup.py file, the control doesn't enter the else block where I have made changes for generating multiple images, so I dropped it as of now.
In any case the image name is getting converted to this format: 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest since we are specifying repository in soopervisor.yaml.
Currently I'm maintaining a mapping like
{
    'fit-*': <image_name>,
    'default': <image_name>
}
and matching the keys with the task name using regex downstream.
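A sketch of how that downstream lookup could work, using fnmatch rather than raw regex (image_for_task is a hypothetical helper, not the actual exporter code):

```python
from fnmatch import fnmatch

def image_for_task(task_name, image_map):
    """Match a task name against the wildcard keys of the mapping;
    fall back to the 'default' image when nothing matches."""
    for pattern, image in image_map.items():
        if pattern != 'default' and fnmatch(task_name, pattern):
            return image
    return image_map['default']
```

So fit-0 and fit-1 would resolve to the fit image, while every other task gets the default one.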
hi, sorry for the delay. the notification got lost in my inbox.
Re the __ problem: docker's documentation says this:
Name components may contain lowercase letters, digits and separators. A separator is defined as a period, one or two underscores, or one or more dashes. A name component may not start or end with a separator.
https://docs.docker.com/engine/reference/commandline/tag/#extended-description
so maybe the solution is to just append some fixed string at the end? eda-clean-__-ploomber:latest (I'm unsure what the best suffix is, but let's use ploomber for now)
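A possible sanitizer following that suggestion (sanitize_image_name and its suffix handling are assumptions for illustration, not soopervisor code):

```python
import re

def sanitize_image_name(name, suffix='ploomber'):
    """Replace the '*' wildcard with '__' and, if the name now ends with a
    separator (period, underscores, or dashes, which Docker forbids at the
    end of a name component), append a fixed suffix."""
    name = name.replace('*', '__')
    if re.search(r'[._-]$', name):
        name = f'{name}-{suffix}'
    return name
```

This keeps ordinary names untouched and only rewrites the ones the wildcard would make invalid.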
Re setup.py: I think let's raise an error if we detect the user has multiple requirements.{something}.txt files and we go into that branch of the conditional. The setup.py thing is rarely used, so let's not worry about it for now and raise a NotImplementedError.
Updates:
Replaced * with ploomber in the image tags.
Added a requirements.fit-*.lock.txt file in the ml-basic project for the task corresponding to fit.py and named the task fit-0 in pipeline.yaml.
The remaining tasks get tagged to the latest-default image, and the fit-0 task gets tagged to the image with tag latest-fit-ploomber.
Name : get
Status reason : Essential container in task exited
Status : FAILED
Job id : b8d2c987-05d3-4240-bbbb-e1b7d455270c
Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default
Name : features
Status reason : Dependent Job failed
Status : FAILED
Job id : c1e5c217-e963-40e5-9e2d-7d33aec1e8df
Depends on : b8d2c987-05d3-4240-bbbb-e1b7d455270c
Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default
Name : join
Status reason : Dependent Job failed
Status : FAILED
Job id : 9eec6bd9-2f23-4609-8ec9-20ed2867485c
Depends on : c1e5c217-e963-40e5-9e2d-7d33aec1e8df, b8d2c987-05d3-4240-bbbb-e1b7d455270c
Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-default
Name : fit-0
Status reason : Dependent Job failed
Status : FAILED
Job id : 016252a4-6b63-4e43-a04b-985eae64e497
Depends on : 9eec6bd9-2f23-4609-8ec9-20ed2867485c
Image : 888374473732.dkr.ecr.us-west-2.amazonaws.com/ploomber:latest-fit-ploomber
I see the following log for the first job:
2022-06-09T06:15:42.213Z standard_init_linux.go:228: exec user process caused: exec format error
(log stream ml-basic/default/93606e056f6d48cca7ddbd443cc411b9 in log group /aws/batch/job)
This is probably happening because the images were built on Apple M1. I'm trying the solutions and will post an update once it works.
(Note: this only applies to Docker-based exporters like AWS Batch and Airflow)
Currently, when generating the Docker image to run tasks, users need to list the dependencies they want to include. For example, a user may have the following:
However, in some cases, users may have a long list of dependencies. This is often the case when using many ML packages that implement different models (to find which model works best). Unfortunately, long environments are sometimes impossible to set up if two or more dependencies have conflicts, e.g. scikit-learn requires numpy==1.2 but xgboost requires numpy>1.2. If that happens, users will be unable to create a single docker image.
To fix this, we should allow them to declare multiple requirements.txt files and then generate multiple Docker images; note that this does not require multiple Dockerfiles, since we can re-use the same one with a different requirements.txt and tag the builds differently so they effectively become multiple images.
To keep things simple, we can have a naming convention. For example, if a DAG has tasks load, clean, fit-0, fit-1, and fit-2, we could do something like this:
If a user has requirements.txt and nothing else, then generate a single image.
If a user has requirements.txt and any other files that match the format requirements.*.txt: match filenames to task names, see below.
Pattern matching
(we can use fnmatch for this)
Say the user has requirements.txt and requirements.fit-*.txt:
use requirements.fit-*.txt for tasks fit-0, fit-1, and fit-2
use requirements.txt for all the other tasks
for all the other tasksOpen question: is the
*
problematic? I gave it a try and doesn't seem to be any issues on UNIX if a filename has*
on it. The other thing is to ensure that docker allows*
in image names. If not, we should use another wildcard, (perhaps double underscore?__
)Changes should be applied here
The docker.build function currently returns the image name; with this change, it should be extended to return all the generated images.
Modifying the aws batch exporter
Once images are pushed, the aws batch exporter needs to pattern match the task names to the right images.
testing
We should test that, given some combinations of requirements.txt files, the docker CLI is called with the right arguments and that the tasks are scheduled in AWS with the right arguments as well.
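A sketch of what such a test could look like (build_all is a stand-in for the real docker.build wiring; the actual test would exercise soopervisor's functions and mock the subprocess layer):

```python
from unittest import mock

def build_all(run_cmd, pkg_name, version, lock_files):
    """Hypothetical wrapper: invoke the docker CLI once per lock file."""
    for f in lock_files:
        if f == 'requirements.lock.txt':
            tag = f'{pkg_name}:{version}'
        else:
            pattern = f[len('requirements.'):-len('.lock.txt')]
            tag = f'{pkg_name}:{version}-{pattern}'
        run_cmd(['docker', 'build', '-t', tag, '.'])

def test_one_build_per_lock_file():
    # record the docker CLI invocations instead of running docker
    run_cmd = mock.Mock()
    build_all(run_cmd, 'eda', 'latest',
              ['requirements.lock.txt', 'requirements.fit.lock.txt'])
    tags = [call.args[0][3] for call in run_cmd.call_args_list]
    assert tags == ['eda:latest', 'eda:latest-fit']
```

The same mock-based approach would work for asserting the boto3 scheduling calls.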