mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0

[BUG] mlflow build-and-push-container doesn't work on sagemaker #8703

Open mohitanchlia opened 1 year ago

mohitanchlia commented 1 year ago

Issues Policy acknowledgement

Willingness to contribute

Yes. I would be willing to contribute a fix for this bug with guidance from the MLflow community.

MLflow version

System information

Describe the problem

[BUG] mlflow build-and-push-container doesn't work on SageMaker. The command completes successfully and it registers the image on the container. However, when I run the deployment, it fails to bring up the SageMaker instance. In the SageMaker container log I see the following error:

CloudWatch logs from SageMaker:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 57, in _init
    _serve(env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 84, in _serve
    _serve_pyfunc(m, env_manager)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 156, in _serve_pyfunc
    _install_pyfunc_deps(
  File "/usr/local/lib/python3.8/dist-packages/mlflow/models/container/__init__.py", line 120, in _install_pyfunc_deps
    env_activate_cmd = _get_or_create_virtualenv(model_path)
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 364, in _get_or_create_virtualenv
    python_bin_path = _install_python(
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/virtualenv.py", line 133, in _install_python
    _exec_cmd(
  File "/usr/local/lib/python3.8/dist-packages/mlflow/utils/process.py", line 117, in _exec_cmd
    raise ShellCommandException.from_completed_process(comp_process)

When I run this image locally with just the run command and no arguments, I still get: File "<string>", line 1, in <module>

I am stuck at this point because I have no visibility into what is actually failing.
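The traceback ends with `_exec_cmd` raising `ShellCommandException.from_completed_process`, which suggests that a shell command launched from `_install_python` exited with a non-zero status; the traceback itself does not include that command's output. As a minimal sketch (not MLflow's implementation), a suspect command could be rerun inside the container with its exit code and stderr captured. The helper name `run_and_report` is hypothetical, and the failing command below is only a stand-in, since the real command is not shown in the logs.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd (a list of arguments) and return (exit code, stdout, stderr).

    check=False keeps a non-zero exit from raising, so the captured
    output can be inspected instead of being lost behind an exception.
    """
    completed = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return completed.returncode, completed.stdout, completed.stderr


if __name__ == "__main__":
    # Stand-in for the failing command (assumption): a child Python
    # process that writes to stderr and exits with status 3.
    failing_cmd = [
        sys.executable,
        "-c",
        "import sys; sys.stderr.write('install failed'); sys.exit(3)",
    ]
    code, out, err = run_and_report(failing_cmd)
    print(f"exit code: {code}")
    print(f"stderr: {err}")
```

Replacing the stand-in with the actual command, once it is known, would show the underlying error message rather than only the Python traceback.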


What component(s) does this bug affect?

What interface(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

BenWilson2 commented 1 year ago

Can you share a repro?

BenWilson2 commented 1 year ago

Can you try this with MLflow 1.30.x and see if this issue still presents itself? Thank you!

mlflow-automation commented 1 year ago

@BenWilson2 @dbczumar @harupy @WeichenXu123 Please assign a maintainer and start triaging this issue.

mohitanchlia commented 1 year ago

> Can you try this with MLflow 1.30.x and see if this issue still presents itself? Thank you!

I believe I am using MLflow 1.32.

mohitanchlia commented 1 year ago

> Can you share a repro?

https://github.com/aws-samples/amazon-sagemaker-mlflow-fargate/tree/main

pdifranc commented 1 year ago

> > Can you share a repro?
>
> https://github.com/aws-samples/amazon-sagemaker-mlflow-fargate/tree/main

This repo deploys MLflow 2.0.1. There has been a major change in the MLflow SDK in how deployment to SageMaker works. Can you please make sure your MLflow SDK and your MLflow tracking server are on the same version?
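One way to check the client side of that comparison is to read the installed package version from the environment that runs the deployment. The sketch below uses only the standard library; `installed_version` and `same_release` are hypothetical helper names, and the server version is assumed to be known separately (2.0.1 here, taken from the repo mentioned above).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def same_release(client_version, server_version):
    """Return True when both version strings are identical."""
    return client_version.strip() == server_version.strip()


if __name__ == "__main__":
    expected_server = "2.0.1"  # assumption: the version the repo deploys
    client = installed_version("mlflow")
    if client is None:
        print("mlflow is not installed in this environment")
    elif same_release(client, expected_server):
        print(f"client {client} matches server {expected_server}")
    else:
        print(f"mismatch: client {client}, server {expected_server}")
```

Running the same check inside the built container image would show whether the image and the client that built it agree.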

mohitanchlia commented 1 year ago

> > > Can you share a repro?
>
> > https://github.com/aws-samples/amazon-sagemaker-mlflow-fargate/tree/main
>
> This repo deploys MLflow 2.0.1. There has been a major change in the MLflow SDK in how deployment to SageMaker works. Can you please make sure your MLflow SDK and your MLflow tracking server are on the same version?

I verified that I am using the 2.0.1 client.