microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
Apache License 2.0
1.76k stars 163 forks source link

AML deployment error due to missing az cli arguments #115

Open aponte411 opened 1 year ago

aponte411 commented 1 year ago

When trying to run the aml example, e.g. bloom aml, it tries to run get_acr_name() but fails because its missing the resource group name argument. Is there be a way to pass in user arguments such as the resource group, subscription, etc? It would also be nice to expose more arguments for the aml online endpoints such as the auth_mode, e.g. we arent allowed to use keys, only aml_tokens in production environments. But I can also imagine other deployment attributes/arguments being useful as well such as instance_count or type.

[2022-12-08 10:53:37,253] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
ERROR: the following arguments are required: --resource-group/-g, --name/-n

Examples from AI knowledge base:
https://aka.ms/cli_ref
Read more about the command in reference docs

 ------------------------------ 

Unable to obtain ACR name from Azure-CLI. Please verify that you:
        - Have Azure-CLI installed (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
        - Are logged in to an active account on Azure-CLI ($az login)
        - Have Azure-CLI ML plugin installed ($az extension add --name ml)

 ------------------------------ 

Traceback (most recent call last):
  File "/mnt/c/Users/davidaponte/Documents/CS677-DeepLearning/deeplearning/deeplearning/deep_learning/text_to_image/deepspeed_mii/bloom560m-aml.py", line 7, in <module>
    mii.deploy(task='text-generation',
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 112, in deploy
    _deploy_aml(deployment_name=deployment_name, model_name=model, version=version)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 124, in _deploy_aml
    acr_name = mii.aml_related.utils.get_acr_name()
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 31, in get_acr_name
    raise (e)
  File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 13, in get_acr_name
    acr_name = subprocess.check_output(
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.

Setup: deepspeed==0.7.6 deepspeed-mii==0.0.4 py3.9.0 Ubuntu 20.04.4 LTS (Focal Fossa)

mrwyattii commented 1 year ago

@aponte411 currently we expect the user to set resource group and subscription from the Azure-CLI like so: az account set --subscription "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"

I agree that exposing more options and expanding the AML deployment capabilities would be nice. Let me know if you have some time to help test/debug/expand these capabilities!

buswrecker commented 1 year ago

Hi, i resolved this - i had an issue with torch (had to pip uninstall nvidia_cublas_cu11) and also i wasn't on a GPU VM. Managed to build the folder with deploy.sh and deploying to a managed endpoint now

@aponte411 - did you make any progress? i'm getting the same error in Jupyter notebook.

deepspeed==0.8.2 deepspeed-mii==0.05+unknown python==3.8.0 Ubuntu==20.04.1

TahaBinhuraib commented 1 year ago

@buswrecker I can run deepspeed mii from the gpu vm but I still can't deploy, I get the same error: subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.

jakemanger commented 1 year ago

I also could not get this working after following instructions in the readme. The only way I could use aml is after I overrode the get_acr_name() function to return my acr name instead of calling the az cli command. Is there a way to set a default --name argument for this command so this can be fixed and it returns the correct acr name?

The command I'm talking about is:

["az",
  "ml",
  "workspace",
   "show",
   "--query",
   "container_registry"],

Note, I also tried putting --name myworkspacename in as an argument and it just returned ------

ShuntaIto commented 8 months ago

I'm facing same problem, on GPU VM.

Maybe, adding "shell=True" will resolve this problem?

acr_name = subprocess.check_output(
            ["az",
             "ml",
             "workspace",
             "show",
             "--query",
             "container_registry"],
            text=True, shell=True)