Open aponte411 opened 1 year ago
@aponte411 Currently we expect the user to set the resource group and subscription from the Azure-CLI, like so:
az account set --subscription "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
I agree that exposing more options and expanding the AML deployment capabilities would be nice. Let me know if you have some time to help test/debug/expand these capabilities!
Hi, I resolved this. I had an issue with torch (had to pip uninstall nvidia_cublas_cu11), and I also wasn't on a GPU VM. I managed to build the folder with deploy.sh and am deploying to a managed endpoint now.
@aponte411 - did you make any progress? I'm getting the same error in a Jupyter notebook.
deepspeed==0.8.2 deepspeed-mii==0.05+unknown python==3.8.0 Ubuntu==20.04.1
@buswrecker I can run DeepSpeed-MII from the GPU VM, but I still can't deploy. I get the same error:
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.
I also could not get this working after following the instructions in the README. The only way I could use AML was to override the get_acr_name() function so that it returns my ACR name instead of calling the az CLI command. Is there a way to set a default --name argument for this command so that it returns the correct ACR name?
The command I'm talking about is:
["az", "ml", "workspace", "show", "--query", "container_registry"]
Note: I also tried adding --name myworkspacename as an argument, and it just returned ------
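One way the override described above could be made configurable is sketched below. This is not MII's actual implementation: the helper names build_show_command and parse_acr_name, and the workspace, resource_group and acr_name parameters, are hypothetical. It assumes the --name and --resource-group options of az ml workspace show, and that the container_registry value is a resource ID whose last path segment is the registry name.

```python
import subprocess
from typing import List, Optional


def build_show_command(workspace: Optional[str] = None,
                       resource_group: Optional[str] = None) -> List[str]:
    """Build the az command, adding explicit arguments only when given."""
    cmd = ["az", "ml", "workspace", "show", "--query", "container_registry"]
    if workspace:
        cmd += ["--name", workspace]
    if resource_group:
        cmd += ["--resource-group", resource_group]
    return cmd


def parse_acr_name(raw: str) -> str:
    """Assumption: az prints a quoted resource ID ending in the registry name."""
    return raw.strip().strip('"').rsplit("/", 1)[-1]


def get_acr_name(workspace: Optional[str] = None,
                 resource_group: Optional[str] = None,
                 acr_name: Optional[str] = None) -> str:
    """Return a user-supplied ACR name if present, else ask the az CLI."""
    if acr_name:
        # Explicit override: skip the CLI call entirely.
        return acr_name
    raw = subprocess.check_output(
        build_show_command(workspace, resource_group), text=True)
    return parse_acr_name(raw)
```

With this shape, a caller who already knows the registry can pass acr_name directly, while others can supply the workspace and resource group instead of relying on CLI defaults.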
I'm facing the same problem on a GPU VM.
Maybe adding "shell=True" will resolve this problem?
acr_name = subprocess.check_output(
    ["az", "ml", "workspace", "show", "--query", "container_registry"],
    text=True, shell=True)
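A note on the shell=True idea, offered as a hedged sketch: on POSIX systems, when shell=True is combined with a list, Python passes only the first item as the command and treats the remaining items as arguments to the shell itself, so the snippet above would likely run a bare az. On Windows, where az is typically installed as a .cmd wrapper (an assumption about the local install), resolving the executable with shutil.which is one alternative that avoids the shell. The helper name run_cli is hypothetical.

```python
import shutil
import subprocess
from typing import List


def run_cli(args: List[str]) -> str:
    """Run a command without a shell, resolving the executable on PATH first."""
    exe = shutil.which(args[0])
    if exe is None:
        raise FileNotFoundError(f"{args[0]!r} was not found on PATH")
    # Pass the resolved path plus the remaining arguments as a list.
    return subprocess.check_output([exe, *args[1:]], text=True)


# Intended use (requires the Azure CLI and its ml extension to be installed):
# run_cli(["az", "ml", "workspace", "show", "--query", "container_registry"])
```

This does not address the non-zero exit status itself; if az still cannot determine the workspace, the call fails in the same way.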
When trying to run the AML example, e.g. bloom aml, it tries to run get_acr_name() but fails because it's missing the resource group name argument. Is there a way to pass in user arguments such as the resource group, subscription, etc.? It would also be nice to expose more arguments for the AML online endpoints, such as auth_mode; e.g. we aren't allowed to use keys, only aml_tokens, in production environments. I can also imagine other deployment attributes/arguments being useful, such as instance_count or type.
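To make the request concrete, here is a hypothetical sketch of what user-supplied deployment options could look like. None of these names (AMLDeploymentOptions, az_args, endpoint_settings, deployment_settings) exist in MII, and how MII would consume them is an assumption; the sketch only validates the values and shows where each one might go.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Assumption: only the two auth modes mentioned in this request are accepted.
VALID_AUTH_MODES = ("key", "aml_token")


@dataclass
class AMLDeploymentOptions:
    """Hypothetical container for user-supplied AML deployment arguments."""
    resource_group: Optional[str] = None
    subscription: Optional[str] = None
    auth_mode: str = "key"
    instance_count: int = 1
    instance_type: Optional[str] = None

    def __post_init__(self) -> None:
        if self.auth_mode not in VALID_AUTH_MODES:
            raise ValueError(f"auth_mode must be one of {VALID_AUTH_MODES}")
        if self.instance_count < 1:
            raise ValueError("instance_count must be at least 1")

    def az_args(self) -> List[str]:
        """Extra az CLI arguments; empty when nothing was set."""
        args: List[str] = []
        if self.resource_group:
            args += ["--resource-group", self.resource_group]
        if self.subscription:
            args += ["--subscription", self.subscription]
        return args

    def endpoint_settings(self) -> Dict[str, str]:
        """Values that would belong in the endpoint definition."""
        return {"auth_mode": self.auth_mode}

    def deployment_settings(self) -> Dict[str, object]:
        """Values that would belong in the deployment definition."""
        settings: Dict[str, object] = {"instance_count": self.instance_count}
        if self.instance_type:
            settings["instance_type"] = self.instance_type
        return settings
```

For example, AMLDeploymentOptions(resource_group="my-rg", auth_mode="aml_token", instance_count=2) would cover the production case described above, where key-based auth is not allowed.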
Setup: deepspeed==0.7.6 deepspeed-mii==0.0.4 py3.9.0 Ubuntu 20.04.4 LTS (Focal Fossa)