skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Azure blob] Storage account already exists in a different subscription #3783

Closed romilbhardwaj closed 2 weeks ago

romilbhardwaj commented 1 month ago

I recently switched Azure subscriptions.

Tried launching a simple YAML in the new subscription:

file_mounts:
  /outputs:
    name: romil-az-test
    store: azure
    source: /Users/romilb/tmp-workdir/
    mode: MOUNT

But failed with:

(base) ➜  sky-experiments git:(fix_test_docker_storage) ✗  sky launch -c test task.yaml 
Task from YAML spec: task.yaml
I 07-24 17:35:25 storage.py:2260] Created Azure resource group 'sky2ea485ea'.
E 07-24 17:35:33 storage.py:901] Could not create StoreType.AZURE store with name romil-az-test.
Traceback (most recent call last):
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 2265, in _get_storage_account_and_resource_group
    self.storage_client.storage_accounts.get_properties(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/mgmt/storage/v2022_09_01/operations/_storage_accounts_operations.py", line 1071, in get_properties
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/core/exceptions.py", line 164, in map_error
    raise error
azure.core.exceptions.ResourceNotFoundError: (ResourceNotFound) The Resource 'Microsoft.Storage/storageAccounts/skyeastus2ea485ea' under resource group 'sky2ea485ea' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
Code: ResourceNotFound
Message: The Resource 'Microsoft.Storage/storageAccounts/skyeastus2ea485ea' under resource group 'sky2ea485ea' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 2296, in _create_storage_account
    self.storage_client.storage_accounts.begin_create(
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/mgmt/storage/v2022_09_01/operations/_storage_accounts_operations.py", line 913, in begin_create
    raw_result = self._create_initial(  # type: ignore
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/mgmt/storage/v2022_09_01/operations/_storage_accounts_operations.py", line 768, in _create_initial
    map_error(status_code=response.status_code, response=response, error_map=error_map)
  File "/Users/romilb/tools/anaconda3/lib/python3.9/site-packages/azure/core/exceptions.py", line 164, in map_error
    raise error
azure.core.exceptions.ResourceExistsError: (StorageAccountAlreadyTaken) The storage account named skyeastus2ea485ea is already taken.
Code: StorageAccountAlreadyTaken
Message: The storage account named skyeastus2ea485ea is already taken.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 893, in add_store
    store = store_cls(
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 1993, in __init__
    super().__init__(name, source, region, is_sky_managed,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 261, in __init__
    self.initialize()
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 2142, in initialize
    self._get_storage_account_and_resource_group())
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 2271, in _get_storage_account_and_resource_group
    self._create_storage_account(resource_group_name,
  File "/Users/romilb/Romil/Berkeley/Research/sky-experiments/sky/data/storage.py", line 2315, in _create_storage_account
    raise exceptions.StorageBucketCreateError(
sky.exceptions.StorageBucketCreateError: Failed to create storage account 'skyeastus2ea485ea'. You may be attempting to create a storage account already being in use. Details: [azure.core.exceptions.ResourceExistsError] (StorageAccountAlreadyTaken) The storage account named skyeastus2ea485ea is already taken.
Code: StorageAccountAlreadyTaken
Message: The storage account named skyeastus2ea485ea is already taken.

We should have some kind of fallback here: if the storage account name is already taken, should we add a random suffix and retry?
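A rough sketch of what such a fallback could look like, not the actual SkyPilot code. `NameTakenError`, `with_random_suffix`, and `create_with_fallback` are hypothetical names made up for illustration; in the real code path the exception would be `azure.core.exceptions.ResourceExistsError`, and `create` would wrap the `begin_create` call from the traceback. It assumes Azure's naming rule for storage accounts (3 to 24 characters, lowercase letters and digits only).

```python
import secrets
import string

# Azure storage account names: 3-24 chars, lowercase letters and digits only.
_ALPHABET = string.ascii_lowercase + string.digits
MAX_NAME_LEN = 24


class NameTakenError(Exception):
    """Hypothetical stand-in for azure.core.exceptions.ResourceExistsError."""


def with_random_suffix(base: str, suffix_len: int = 4) -> str:
    """Append a random suffix, trimming the base so the result fits the limit."""
    suffix = "".join(secrets.choice(_ALPHABET) for _ in range(suffix_len))
    return base[:MAX_NAME_LEN - suffix_len] + suffix


def create_with_fallback(base: str, create, max_retries: int = 3) -> str:
    """Try `base` first; on a name clash, retry with a fresh random suffix.

    `create` is any callable that takes a name and raises NameTakenError
    when that name is already in use. Returns the name that succeeded.
    """
    name = base
    for attempt in range(max_retries + 1):
        try:
            create(name)
            return name
        except NameTakenError:
            if attempt == max_retries:
                raise
            name = with_random_suffix(base)
    raise AssertionError("unreachable")
```

One caveat with this approach: the chosen name is no longer derivable from the user hash alone, so it would have to be stored somewhere for later runs to find the same account.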

romilbhardwaj commented 1 month ago

For now, working around this by adding this to ~/.sky/config.yaml:

azure:
  storage_account: <custom storage account I created on azure portal>

romilbhardwaj commented 1 month ago

Another solution is to use the subscription id in the hash for the storage account name.
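A minimal sketch of that idea, assuming the name is built as `sky` + region + an 8-character hash (this mirrors the shape of `skyeastus2ea485ea` in the log above, but the exact scheme SkyPilot uses is an assumption here). `storage_account_name` and its parameters are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib

MAX_NAME_LEN = 24  # Azure limit for storage account names
HASH_LEN = 8
PREFIX = "sky"


def storage_account_name(region: str, user_hash: str,
                         subscription_id: str) -> str:
    """Derive a deterministic account name that differs per subscription.

    Hypothetical helper: mixes the subscription ID into the hashed part so
    the same user gets a different name after switching subscriptions.
    """
    digest = hashlib.sha256(
        f"{user_hash}-{subscription_id}".encode("utf-8")).hexdigest()
    # Trim the region, not the hash, if the name would exceed the limit.
    max_region_len = MAX_NAME_LEN - len(PREFIX) - HASH_LEN
    return f"{PREFIX}{region[:max_region_len]}{digest[:HASH_LEN]}"
```

Unlike a random suffix, this keeps the name reproducible from inputs that are already known at launch time, so nothing extra needs to be stored.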