microsoftarchive / BatchAI

Repo for publishing code Samples and CLI samples for BatchAI service
MIT License

Create cluster is not working #20

Status: Closed (l-Leniac-l closed 6 years ago)

l-Leniac-l commented 6 years ago

Hi there. I'm trying to create a Batch AI cluster using Python, but it doesn't work. Part of my code:

import azure.mgmt.batchai.models as bai_models
from azure.mgmt.batchai import BatchAIManagementClient
from azure.storage.file import FileService, FilePermissions
from azure.common.credentials import ServicePrincipalCredentials

import config

SHARED_STORAGE_NAME = 'captationbatchai'

CLUSTER_NAME = 'captationcluster'

def print_cluster_status(cluster):
    print(
        'Cluster state: {0} Target: {1}; Allocated: {2}; Idle: {3}; '
        'Unusable: {4}; Running: {5}; Preparing: {6}; Leaving: {7}'.format(
            cluster.allocation_state,
            cluster.scale_settings.manual.target_node_count,
            cluster.current_node_count,
            cluster.node_state_counts.idle_node_count,
            cluster.node_state_counts.unusable_node_count,
            cluster.node_state_counts.running_node_count,
            cluster.node_state_counts.preparing_node_count,
            cluster.node_state_counts.leaving_node_count))
    if not cluster.errors:
        return
    for error in cluster.errors:
        print('Cluster error: {0}: {1}'.format(error.code, error.message))
        if error.details:
            print('Details:')
            for detail in error.details:
                print('{0}: {1}'.format(detail.name, detail.value))

def create_batchai_client():
    client = BatchAIManagementClient(credentials=ServicePrincipalCredentials(client_id=config.azure_application['client_id'],
                                                                             secret=config.azure_application['secret'],
                                                                             tenant=config.azure_application['tenant_id']),
                                     subscription_id=config.azure_application['subscription_id'])

    return client

def create_file_share():
    service = FileService(config.azure_storage['account_name'],
                          config.azure_storage['account_key'])
    service.create_share(SHARED_STORAGE_NAME, fail_on_exist=False)

def create_cluster(node_count):
    volumes = bai_models.MountVolumes(azure_file_shares=[
        bai_models.AzureFileShareReference(account_name=config.azure_storage['account_name'],
                                           credentials=bai_models.AzureStorageCredentialsInfo(account_key=config.azure_storage['account_key']),
                                           azure_file_url='https://{0}.file.core.windows.net/{1}'.format(config.azure_storage['account_name'],
                                                                                                         SHARED_STORAGE_NAME),
                                           relative_mount_path='external')
    ])

    parameters = bai_models.ClusterCreateParameters(
        location='eastus',
        vm_size='STANDARD_NC6',
        scale_settings=bai_models.ScaleSettings(manual=bai_models.ManualScaleSettings(target_node_count=node_count)),
        node_setup=bai_models.NodeSetup(mount_volumes=volumes),
        user_account_settings=bai_models.UserAccountSettings(admin_user_name='kroton',
                                                             admin_user_password='123456##')
    )

    client = create_batchai_client()

    client.clusters.create(config.azure_data_factory['resource_group'],
                           CLUSTER_NAME,
                           parameters)

def cluster_status():
    print('Status------------------')
    client = create_batchai_client()
    cluster = client.clusters.get(config.azure_data_factory['resource_group'],
                                  CLUSTER_NAME)
    print_cluster_status(cluster)

When I try to print the cluster status, I get this result:

Traceback (most recent call last):
  File "/home/lenilson/Documentos/git/1sti/kroton/k360/kroton.k360.datafactory/ai/captation/batchai.py", line 75, in cluster_status
    CLUSTER_NAME)
  File "/home/lenilson/anaconda3/envs/k360-datafactory/lib/python3.6/site-packages/azure/mgmt/batchai/operations/clusters_operations.py", line 355, in get
    raise exp
msrestazure.azure_exceptions.CloudError: Azure Error: ClusterNotFound
Message: The specified cluster captationcluster is not found

But when I executed the create function, no errors appeared.

AlexanderYukhanov commented 6 years ago

Hello, thank you for asking this question! client.clusters.create is a long-running operation: it returns before the cluster has been created or has failed to create. You need to change your code to client.clusters.create(config.azure_data_factory['resource_group'], CLUSTER_NAME, parameters).result() to be sure the cluster has been created before you request its status, and to be sure you receive the error response from the server (if any). I will update our recipes so that nobody else faces the same confusion.
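For illustration, the change can be sketched as a small helper. This is only a sketch: the function name create_cluster_and_wait is hypothetical, and the client is passed in as an argument so the snippet does not depend on any particular SDK version being installed. It assumes only what the comment above says, namely that clusters.create returns an object whose result() method blocks until the operation finishes.

```python
def create_cluster_and_wait(client, resource_group, cluster_name, parameters):
    """Start cluster creation and block until the service reports an outcome.

    `client` is assumed to be a BatchAIManagementClient (or any object with
    the same `clusters.create` method). The name of this helper is
    hypothetical and not part of the SDK.
    """
    # create() only starts the long-running operation and returns a poller.
    poller = client.clusters.create(resource_group, cluster_name, parameters)
    # result() waits for completion; a server-side failure is raised here
    # rather than being silently dropped.
    return poller.result()
```

With the code from the question, the call would look like create_cluster_and_wait(create_batchai_client(), config.azure_data_factory['resource_group'], CLUSTER_NAME, parameters), and cluster_status() would be called only after it returns.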

l-Leniac-l commented 6 years ago

Thanks for helping, @AlexanderYukhanov. I tried your suggestion and didn't get any result after almost one hour of running the script. I'll try to use Azure Batch instead. I need to run a pre-trained model, and I think I will be able to do that on Azure Batch.

AlexanderYukhanov commented 6 years ago

Do you mean client.clusters.create(...).result() has not finished?

l-Leniac-l commented 6 years ago

Nope, it did not finish. After almost one hour I stopped the command. It was running "forever".
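A wait like this can be bounded so that the script fails with a clear error instead of hanging. The sketch below is an assumption-laden example, not the SDK's documented recipe: the helper name wait_for_operation is hypothetical, and it assumes the poller returned by clusters.create exposes wait(timeout=...), done() and result(), as the pollers in msrestazure do.

```python
def wait_for_operation(poller, timeout_seconds):
    """Wait up to `timeout_seconds` for a long-running operation.

    Assumes `poller` has wait(timeout=...), done() and result() methods
    (an assumption about the object returned by clusters.create).
    The helper name is hypothetical.
    """
    # Block for at most the given number of seconds.
    poller.wait(timeout=timeout_seconds)
    if not poller.done():
        # The operation is still in progress; give up instead of hanging.
        raise TimeoutError(
            'Operation did not finish within {0} seconds'.format(timeout_seconds))
    # The operation has finished, so result() returns (or raises) promptly.
    return poller.result()
```

For example, wait_for_operation(client.clusters.create(resource_group, CLUSTER_NAME, parameters), 1800) would stop waiting after 30 minutes; the 1800-second limit is an arbitrary value chosen for illustration.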

AlexanderYukhanov commented 6 years ago

Thank you so much for reporting this issue. Your batch account provisioning did not complete successfully, and that's why you are observing this issue. We will investigate and resolve the problem.

rohitmsft commented 6 years ago

We have resolved the issue. We can see that your cluster was created successfully and that you also deleted the cluster on 1/19 at 15:40 UTC.