googleapis / python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Apache License 2.0

ACCESS_TOKEN_SCOPE_INSUFFICIENT when trying to aiplatform.init() inside a custom container CustomJob #902

Closed jasonbrancazio closed 2 years ago

jasonbrancazio commented 2 years ago

I want to use metadata store experiment tracking with CustomJobs so I can log parameters and metrics.

When I run a CustomJob with a custom container in Vertex AI, I get an ACCESS_TOKEN_SCOPE_INSUFFICIENT error when I try to initialize the aiplatform SDK with aiplatform.init().

I've tried to remedy this error by passing scoped credentials to aiplatform.init(), but as you can see from the stacktrace below, it does not work.

I can successfully run aiplatform.init() and create an experiment on my laptop using ipython when not passing any credentials or passing credentials received from google.auth.default(). In this case I'm using application default credentials for my user, which is the owner of my project.

I can also run aiplatform.init() in ipython on my laptop with a service account that has only the Vertex AI Custom Code Service Agent role. This was an experiment to attempt to mirror the role granted to the AI Platform Custom Code Service Agent when Vertex AI runs a CustomJob.

If I temporarily upgrade the AI Platform Custom Code Service Agent to an owner role and run the custom container, I still get the error. The issue thus seems to relate to OAuth scoping and not role assignment.

To reproduce, I've provided a minimal example. Build this Dockerfile, push the image to Container Registry, and create a CustomJob using the web UI for Vertex Training. The failure occurs when aiplatform.init() is called. The stacktrace shows that the error arises specifically when get_or_create from metadata_store.py is called.

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-6

ENV PYTHONUNBUFFERED=1

CMD python -c "import google.auth;from google.cloud import aiplatform;creds, _ = google.auth.default(scopes=['https://www.googleapis.com/auth/cloud-platform']);aiplatform.init(experiment='test')"

Here is the stacktrace I received:

Traceback (most recent call last):

  File "/opt/conda/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 66, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/opt/conda/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.PERMISSION_DENIED
    details = "Request had insufficient authentication scopes."
    debug_error_string = "{"created":"@1639078855.874434208","description":"Error received from peer ipv4:142.250.125.95:443","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Request had insufficient authentication scopes.","grpc_status":7}"
>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/initializer.py", line 110, in init
    experiment=experiment, description=experiment_description
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/metadata/metadata.py", line 65, in set_experiment
    _MetadataStore.get_or_create()
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/metadata/metadata_store.py", line 119, in get_or_create
    credentials=credentials,
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/metadata/metadata_store.py", line 237, in _get
    credentials=credentials,
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/metadata/metadata_store.py", line 69, in __init__
    self._gca_resource = self._get_gca_resource(resource_name=metadata_store_name)
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform/base.py", line 540, in _get_gca_resource
    name=resource_name, retry=_DEFAULT_RETRY
  File "/opt/conda/lib/python3.7/site-packages/google/cloud/aiplatform_v1/services/metadata_service/client.py", line 635, in get_metadata_store
    response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
  File "/opt/conda/lib/python3.7/site-packages/google/api_core/gapic_v1/method.py", line 154, in __call__
    return wrapped_func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
    on_error=on_error,
  File "/opt/conda/lib/python3.7/site-packages/google/api_core/retry.py", line 190, in retry_target
    return target()
  File "/opt/conda/lib/python3.7/site-packages/google/api_core/grpc_helpers.py", line 68, in error_remapped_callable
    raise exceptions.from_grpc_error(exc) from exc

google.api_core.exceptions.PermissionDenied: 403 Request had insufficient authentication scopes. [reason: "ACCESS_TOKEN_SCOPE_INSUFFICIENT"

domain: "googleapis.com"

metadata {
  key: "method"
  value: "google.cloud.aiplatform.v1.MetadataService.GetMetadataStore"
}

metadata {
  key: "service"
  value: "aiplatform.googleapis.com"
}
]

The same pair of tracebacks then appeared twice more in the log, differing only in the timestamp and peer address (ipv4:172.217.212.95:443 at 1639078886.996219456, then ipv4:74.125.70.95:443 at 1639078930.850171481).

sasha-gitg commented 2 years ago

@jasonbrancazio

  1. A CustomJob runs in a separate project managed by Google, and that project is the default one retrieved by client libraries. Reference: Accessing Google Cloud Services in your code.

Please pass your project into init explicitly:

aiplatform.init(project='my-project')

An alternative is to retrieve it from the environment if it's the same project the CustomJob is launched from:

aiplatform.init(project=os.environ.get("CLOUD_ML_PROJECT_ID"))

  2. Please pass custom credentials explicitly into init as well:

aiplatform.init(credentials=creds)

You can pass the entire configuration as one call:

aiplatform.init(project='my-project', credentials=creds, experiment='test')
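
A minimal sketch of the project lookup described above, assuming the training code prefers an explicitly supplied project ID and falls back to CLOUD_ML_PROJECT_ID. The helper names resolve_project and init_vertex are hypothetical and not part of the SDK; only aiplatform.init() and its project, credentials, and experiment arguments come from this thread.

```python
import os


def resolve_project(explicit_project=None, environ=None):
    """Return the project ID to pass to aiplatform.init().

    Preference order: an explicitly supplied ID, then the
    CLOUD_ML_PROJECT_ID environment variable mentioned above.
    """
    environ = os.environ if environ is None else environ
    project = explicit_project or environ.get("CLOUD_ML_PROJECT_ID")
    if not project:
        raise ValueError(
            "No project ID found: pass one explicitly or set CLOUD_ML_PROJECT_ID"
        )
    return project


def init_vertex(experiment, explicit_project=None, credentials=None):
    """Call aiplatform.init() with an explicit project instead of the default."""
    # Imported lazily so resolve_project() can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project=resolve_project(explicit_project),
        credentials=credentials,
        experiment=experiment,
    )
```

Failing fast with a ValueError keeps the job from silently falling back to the Google-managed project that the client libraries would otherwise pick up.
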
jasonbrancazio commented 2 years ago

@sasha-gitg can you provide more details about how to instantiate the credentials? I'm trying to avoid copying a service account .json file to the Docker image.

It's interesting that I can access BigQuery and Cloud Storage inside a CustomJob without using a service account, but I can't initialize the aiplatform module.

jasonbrancazio commented 2 years ago

I think I found a relevant comment in the Vertex AI documentation: https://cloud.google.com/vertex-ai/docs/general/access-control#grant_service_agents_access_to_other_resources

"Note: If you want your custom training code to obtain an OAuth 2.0 access token with the https://www.googleapis.com/auth/cloud-platform scope, then you must use a custom service account for training. You cannot give this level of access to the Vertex AI Custom Code Service Agent."

Looks like I'm stuck with a custom service account. This is a limitation that the Vertex AI team should consider addressing. It seems unusual that someone should have to use a custom service account just to run aiplatform.init() in a CustomJob.

jasonbrancazio commented 2 years ago

For the sake of completeness, I'm confirming that I resolved my issue using something like the code snippet below inside a CustomJob using a custom service account. Thanks @sasha-gitg

# in a task.py called at the start of a CustomJob...
import os

import google.auth
from google.cloud import aiplatform

# Point Application Default Credentials at the copied service account key
CREDENTIAL_PATH = '/training_app/creds.json'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = CREDENTIAL_PATH
# Confirm the key file loads; aiplatform.init() resolves the same default credentials
credentials, project = google.auth.default()
aiplatform.init(project='my-project', experiment='my-experiment')
aiplatform.start_run(run='fakerun2')
# log your hyperparams to the experiment at the start of the run
aiplatform.log_params({'test_param': 0.01})
# train a model in your CustomJob, then log metrics
aiplatform.log_metrics({'test_metric': 1})

As you can see, I first tested by copying a service account .json file and setting GOOGLE_APPLICATION_CREDENTIALS. But there is an easier and more secure way that is not well documented.

You can simply specify the email address of a service account when running a CustomJob with the Python client, rather than passing credentials to aiplatform.init() as @sasha-gitg suggested or copying the service account .json file into the Docker image. (Note that you can't use the UI to run the CustomJob if you go this route.)

You can give your custom service account the same "Vertex AI Custom Code Service Agent" IAM role that is used by the service agent.

Running the custom job then looks something like this:

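from google.cloud import aiplatform

SERVICE_ACCOUNT_EMAIL_ADDRESS = 'some_service_account@your_project.iam.gserviceaccount.com'

# experiment_run_name, worker_pool_specs, and staging_bucket are defined elsewhere
custom_job = aiplatform.CustomJob(
    display_name=experiment_run_name,
    worker_pool_specs=worker_pool_specs,
    staging_bucket=staging_bucket,
)

# Run the job as the custom service account instead of the default service agent
custom_job.run(
    service_account=SERVICE_ACCOUNT_EMAIL_ADDRESS,
    enable_web_access=True,
)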

It was fun to use enable_web_access to figure this out: you can temporarily modify your application code to have your container enter an infinite loop, then navigate to the CustomJob in the console, click the link to open up a terminal for debugging, and use ipython to see what privileges the running container has.
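
As a rough sketch of the kind of check that could be run from that debugging terminal, the snippet below asks Google's tokeninfo endpoint which scopes the container's current access token carries. It assumes google-auth and requests are installed in the container and that the endpoint is reachable from the job. The helper names are hypothetical, and this is not necessarily how the scopes were inspected in this issue.

```python
CLOUD_PLATFORM_SCOPE = "https://www.googleapis.com/auth/cloud-platform"


def has_cloud_platform_scope(scope_string):
    """Return True if a space-separated OAuth scope string includes cloud-platform."""
    return CLOUD_PLATFORM_SCOPE in scope_string.split()


def fetch_token_scopes():
    """Ask the tokeninfo endpoint which scopes the current access token carries."""
    # Imported lazily: google-auth and requests must be installed in the container.
    import google.auth
    import google.auth.transport.requests
    import requests

    credentials, _ = google.auth.default()
    # Refresh so that credentials.token holds a usable access token.
    credentials.refresh(google.auth.transport.requests.Request())
    response = requests.get(
        "https://oauth2.googleapis.com/tokeninfo",
        params={"access_token": credentials.token},
        timeout=10,
    )
    response.raise_for_status()
    # The response is expected to list scopes as one space-separated string.
    return response.json().get("scope", "")
```

Calling has_cloud_platform_scope(fetch_token_scopes()) inside the container would then indicate whether the token is broad enough for the metadata store call that failed above.
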

These docs were helpful, in particular the statement "If you are creating a CustomJob, specify the service account's email address in CustomJob.jobSpec.serviceAccount": https://cloud.google.com/vertex-ai/docs/general/custom-service-account#attach

Note that the Python client does not have a way to specify CustomJob.jobSpec.serviceAccount directly. I had to check out the source code for aiplatform.CustomJob and its run() method to see how to specify the service account email.
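
The worker_pool_specs value passed to aiplatform.CustomJob above is not shown in this thread. For a single custom container it might look something like the sketch below. The machine type and the helper names build_worker_pool_specs and run_job_as are illustrative assumptions, not values taken from the issue.

```python
def build_worker_pool_specs(image_uri, machine_type="n1-standard-4", replica_count=1):
    """Build a single-pool worker_pool_specs list for a custom container."""
    return [
        {
            "machine_spec": {"machine_type": machine_type},
            "replica_count": replica_count,
            "container_spec": {"image_uri": image_uri},
        }
    ]


def run_job_as(service_account, image_uri, display_name, staging_bucket):
    """Create a CustomJob and run it under the given service account email."""
    # Imported lazily so the spec builder works without the SDK installed.
    from google.cloud import aiplatform

    job = aiplatform.CustomJob(
        display_name=display_name,
        worker_pool_specs=build_worker_pool_specs(image_uri),
        staging_bucket=staging_bucket,
    )
    # service_account is the run() argument that sets the job's service account.
    job.run(service_account=service_account)
    return job
```

Keeping the spec in a plain function makes it easy to inspect the dictionary before submitting the job.
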

morgandu commented 2 years ago

Closing the issue since it seems fixed. Feel free to reopen if there are other related issues.