googleapis / python-aiplatform

A Python SDK for Vertex AI, a fully managed, end-to-end platform for data science and machine learning.
Apache License 2.0

401 Client Error trying to capture Profile using Vertex AI Tensorboard #3913

Open conormelody opened 3 weeks ago

conormelody commented 3 weeks ago

I am trying to use the Vertex AI TensorFlow Profiler to profile my custom training job, based on this documentation.

My custom job runs to completion, but I am unable to capture a profile in Vertex AI TensorBoard despite following the steps in the "Capture a profiling session" section of the documentation.

When I click "Capture" in the Profile section of Vertex AI TensorBoard after following the steps above, I receive the following error:

Failed to capture profile: 401 Client Error: Unauthorized for url: https://...-dot-europe-west4.aiplatform-training.googleusercontent.com/profile/capture_profile?service_addr=workerpool0-0&is_tpu_name=false&duration=1000&num_retry=3&worker_list=&host_tracer_level=2&device_tracer_level=1&python_tracer_level=0&delay=0 Invalid OAuth Token . For information on how to setup the profiler, please visit: https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-profiler
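To make the failing request easier to read, here is a minimal sketch that splits the URL from the error message into its path and query parameters using only the standard library. The helper name `capture_request_params` is hypothetical, and the elided host prefix (`...`) is kept exactly as it appears in the error. Nothing here contacts Vertex AI; it only parses the string.

```python
from urllib.parse import parse_qsl, urlsplit

# URL copied from the error message above; the "..." host prefix is elided
# in the original error and is intentionally left as-is.
FAILED_URL = (
    "https://...-dot-europe-west4.aiplatform-training.googleusercontent.com"
    "/profile/capture_profile?service_addr=workerpool0-0&is_tpu_name=false"
    "&duration=1000&num_retry=3&worker_list=&host_tracer_level=2"
    "&device_tracer_level=1&python_tracer_level=0&delay=0"
)


def capture_request_params(url: str) -> dict[str, str]:
    """Return the query parameters of a capture_profile URL as a dict.

    Hypothetical helper for illustration; blank values such as
    ``worker_list=`` are kept rather than dropped.
    """
    query = urlsplit(url).query
    return dict(parse_qsl(query, keep_blank_values=True))


parts = urlsplit(FAILED_URL)
params = capture_request_params(FAILED_URL)

print("endpoint path:", parts.path)
for name, value in params.items():
    print(f"  {name} = {value!r}")
```

The request targets worker `workerpool0-0` with what look like ordinary capture settings, which may suggest that the 401 comes from the OAuth token check on the training endpoint rather than from the profiling parameters themselves.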

The documentation linked above references the roles/storage.admin and roles/aiplatform.user service account roles. My own service account and the service account used to run the custom training job each have both of these roles.
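One way to double-check the role claim is to inspect the project's IAM policy directly. The sketch below assumes the policy has been exported as JSON (for example with `gcloud projects get-iam-policy`) and loaded into a dict with the standard `bindings` list of `role` and `members` entries. The function name `missing_roles`, the example policy, and the service account address are all hypothetical. It only looks at project-level bindings and ignores IAM conditions and inherited or resource-level grants.

```python
# Roles named in the profiler documentation, as quoted in this issue.
REQUIRED_ROLES = {"roles/storage.admin", "roles/aiplatform.user"}


def missing_roles(policy: dict, member: str, required=REQUIRED_ROLES) -> set[str]:
    """Return the required roles that `member` does not hold in `policy`.

    `policy` is an IAM policy dict with a "bindings" list; `member` is a
    principal string such as "serviceAccount:<email>".
    """
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return set(required) - granted


# Hypothetical policy and service account, for illustration only.
example_member = "serviceAccount:trainer@my-project.iam.gserviceaccount.com"
example_policy = {
    "bindings": [
        {"role": "roles/storage.admin", "members": [example_member]},
        {"role": "roles/aiplatform.user", "members": ["user:someone@example.com"]},
    ]
}

print(missing_roles(example_policy, example_member))
```

An empty set for both the training service account and the account used to open TensorBoard would confirm that the documented roles are in place, so the 401 would have to come from somewhere else.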

Are there additional permissions required in order to successfully capture a profiling session? Any help/advice on solving this issue would be greatly appreciated!


conormelody commented 2 weeks ago

Related to this issue, I found the following example in the vertex-ai-samples repo, which might help reproduce it.

I copied the training code and Dockerfile from this notebook and created a new service account in the same project with only the Storage Admin and Vertex AI User roles. (The documentation referenced above mentions roles/aiplatform.user, but I guess this has been deprecated given that AI Platform has been renamed Vertex AI?) This service account is used to run the custom training job.

My own user account has the Service Account User role in this project, so it should have permission to act as (ActAs) this service account.

Training code (train.py):


"""Train an MNIST model and use cloud_profiler for profiling."""

import argparse
import os
import sys
import traceback

import tensorflow as tf
from google.cloud.aiplatform.training_utils import cloud_profiler

def _create_model():
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )
    return model

def main(args):
    # Initialize the profiler.
    print('Initialize the profiler ...')

    try:
        cloud_profiler.init()
    except Exception:
        # Profiling is optional: log the failure and continue training.
        ex_type, ex_value, ex_traceback = sys.exc_info()
        print("*** Unexpected:", ex_type.__name__, ex_value)
        traceback.print_tb(ex_traceback, limit=10, file=sys.stdout)
    else:
        print('The profiler initialized.')

    print('Loading and preprocessing data ...')
    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    print('Creating and training model ...')

    model = _create_model()
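    model.compile(
        optimizer="adam",
        # The final Dense(10) layer outputs raw logits, so the loss needs from_logits=True.
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )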

    # Vertex AI sets AIP_TENSORBOARD_LOG_DIR; fall back to a local directory otherwise.
    log_dir = os.environ.get('AIP_TENSORBOARD_LOG_DIR', "logs")

    print('Setting up the TensorBoard callback ...')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=1)

    print('Training model ...')
    model.fit(
        x_train,
        y_train,
        epochs=args.epochs,
        verbose=0,
        callbacks=[tensorboard_callback],
    )
    print('Training completed.')

    print('Saving model ...')

    # Vertex AI sets AIP_MODEL_DIR; fall back to a local directory otherwise.
    model_dir = os.environ.get('AIP_MODEL_DIR', "model")
    tf.saved_model.save(model, model_dir)

    print('Model saved at ' + model_dir)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--epochs", type=int, default=100, help="Number of epochs to run model."
    )

    args = parser.parse_args()
    main(args)

Dockerfile:

# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-9:latest
WORKDIR /root

# Installs additional packages as you need.
RUN pip3 install "google-cloud-aiplatform[cloud_profiler]>=1.20.0"
RUN pip3 install "protobuf==3.20.3"

# Copies the trainer code to the docker image.
COPY . .

ENTRYPOINT ["python3", "train.py"]

The training job runs to completion, but I'm still seeing this permission issue when trying to capture a profile. Please see the attached screenshots.

[Screenshot: capture_profile]

[Screenshot: client_error]