Open conormelody opened 5 months ago
Related to this issue, I found the following example in the vertex-ai-samples repo, which might help reproduce it.
I have copied the training code and Dockerfile from this notebook and created a new service account in the same project with just the Storage Admin and Vertex AI User roles (the documentation referenced above mentions roles/aiplatform.user, but I guess this has been deprecated now that AI Platform has been renamed Vertex AI?). This service account is used to run the custom training job.
My own user account has the Service Account User role in this project, so it should have permission to ActAs this service account.
Training code (train.py):
"""Train an mnist model and use cloud_profiler for profiling."""
import tensorflow as tf
import argparse
import os
import sys, traceback
from google.cloud.aiplatform.training_utils import cloud_profiler


def _create_model():
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )
    return model


def main(args):
    # Initialize the profiler.
    print('Initialize the profiler ...')
    try:
        cloud_profiler.init()
    except:
        ex_type, ex_value, ex_traceback = sys.exc_info()
        print("*** Unexpected:", ex_type.__name__, ex_value)
        traceback.print_tb(ex_traceback, limit=10, file=sys.stdout)
    print('The profiler initiated.')

    print('Loading and preprocessing data ...')
    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    print('Creating and training model ...')
    model = _create_model()
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=["accuracy"],
    )

    log_dir = "logs"
    if 'AIP_TENSORBOARD_LOG_DIR' in os.environ:
        log_dir = os.environ['AIP_TENSORBOARD_LOG_DIR']

    print('Setting up the TensorBoard callback ...')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir,
        histogram_freq=1)

    print('Training model ...')
    model.fit(
        x_train,
        y_train,
        epochs=args.epochs,
        verbose=0,
        callbacks=[tensorboard_callback],
    )
    print('Training completed.')

    print('Saving model ...')
    model_dir = "model"
    if 'AIP_MODEL_DIR' in os.environ:
        model_dir = os.environ['AIP_MODEL_DIR']
    tf.saved_model.save(model, model_dir)
    print('Model saved at ' + model_dir)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--epochs", type=int, default=100, help="Number of epochs to run model."
    )
    args = parser.parse_args()
    main(args)
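One thing to be aware of when reading the job logs: the script prints 'The profiler initiated.' even if cloud_profiler.init() raised, so that message alone does not confirm the profiler started. A minimal sketch of a wrapper that reports the outcome explicitly is below. `init_profiler` is a hypothetical helper (not part of the SDK), and it takes the init function as an argument so it can be tried without Vertex AI installed.

```python
import traceback


def init_profiler(init_fn):
    """Call init_fn and return True on success, False if it raised.

    Hypothetical helper for illustration; in train.py it would be
    called as init_profiler(cloud_profiler.init).
    """
    try:
        init_fn()
    except Exception as exc:
        # Log the failure clearly instead of falling through to a
        # success message.
        print(f"Profiler init failed: {type(exc).__name__}: {exc}")
        traceback.print_exc()
        return False
    print("Profiler initialized.")
    return True
```

If this returns False inside the training container, the capture error may be caused by the profiler never starting rather than by missing IAM permissions, so the two cases are worth separating.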
Dockerfile:
# Specifies base image and tag
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-9:latest
WORKDIR /root
# Installs additional packages as you need.
RUN pip3 install "google-cloud-aiplatform[cloud_profiler]>=1.20.0"
RUN pip3 install "protobuf==3.20.3"
# Copies the trainer code to the docker image.
COPY . .
ENTRYPOINT ["python3", "train.py"]
The training job runs successfully to completion, but I'm still seeing this permission issue when trying to capture a profile. Please see screenshots attached.
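For anyone trying to reproduce this, a job like the one described could be submitted with the Vertex AI SDK roughly as follows. This is a sketch under assumptions, not the exact submission code used here: the project, region, bucket, image URI, service account email and TensorBoard resource name are all placeholders, and `build_run_kwargs` and `submit_job` are hypothetical helper names.

```python
def build_run_kwargs(service_account, tensorboard, base_output_dir, epochs=5):
    """Collect the arguments passed to CustomContainerTrainingJob.run().

    service_account: email of the account the job runs as (placeholder).
    tensorboard: full resource name of the Vertex AI TensorBoard instance.
    """
    return {
        "service_account": service_account,
        "tensorboard": tensorboard,
        "base_output_dir": base_output_dir,
        "args": [f"--epochs={epochs}"],
        "replica_count": 1,
        "machine_type": "n1-standard-4",
    }


def submit_job(project, location, staging_bucket, image_uri, run_kwargs):
    """Submit the training container as a custom job (needs GCP credentials)."""
    # Imported lazily so build_run_kwargs can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(
        project=project, location=location, staging_bucket=staging_bucket
    )
    job = aiplatform.CustomContainerTrainingJob(
        display_name="profiler-repro",  # placeholder name
        container_uri=image_uri,
    )
    job.run(**run_kwargs)
    return job


# Placeholder values only; replace with real resources before submitting.
run_kwargs = build_run_kwargs(
    service_account="trainer@my-project.iam.gserviceaccount.com",
    tensorboard="projects/123456789/locations/us-central1/tensorboards/987654321",
    base_output_dir="gs://my-bucket/profiler-repro",
)
```

The job is only attached to a TensorBoard instance when both `tensorboard` and `service_account` are passed to `run()`, which is why they are grouped together here.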
I am trying to use the Vertex AI TensorFlow Profiler to profile my custom training job based on this documentation.
My custom job runs successfully to completion, but I am unable to capture a profile in Vertex AI TensorBoard despite following the steps in the "Capture a profiling session" section of the documentation.
When I click "Capture" in the Profile section of Vertex AI TensorBoard after following the steps above, I receive an error which looks like:
The documentation linked above references the roles/storage.admin and roles/aiplatform.user roles. My own service account and the service account used to run the custom training job both have these roles.
Are there additional permissions required in order to successfully capture a profiling session? Any help/advice on solving this issue would be greatly appreciated!
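To rule out a simple missing binding, the role assignments described above can be checked against the project's IAM policy. The sketch below assumes the policy was exported as JSON (for example with `gcloud projects get-iam-policy PROJECT_ID --format=json`) and loaded into a dict; `missing_roles` is a hypothetical helper, the member email is a placeholder, and conditional bindings are ignored. It only confirms the two documented roles are present and does not tell you whether further permissions are needed.

```python
REQUIRED_ROLES = ["roles/storage.admin", "roles/aiplatform.user"]


def missing_roles(policy, member, required=REQUIRED_ROLES):
    """Return the roles in `required` that `member` is not bound to.

    policy: dict with a "bindings" list, each entry holding a "role"
            and a list of "members" (the IAM policy JSON layout).
    member: e.g. "serviceAccount:NAME@PROJECT.iam.gserviceaccount.com".
    """
    granted = {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }
    return [role for role in required if role not in granted]


# Example with a hand-written policy (placeholder member and project).
example_policy = {
    "bindings": [
        {
            "role": "roles/storage.admin",
            "members": ["serviceAccount:trainer@my-project.iam.gserviceaccount.com"],
        }
    ]
}
print(
    missing_roles(
        example_policy,
        "serviceAccount:trainer@my-project.iam.gserviceaccount.com",
    )
)
```

An empty list means both documented roles are bound at the project level; roles granted only on individual buckets or other resources would not show up in this policy.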
Environment details
google-cloud-aiplatform
version: google-cloud-aiplatform[cloud_profiler]==1.52.0