statmike / vertex-ai-mlops

Google Cloud Platform Vertex AI end-to-end workflows for machine learning operations
Apache License 2.0

Vertex AI Experiment Error (notebook 05i) #37

Closed Jay2201 closed 1 year ago

Jay2201 commented 1 year ago

I get an error while logging the run name in Vertex AI Experiments when running the hyperparameter tuning notebook.

google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists

I am using the same code as the notebook and it gives this error.

savi-bhide commented 1 year ago

Yeah, I'm also facing a similar issue.

sakshi74 commented 1 year ago

I am facing the same issue.

SARANYA-J commented 1 year ago

I am experiencing the same issue.

rsher60 commented 1 year ago

Try using this code from the 99_cleanup file:

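from google.cloud import aiplatform

# Assumes aiplatform.init(project=..., location=...) has already been called.
# Delete every run in every experiment, then the experiment itself.
exps = aiplatform.Experiment.list()
for exp in exps:
    runs = aiplatform.ExperimentRun.list(experiment = exp.name)
    print(f'Experiment: {exp.name}')
    for run in runs:
        print(f'Run: {run.name}')
        run.delete(delete_backing_tensorboard_run = False)
    exp.delete(delete_backing_tensorboard_runs = False)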

Can you please try this before running the hyperparameter script?

Jay2201 commented 1 year ago

I already tried that, even when there were no experiments in the Vertex AI Experiments tab.

statmike commented 1 year ago

Is this for notebook 05i? A quick thought here. If you are running the same notebook multiple times, then it can be important to rerun it starting from the very top. There is a line RUN_NAME = f'run-{TIMESTAMP}' that will use a new TIMESTAMP each time to make sure the run names end up being unique.
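To illustrate why rerunning from the top matters, here is a minimal sketch. The helper name new_run_name is hypothetical and not part of the notebook; it only mirrors the RUN_NAME = f'run-{TIMESTAMP}' pattern. A name computed once is reused until the cell is run again, while a name computed at a later time differs.

```python
from datetime import datetime

def new_run_name(now=None):
    """Hypothetical helper: build a run name from a timestamp."""
    now = now or datetime.now()
    return f"run-{now.strftime('%Y%m%d%H%M%S')}"

# Two notebook sessions started a few minutes apart get different names.
first = new_run_name(datetime(2023, 2, 2, 12, 2, 31))
second = new_run_name(datetime(2023, 2, 2, 12, 5, 0))
print(first)   # run-20230202120231
print(second)  # run-20230202120500
```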

Jay2201 commented 1 year ago

Yes, it's 05i. I thought the same initially, so I made the timestamp dynamic as well, but it still gives the same issue.

I then separately ran an HPT job without logging any run in Vertex AI Experiments, and it ran successfully.

statmike commented 1 year ago

Thank you. I will try to help you through the chat here. Can you let me know which part of the notebook ends in the error - which cell? From that I will have a few more steps to request so I can understand what the error is here.

Jay2201 commented 1 year ago

[Image: screenshot of code from the training script]

If you look at the 3rd line, the run name is already dynamic because the trial ID is appended at the end.

Now I am getting the error below on the 4th line:

google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists.

I won't be able to share a screenshot of the logs because I am using a client's environment.

Even after making TIMESTAMP dynamic, it gives the same error.

statmike commented 1 year ago

Hello @Jay2201, I see that you are referencing a line from the script in ./code/hp_train.py.

This script is copied to GCS with a new name and then into the container used for training. The training job is created by the cell that looks like:

customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)

This references the object WORKER_POOL_SPEC, which is defined in the notebook cell with this code:

WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": f"{REPOSITORY}/{EXPERIMENT}_trainer",
            "command": [],
            "args": CMDARGS
        }
    }
]

Part of this definition is the additional object CMDARGS, which is defined in the notebook with:

CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]

This is where the values of experiment_name and run_name get passed in. They are defined at the top of the notebook in the cell that looks like:

FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'

The unique part of this will be RUN_NAME because it has a value TIMESTAMP that is also defined near the top of the notebook in a cell that looks like:

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"

It looks like the value of TIMESTAMP the notebook is using on your run may have already been used before. Is this possible?

The only other possibility I can think of is that multiple values of hpt.trial_id are the same, but I have not run into that before.
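As a rough sketch of how these names combine: judging from the error message, the context ID looks like the experiment name, the run name, and the trial ID joined with hyphens. The helper names below are hypothetical and only mimic that pattern; they are not taken from hp_train.py.

```python
def trial_run_name(run_name, trial_id):
    """Append the trial ID to the base run name (assumed pattern)."""
    return f"{run_name}-{trial_id}"

def context_id(experiment_name, run_name):
    """Context ID as it appears in the error: experiment name, then run name."""
    return f"{experiment_name}-{run_name}"

def duplicated_trial_ids(trial_ids):
    """Return the trial IDs that occur more than once, sorted."""
    seen, dupes = set(), set()
    for trial_id in trial_ids:
        if trial_id in seen:
            dupes.add(trial_id)
        seen.add(trial_id)
    return sorted(dupes)

name = context_id(
    "experiment-05-05i-tf-classification-dnn",
    trial_run_name("run-20230202120231", "1"),
)
print(name)  # experiment-05-05i-tf-classification-dnn-run-20230202120231-1
```

If two trials ever reported the same trial ID, both would map to the same context ID, and the second create call would fail with AlreadyExists.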

Thank You

Jay2201 commented 1 year ago

Yes @statmike ,

While running the notebook I made sure that the TIMESTAMP value changes every time, and I still got the error.

hpt.trial_id might be the issue, which is my guess as well, but I have not found a resolution yet.

I also ran the 05i notebook in November 2022, and at that time I did not face any issues.

savi-bhide commented 1 year ago

Yeah, this issue looks similar to mine.

sakshi74 commented 1 year ago

It's the same for me too.

My issue is also the same.

statmike commented 1 year ago

Hello @Jay2201 , I just did a test run of the notebook in an environment where it was run previously and did not encounter any errors. I am going to cover the diagnostics I did here in case you want to replicate the steps for troubleshooting in your environment.

On the Vertex AI Console Page for Training, HyperParameter Tuning Jobs tab, select the current job related to the notebook. This gives a list of all the tuning trials and includes links to the logs for each:

[Screenshot: list of tuning trials with links to their logs]

I went to the logs for each of these tuning trials and looked for the result of the line that creates the experiment run:

expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)

Here are the values I found in the logs for the first 6 trials:

Are you making any other changes to the tutorial notebook that might need to be investigated for causing this issue? Thank You

Jay2201 commented 1 year ago

Let me run the notebook again and see whether I get the same error.

As far as I remember, I am only changing the region; I am running the rest of the code as is.

sakshi74 commented 1 year ago

Hi @statmike,

I tried the same, but I am using REGION = "europe-west2". I am getting the same error, shown below.

google.api_core.exceptions.AlreadyExists: 409 Context with name projects/123456/locations/europe-west2/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203194023-1 already exists

It seems like a region-specific issue. If possible, could you please try it with REGION = "europe-west2"?

Thanks in advance!

Jay2201 commented 1 year ago

@statmike - I am getting the same error with the above region.

statmike commented 1 year ago

Hello @Jay2201, when you run this job, what are you using for parallel_trial_count? The example uses 3. If you go to the logs for each of the parallel jobs, do they all have this error, or does one of them succeed in Associating projects/... to Experiment: ...

Jay2201 commented 1 year ago

I have tried both 3 and 2 for the parallel trial count. I get the same error for all of them.

statmike commented 1 year ago

Hello @Jay2201 , If all of the initial set of trials specified by parallel_trial_count are giving the same error then it seems to indicate the runs are being created before the job. I have some ideas for diagnostics here.

Initialize Parameters and Clients:

PROJECT_ID = 'your-project-here'  # replace with your project ID
REGION = 'europe-west2'
EXPERIMENT_NAME = 'experiment-05-05i-tf-classification-dnn'
TIMESTAMP = 20230203194023

from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION)

Return the known runs for the experiment:

exp = aiplatform.Experiment(experiment_name = EXPERIMENT_NAME)
exp_runs = exp.get_data_frame()
exp_runs

If needed, subset to the runs for the specific TIMESTAMP value:

exp_runs[exp_runs['run_name'].str.contains(f'run-{TIMESTAMP}')]
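The same check can also be sketched on a plain list of run names, for example one taken from exp_runs['run_name'].tolist(). The function name and the sample names below are hypothetical and for illustration only; they are not results from any real project.

```python
def runs_for_timestamp(run_names, timestamp):
    """Return the run names that contain 'run-<timestamp>'."""
    needle = f"run-{timestamp}"
    return [name for name in run_names if needle in name]

# Hypothetical run names, not real results.
known_runs = [
    "run-20230203194023-1",
    "run-20230203194023-2",
    "run-20230101000000-1",
]
matches = runs_for_timestamp(known_runs, 20230203194023)
print(matches)  # ['run-20230203194023-1', 'run-20230203194023-2']
```

A non-empty result before the tuning job has started would suggest the runs were created earlier than the job itself.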

Let me know how the results of these checks for the experiment and logged runs work out. Thank You

statmike commented 1 year ago

Hi @Jay2201, have you had any luck troubleshooting the run name already existing?

I gave this some thought over the last week. In cases where a run name may already exist, it could be desirable to add to the experiment run or overwrite information based on updated data. I made a small alteration to accommodate this, which might also help in your situation. If the run name is already defined, it will attach to it rather than try to create a new run with the same name. Inside the scripts that each of the notebooks 05a-05i calls, the following change has been made:

Before:

# Vertex AI Experiment
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)

After:

# Vertex AI Experiment
if args.run_name in [run.name for run in aiplatform.ExperimentRun.list(experiment = args.experiment_name)]:
    expRun = aiplatform.ExperimentRun(run_name = args.run_name, experiment = args.experiment_name)
else:
    expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
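
An alternative shape for the same idea, offered only as a sketch and not as the change made in the repository: attempt the create first and fall back to attaching when the "already exists" error is raised. This avoids listing every run. The helper get_or_create and the in-memory stand-ins are hypothetical so that the block runs without a cloud project; the commented lines show how it might be wired to the calls used above.

```python
def get_or_create(create, attach, already_exists):
    """Try create(); if it raises already_exists, return attach() instead."""
    try:
        return create()
    except already_exists:
        return attach()

# Possible wiring with the SDK (untested assumption):
# from google.api_core.exceptions import AlreadyExists
# expRun = get_or_create(
#     create=lambda: aiplatform.ExperimentRun.create(
#         run_name=args.run_name, experiment=args.experiment_name),
#     attach=lambda: aiplatform.ExperimentRun(
#         run_name=args.run_name, experiment=args.experiment_name),
#     already_exists=AlreadyExists,
# )

# In-memory stand-ins to show the behavior.
class FakeAlreadyExists(Exception):
    pass

store = {}

def fake_create(name):
    if name in store:
        raise FakeAlreadyExists(name)
    store[name] = {"name": name}
    return store[name]

first = get_or_create(lambda: fake_create("run-1"), lambda: store["run-1"], FakeAlreadyExists)
second = get_or_create(lambda: fake_create("run-1"), lambda: store["run-1"], FakeAlreadyExists)
print(first is second)  # True
```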

Jay2201 commented 1 year ago

Thanks @statmike, I will definitely check this and update you. I have been busy with another task, so I have not had time yet.

Jay2201 commented 1 year ago

Hey @statmike, sorry for the late reply. I tested it and it runs fine on your notebook, but because I am using multiple GPUs I have multiple runs, which are not updating in Vertex AI Experiments. Thanks for the solution 🙂