Closed Jay2201 closed 1 year ago
Yeah, I'm also facing a similar issue.
I am facing the same issue.
I am experiencing the same issue.
Try using this code from the 99_cleanup file:
exps = aiplatform.Experiment.list()
for exp in exps:
    runs = aiplatform.ExperimentRun.list(experiment = exp.name)
    print(f'Experiment: {exp.name}')
    for run in runs:
        print(f'Run: {run.name}')
        run.delete(delete_backing_tensorboard_run = False)
    exp.delete(delete_backing_tensorboard_runs = False)
Can you please try this before running the hyperparameter script?
I already tried this when there were no experiments in the Vertex AI Experiments tab.
Is this for notebook 05i? A quick thought: if you are running the same notebook multiple times, it can be important to rerun it starting from the very top. There is a line
RUN_NAME = f'run-{TIMESTAMP}'
that uses a new TIMESTAMP each time, so the run names end up being unique.
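As a minimal sketch of why this matters (this is not code from the notebook; make_run_name is a hypothetical helper, and the strftime format is assumed to match the notebook's TIMESTAMP cell): a second-resolution timestamp only produces a new run name when the cell that computes it is executed again, and two executions within the same second would still collide.

```python
from datetime import datetime

def make_run_name(now: datetime) -> str:
    """Build a run name from a second-resolution timestamp (hypothetical helper)."""
    timestamp = now.strftime("%Y%m%d%H%M%S")
    return f"run-{timestamp}"

# Two executions one second apart give different names.
first = make_run_name(datetime(2023, 2, 2, 12, 2, 31))
second = make_run_name(datetime(2023, 2, 2, 12, 2, 32))

# Reusing a stale timestamp (the cell was not rerun) gives the same name again.
stale = make_run_name(datetime(2023, 2, 2, 12, 2, 31))

print(first, second, stale)
```

If only the later cells are rerun, TIMESTAMP keeps its old value and RUN_NAME repeats, which is the situation the `stale` value imitates.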
Yes, it's 05i. I thought the same initially, so I made the timestamp dynamic as well, but it still gives the same issue.
I then separately ran an HPT job without logging any run to Vertex AI Experiments, and it ran successfully.
Thank you. I will try to help you through the chat here. Can you let me know which part of the notebook ends in the error - which cell? From that I will have a few more steps to request so I can understand what the error is here.
If you look at the 3rd line, the run name is already dynamic because the trial ID is appended at the end.
Now I am getting the error below on the 4th line:
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists
I won't be able to share a screenshot of the logs because I am using a client's environment.
Even with a dynamic TIMESTAMP, it gives the same error.
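For anyone reading the error, here is a small sketch that breaks the resource name from the message into its parts. The helper parse_resource_name is hypothetical and not part of the Vertex AI SDK. Judging only from this message, the context ID appears to be the experiment name joined to the run name with a hyphen, so the clash is on the combination of both.

```python
def parse_resource_name(resource_name: str) -> dict:
    """Split 'key/value/key/value/...' into a dict (hypothetical helper)."""
    parts = resource_name.split("/")
    return dict(zip(parts[0::2], parts[1::2]))

# Resource name copied from the error message in this thread.
name = (
    "projects/1234/locations/us-central1/metadataStores/default/"
    "contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1"
)

fields = parse_resource_name(name)
context_id = fields["contexts"]

# Experiment name used by the 05i notebook, as quoted elsewhere in this thread.
experiment_name = "experiment-05-05i-tf-classification-dnn"

# Assumption: the context ID is '<experiment_name>-<run_name>'.
run_name = context_id[len(experiment_name) + 1:]

print(fields["locations"])
print(run_name)
```

Here the recovered run name ends in the trial ID, so a collision means that the same timestamp and trial ID pair was already registered under this experiment.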
Hello @Jay2201,
I see that you are referencing a line from the script in ./code/hp_train.py.
This script is copied to GCS with a new name and then into the container used for training. The training job is created by the cell that looks like:
customJob = aiplatform.CustomJob(
    display_name = f'{SERIES}_{EXPERIMENT}_{TIMESTAMP}',
    worker_pool_specs = WORKER_POOL_SPEC,
    base_output_dir = f"{URI}/models/{TIMESTAMP}",
    staging_bucket = f"{URI}/models/{TIMESTAMP}",
    labels = {'series' : f'{SERIES}', 'experiment' : f'{EXPERIMENT}', 'experiment_name' : f'{EXPERIMENT_NAME}', 'run_name' : f'{RUN_NAME}'}
)
This references the object WORKER_POOL_SPEC, which is defined in the notebook cell with this code:
WORKER_POOL_SPEC = [
    {
        "replica_count": 1,
        "machine_spec": MACHINE_SPEC,
        "container_spec": {
            "image_uri": f"{REPOSITORY}/{EXPERIMENT}_trainer",
            "command": [],
            "args": CMDARGS
        }
    }
]
Part of this definition is the additional object CMDARGS, which is defined in the notebook with:
CMDARGS = [
    "--epochs=" + str(EPOCHS),
    "--batch_size=" + str(BATCH_SIZE),
    "--var_target=" + VAR_TARGET,
    "--var_omit=" + VAR_OMIT,
    "--project_id=" + PROJECT_ID,
    "--bq_project=" + BQ_PROJECT,
    "--bq_dataset=" + BQ_DATASET,
    "--bq_table=" + BQ_TABLE,
    "--region=" + REGION,
    "--experiment=" + EXPERIMENT,
    "--series=" + SERIES,
    "--experiment_name=" + EXPERIMENT_NAME,
    "--run_name=" + RUN_NAME
]
This is where the values of experiment_name and run_name get passed in. They are defined at the top of the notebook in the cell that looks like:
FRAMEWORK = 'tf'
TASK = 'classification'
MODEL_TYPE = 'dnn'
EXPERIMENT_NAME = f'experiment-{SERIES}-{EXPERIMENT}-{FRAMEWORK}-{TASK}-{MODEL_TYPE}'
RUN_NAME = f'run-{TIMESTAMP}'
The unique part of this will be RUN_NAME because it includes the value of TIMESTAMP, which is also defined near the top of the notebook in a cell that looks like:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
BUCKET = PROJECT_ID
URI = f"gs://{BUCKET}/{SERIES}/{EXPERIMENT}"
DIR = f"temp/{EXPERIMENT}"
It looks like the value of TIMESTAMP the notebook is using on your run may have already been used before. Is this possible?
The only other possibility I can think of is that multiple values of hpt.trial_id are the same, but I have not run into that before.
Thank You
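To make the second possibility concrete, here is a hedged sketch of how repeated trial IDs would produce clashing run names. The exact line in hp_train.py is not shown in this thread, so trial_run_name is a hypothetical helper that only assumes the trial ID is appended to the run name with a hyphen, as the error message suggests.

```python
from collections import Counter

def trial_run_name(base_run_name: str, trial_id: str) -> str:
    """Append the trial ID to the base run name (assumed naming scheme)."""
    return f"{base_run_name}-{trial_id}"

def duplicated(names: list) -> list:
    """Return the names that occur more than once, sorted."""
    return sorted(name for name, count in Counter(names).items() if count > 1)

base = "run-20230202120231"

# Distinct trial IDs: every trial gets its own run name.
ok_names = [trial_run_name(base, t) for t in ["1", "2", "3"]]

# Hypothetical failure mode: two workers report the same trial ID.
clashing_names = [trial_run_name(base, t) for t in ["1", "1", "2"]]

print(duplicated(ok_names))
print(duplicated(clashing_names))
```

Collecting the run names printed in each trial's log and passing them through a check like `duplicated` is one way to confirm or rule out this cause.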
Yes @statmike,
While running the notebook I made sure that the TIMESTAMP value changes every time, and I still got the error.
I also suspect that hpt.trial_id might be the issue, but I have not found a resolution for it yet.
I also ran the 05i notebook in November 2022, and at that time I did not face any issues.
Yeah, this issue looks similar to mine.
It's the same for me too.
My issue is also the same.
Hello @Jay2201 , I just did a test run of the notebook in an environment where it was run previously and did not encounter any errors. I am going to cover the diagnostics I did here in case you want to replicate the steps for troubleshooting in your environment.
On the Vertex AI Console Page for Training, HyperParameter Tuning Jobs tab, select the current job related to the notebook. This gives a list of all the tuning trials and includes links to the logs for each:
I went to the logs for each of these tuning trials and looked for the result of the line that creates the experiment run:
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
Here are the values I found in the logs for the first 6 trials:
Are you making any other changes to the tutorial notebook that might need to be investigated for causing this issue? Thank You
Let me run the notebook again and see whether I get the same error.
As far as I remember, I am only changing the region; the rest of the code I am running as is.
Hi @statmike,
I tried the same, but I am using REGION = "europe-west2". I am getting the same error, shown below.
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/123456/locations/europe-west2/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230203194023-1 already exists
It seems like a region-specific issue. If possible, could you please try it with REGION = "europe-west2"?
Thanks in advance!
@statmike - I am getting the same error with the above region.
Hello @Jay2201, when you run this job, what are you using for parallel_trial_count? The example uses 3. If you go to the logs for each of the parallel jobs, do they all have this error, or does one of them succeed in Associating projects/... to Experiment: ...
I have tried both 3 and 2 for the parallel trial count. I get the same error for all of them.
Hello @Jay2201 ,
If all of the initial set of trials specified by parallel_trial_count are giving the same error, it seems to indicate that the runs are being created before the job. I have some ideas for diagnostics here.
Initialize Parameters and Clients:
PROJECT_ID = 'your-project-id'  # replace with your project ID
REGION = 'europe-west2'
EXPERIMENT_NAME = 'experiment-05-05i-tf-classification-dnn'
TIMESTAMP = 20230203194023
from google.cloud import aiplatform
aiplatform.init(project=PROJECT_ID, location=REGION)
Return the known runs for the experiment:
exp = aiplatform.Experiment(experiment_name = EXPERIMENT_NAME)
exp_runs = exp.get_data_frame()
exp_runs
If needed, subset to the runs for the specific TIMESTAMP value:
exp_runs[exp_runs['run_name'].str.contains(f'run-{TIMESTAMP}')]
Let me know how the results of these checks for the experiment and logged runs work out. Thank You
Hi @Jay2201, have you had any luck troubleshooting the run name already existing?
I gave this some thought over the last week. In cases where a run name may already exist, it could be desirable to add to the experiment run or overwrite information based on updated data. I made a small alteration to accommodate this, which might also help in your situation. If the run name is already defined, the script will attach to it rather than try to create a new run with the same name. Inside the scripts that each of the 05a-05i notebooks call, the following change has been made:
Before:
# Vertex AI Experiment
expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
After:
# Vertex AI Experiment
if args.run_name in [run.name for run in aiplatform.ExperimentRun.list(experiment = args.experiment_name)]:
    expRun = aiplatform.ExperimentRun(run_name = args.run_name, experiment = args.experiment_name)
else:
    expRun = aiplatform.ExperimentRun.create(run_name = args.run_name, experiment = args.experiment_name)
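One caveat with a check-then-create approach is that something else could still create the run between the check and the create call. A possible variation, sketched here with stand-ins rather than the real SDK, is to attempt the create and fall back to attaching when it fails. RunAlreadyExists, fake_create, and fake_get are hypothetical; in the training script they would correspond to google.api_core.exceptions.AlreadyExists, aiplatform.ExperimentRun.create(...), and aiplatform.ExperimentRun(...), and this variation has not been tested against Vertex AI in this thread.

```python
class RunAlreadyExists(Exception):
    """Stand-in for google.api_core.exceptions.AlreadyExists."""

def get_or_create_run(run_name, create_run, get_run):
    """Try to create the run; if it already exists, attach to it instead."""
    try:
        return create_run(run_name)
    except RunAlreadyExists:
        return get_run(run_name)

# In-memory fake of a run store, used only to exercise the pattern.
store = {}

def fake_create(run_name):
    if run_name in store:
        raise RunAlreadyExists(run_name)
    store[run_name] = {"name": run_name, "metrics": {}}
    return store[run_name]

def fake_get(run_name):
    return store[run_name]

first = get_or_create_run("run-20230203194023-1", fake_create, fake_get)
second = get_or_create_run("run-20230203194023-1", fake_create, fake_get)

# The second call attaches to the run created by the first call.
print(first is second)
```

The trade-off is that an existing run is silently reused, so metrics from a rerun land on the same run; that matches the intent described above but may not suit every workflow.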
Thanks @statmike, I will definitely check this and update you. I am busy with another task, so I have not had time yet.
Hey @statmike, sorry for the late reply. I tested it and it runs fine on your notebook, but because I am using multiple GPUs I have multiple runs, and those are not updating in Vertex AI Experiments. Thanks for the solution 🙂
I am getting an error while logging the run name to Vertex AI Experiments when running the hyperparameter tuning notebook.
google.api_core.exceptions.AlreadyExists: 409 Context with name projects/1234/locations/us-central1/metadataStores/default/contexts/experiment-05-05i-tf-classification-dnn-run-20230202120231-1 already exists
I am using the same code and it gives this error.