Closed: hugoferrero closed this issue 1 year ago
@TheMichaelHu is the code owner of CustomContainerTrainingJob. Michael, PTAL.
@hugoferrero Please share the underlying resource protos for both jobs and indicate the source of creation for each. You can get these through the SDK:
aiplatform.CustomContainerTrainingJob.get(resource_name).gca_resource
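Once both resource protos are available, a field-by-field comparison can help spot what differs between the console-created job and the SDK-created job (for example, the container image or the arguments). The following is only a minimal sketch: `flatten`, `diff_resources`, and `job_as_dict` are hypothetical helper names, and converting the proto with `type(resource).to_dict(resource)` assumes the resource is a proto-plus message, which should be verified against the installed SDK version.

```python
from typing import Any, Dict, Tuple


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts/lists into {"a.b[0].c": leaf_value} form."""
    out: Dict[str, Any] = {}
    if isinstance(value, dict):
        for key, item in value.items():
            out.update(flatten(item, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            out.update(flatten(item, f"{prefix}[{index}]"))
    else:
        out[prefix] = value
    return out


def diff_resources(a: dict, b: dict) -> Dict[str, Tuple[Any, Any]]:
    """Return {path: (value_in_a, value_in_b)} for every leaf that differs.

    A path that is missing on one side shows up with None on that side.
    """
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        path: (flat_a.get(path), flat_b.get(path))
        for path in sorted(set(flat_a) | set(flat_b))
        if flat_a.get(path) != flat_b.get(path)
    }


def job_as_dict(resource_name: str) -> dict:
    """Fetch a training job and convert its resource proto to a dict.

    Assumption: gca_resource is a proto-plus message exposing to_dict().
    The SDK is imported lazily so the helpers above work without it.
    """
    from google.cloud import aiplatform

    resource = aiplatform.CustomContainerTrainingJob.get(resource_name).gca_resource
    return type(resource).to_dict(resource)
```

Calling `diff_resources(job_as_dict(console_job_name), job_as_dict(sdk_job_name))`, where both names are placeholders for the real resource names, would then list only the fields that differ between the two jobs.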
Hi @sasha-gitg. The problem is solved. In the create_training_pipeline_custom_container_job_sample case, I upgraded the TensorFlow image in the container to TF 2.9; in the custom_training_job_sample case, the problem resolved itself. Thank you anyway.
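Since the fix here came down to the TensorFlow version in the container, it may help to log package versions at the top of the training script so that a console run and an SDK run can be compared directly in the logs. This is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is an assumption about what is worth checking.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Printed once at startup so the values appear in the training job logs.
    print(installed_versions(["tensorflow", "google-cloud-aiplatform"]))
```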
Hi. I'm training a model using the Vertex AI Training service. Training works when I use the console ("Create" button), but when I train the model using the SDK, every epoch outputs "nan" for the training metrics. I'm using a script from this tutorial: https://codelabs.developers.google.com/codelabs/vertex-ai-custom-models#3
This is the Python script:
I'm trying to train this model using the SDK ("CustomContainerTrainingJob" and "CustomTrainingJob", google-cloud-aiplatform version 1.13.0). The logs are the same in both cases:
This is the code for each case:
And this is the log I get when using the console ("Create" button):
Any suggestions? Thanks in advance.