tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.org/tfx
Apache License 2.0

CloudTuner error #5141

Closed: deep-diver closed this issue 2 years ago

deep-diver commented 2 years ago

I have successfully initiated a Vizier job via CloudTuner, but the job failed.

I have looked through the logs, but no errors appear and the training itself finished successfully. Could you take a look at what happened? Note that the logs should be read from bottom to top.
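For reference, the study configuration that CloudTuner registered with Vizier is visible in the `tuner.py:197` log entry below. A small sketch in plain Python (the dict literal is copied from that log line) to double-check the objective and search space the study actually carries:

```python
# Study resource as reported by tuner.py:197 in the log below.
study = {
    "name": "projects/874401645461/locations/us-central1/studies/CloudTuner_study_20220813_042421",
    "studyConfig": {
        "metrics": [
            {"goal": "MAXIMIZE", "metric": "val_sparse_categorical_accuracy"}
        ],
        "parameters": [
            {
                "parameter": "learning_rate",
                "type": "DISCRETE",
                "discreteValueSpec": {"values": [0.001, 0.01]},
            }
        ],
        "automatedStoppingConfig": {
            "decayCurveStoppingConfig": {"useElapsedTime": True}
        },
    },
    "state": "ACTIVE",
    "createTime": "2022-08-13T04:24:21Z",
}

# Summarize what the tuner is optimizing and over which values.
objective = study["studyConfig"]["metrics"][0]
search_space = {
    p["parameter"]: p["discreteValueSpec"]["values"]
    for p in study["studyConfig"]["parameters"]
    if p["type"] == "DISCRETE"
}
print(objective["metric"], objective["goal"])
# val_sparse_categorical_accuracy MAXIMIZE
print(search_space)
# {'learning_rate': [0.001, 0.01]}
```

So the study itself looks well-formed (ACTIVE state, one discrete `learning_rate` parameter, maximizing `val_sparse_categorical_accuracy`), which matches the successful training output further down.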

jsonPayload.message
--
Job failed.
Finished tearing down training program.
2022/08/13 04:24:30 No id provided.
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.917382 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.917122 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.783987 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.783725 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.573800 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.573541 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.098982 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.098610 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
I0813 04:24:26.905468 139812382340928 model.py:33] _________________________________________________________________
I0813 04:24:26.905363 139812382340928 model.py:33] Non-trainable params: 23,587,712
I0813 04:24:26.905245 139812382340928 model.py:33] Trainable params: 20,490
I0813 04:24:26.905140 139812382340928 model.py:33] Total params: 23,608,202
I0813 04:24:26.900732 139812382340928 model.py:33] =================================================================
I0813 04:24:26.900615 139812382340928 model.py:33]
I0813 04:24:26.900457 139812382340928 model.py:33]  dense (Dense)               (None, 10)                20490
I0813 04:24:26.900074 139812382340928 model.py:33]
I0813 04:24:26.899939 139812382340928 model.py:33]  dropout (Dropout)           (None, 2048)              0
I0813 04:24:26.899660 139812382340928 model.py:33]
I0813 04:24:26.899552 139812382340928 model.py:33]  resnet50 (Functional)       (None, 2048)              23587712
I0813 04:24:26.895043 139812382340928 model.py:33] =================================================================
I0813 04:24:26.894924 139812382340928 model.py:33]  Layer (type)                Output Shape              Param #
I0813 04:24:26.894763 139812382340928 model.py:33] _________________________________________________________________
I0813 04:24:26.894547 139812382340928 model.py:33] Model: "sequential"
8192/94765736 [..............................] - ETA: 0s
 5955584/94765736 [>.............................] - ETA: 0s
14000128/94765736 [===>..........................] - ETA: 0s
20971520/94765736 [=====>........................] - ETA: 0s
28442624/94765736 [========>.....................] - ETA: 0s
36356096/94765736 [==========>...................] - ETA: 0s
44326912/94765736 [=============>................] - ETA: 0s
52133888/94765736 [===============>..............] - ETA: 0s
60121088/94765736 [==================>...........] - ETA: 0s
67960832/94765736 [====================>.........] - ETA: 0s
75710464/94765736 [======================>.......] - ETA: 0s
83501056/94765736 [=========================>....] - ETA: 0s
91258880/94765736 [===========================>..] - ETA: 0s
94765736/94765736 [==============================] - 1s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
 
tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of
 
2022-08-13 04:24:23.295037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10807 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7
2022-08-13 04:24:23.294182: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:23.293289: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:23.292287: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:22.735514: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:22.734559: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:22.733457: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-13 04:24:22.732728: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
I0813 04:24:21.739562 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.
Load existing study...
I0813 04:24:21.737804 139812382340928 tuner.py:197] Study already exists: projects/gcp-ml-172005/locations/us-central1/studies/CloudTuner_study_20220813_042421.
Load existing study...
INFO:tensorflow:Study already exists: projects/gcp-ml-172005/locations/us-central1/studies/CloudTuner_study_20220813_042421.
I0813 04:24:21.696875 139812382340928 tuner.py:197] {'name': 'projects/874401645461/locations/us-central1/studies/CloudTuner_study_20220813_042421', 'studyConfig': {'metrics': [{'goal': 'MAXIMIZE', 'metric': 'val_sparse_categorical_accuracy'}], 'parameters': [{'parameter': 'learning_rate', 'type': 'DISCRETE', 'discreteValueSpec': {'values': [0.001, 0.01]}}], 'automatedStoppingConfig': {'decayCurveStoppingConfig': {'useElapsedTime': True}}}, 'state': 'ACTIVE', 'createTime': '2022-08-13T04:24:21Z'}
INFO:tensorflow:{'name': 'projects/874401645461/locations/us-central1/studies/CloudTuner_study_20220813_042421', 'studyConfig': {'metrics': [{'goal': 'MAXIMIZE', 'metric': 'val_sparse_categorical_accuracy'}], 'parameters': [{'parameter': 'learning_rate', 'type': 'DISCRETE', 'discreteValueSpec': {'values': [0.001, 0.01]}}], 'automatedStoppingConfig': {'decayCurveStoppingConfig': {'useElapsedTime': True}}}, 'state': 'ACTIVE', 'createTime': '2022-08-13T04:24:21Z'}
I0813 04:24:21.171569 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.
I0813 04:24:21.171575 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.
 
tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of
I0813 04:24:21.170827 139812382340928 google_api_client.py:185]
 
tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of
I0813 04:24:21.170828 139812382340928 google_api_client.py:185]
W0813 04:24:21.157318 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
W0813 04:24:21.157322 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
W0813 04:24:21.157067 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
W0813 04:24:21.157056 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
W0813 04:24:21.156749 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
W0813 04:24:21.156746 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
I0813 04:24:21.156335 139812382340928 fn_args_utils.py:138] Evaluate on the 'eval' split when eval_args.splits is not set.
I0813 04:24:21.156317 139812382340928 fn_args_utils.py:138] Evaluate on the 'eval' split when eval_args.splits is not set.
I0813 04:24:21.156160 139812382340928 fn_args_utils.py:134] Train on the 'train' split when train_args.splits is not set.
I0813 04:24:21.156160 139812382340928 fn_args_utils.py:134] Train on the 'train' split when train_args.splits is not set.
I0813 04:24:20.723299 139812382340928 udf_utils.py:48] udf_utils.get_fn {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'} 'tuner_fn'
I0813 04:24:20.723111 139812382340928 executor.py:212] Binding chief oracle server at: 0.0.0.0:2222
I0813 04:24:20.722659 139812382340928 executor.py:200] chief_oracle() starting...
I0813 04:24:20.722256 139812382340928 udf_utils.py:48] udf_utils.get_fn {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'} 'tuner_fn'
I0813 04:24:20.722024 139812382340928 executor.py:275] Setting KERASTUNER_TUNER_ID with tfx-tuner-master-0
I0813 04:24:20.721865 139812382340928 executor.py:267] Oracle chief is known to be at: cmle-training-master-afa651e2fc-0:2222
I0813 04:24:20.720906 139812382340928 executor.py:233] Chief oracle started at PID: 16
I0813 04:24:20.710414 139812382340928 run_executor.py:155] Starting executor
I0813 04:24:20.709932 139812382340928 executor.py:332] Tuner ID is: tfx-tuner-master-0
I0813 04:24:20.709692 139812382340928 executor.py:300] Cluster spec initalized with: {'cluster': {'master': ['cmle-training-master-afa651e2fc-0:2222'], 'worker': ['cmle-training-worker-afa651e2fc-0:2222', 'cmle-training-worker-afa651e2fc-1:2222']}, 'environment': 'cloud', 'task': {'type': 'master', 'index': 0}, 'job': '{\n  "scale_tier": "CUSTOM",\n  "master_type": "n1-standard-4",\n  "worker_type": "n1-standard-4",\n  "worker_count": "2",\n  "region": "us-central1",\n  "master_config": {\n    "accelerator_config": {\n      "count": "1",\n      "type": "NVIDIA_TESLA_K80"\n    },\n    "image_uri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test",\n    "container_command": ["python", "-m", "tfx.scripts.run_executor", "--executor_class_path", "tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor", "--inputs", "{\\"transform_graph\\": [{\\"artifact\\": {\\"id\\": \\"3040439057790690801\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transform_graph\\", \\"custom_properties\\": {\\"tfx_version\\": {\\"struct_value\\": {\\"__value__\\": \\"1.9.1\\"}}}}, \\"artifact_type\\": {\\"name\\": \\"TransformGraph\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"TransformGraph\\"}], \\"examples\\": [{\\"artifact\\": {\\"id\\": \\"6958007971664455536\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transformed_examples\\", \\"properties\\": {\\"split_names\\": {\\"string_value\\": \\"[\\\\\\"eval\\\\\\", \\\\\\"train\\\\\\"]\\"}}, \\"custom_properties\\": {\\"tfx_version\\": {\\"struct_value\\": {\\"__value__\\": \\"1.9.1\\"}}}}, \\"artifact_type\\": {\\"name\\": \\"Examples\\", \\"properties\\": {\\"span\\": 
\\"INT\\", \\"split_names\\": \\"STRING\\", \\"version\\": \\"INT\\"}, \\"base_type\\": \\"DATASET\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"Examples\\"}]}", "--outputs", "{\\"best_hyperparameters\\": [{\\"artifact\\": {\\"id\\": \\"3312416091851715625\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Tuner_-7035895302661865472/best_hyperparameters\\"}, \\"artifact_type\\": {\\"name\\": \\"HyperParameters\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"HyperParameters\\"}]}", "--exec-properties", "{\\"custom_config\\": \\"{\\\\\\"ai_platform_tuning_args\\\\\\": {\\\\\\"masterConfig\\\\\\": {\\\\\\"acceleratorConfig\\\\\\": {\\\\\\"count\\\\\\": 1, \\\\\\"type\\\\\\": \\\\\\"NVIDIA_TESLA_K80\\\\\\"}, \\\\\\"imageUri\\\\\\": \\\\\\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\\\\\"}, \\\\\\"masterType\\\\\\": \\\\\\"n1-standard-4\\\\\\", \\\\\\"project\\\\\\": \\\\\\"gcp-ml-172005\\\\\\", \\\\\\"region\\\\\\": \\\\\\"us-central1\\\\\\", \\\\\\"scaleTier\\\\\\": \\\\\\"CUSTOM\\\\\\", \\\\\\"serviceAccount\\\\\\": \\\\\\"vizier@gcp-ml-172005.iam.gserviceaccount.com\\\\\\", \\\\\\"workerConfig\\\\\\": {\\\\\\"acceleratorConfig\\\\\\": {\\\\\\"count\\\\\\": 1, \\\\\\"type\\\\\\": \\\\\\"NVIDIA_TESLA_K80\\\\\\"}, \\\\\\"imageUri\\\\\\": \\\\\\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\\\\\"}, \\\\\\"workerCount\\\\\\": 3, \\\\\\"workerType\\\\\\": \\\\\\"n1-standard-4\\\\\\"}, \\\\\\"remote_trials_working_dir\\\\\\": \\\\\\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials\\\\\\"}\\", \\"eval_args\\": \\"{\\\\n  \\\\\\"num_steps\\\\\\": 4\\\\n}\\", \\"train_args\\": \\"{\\\\n  \\\\\\"num_steps\\\\\\": 160\\\\n}\\", \\"tune_args\\": \\"{\\\\n  
\\\\\\"num_parallel_trials\\\\\\": 3\\\\n}\\", \\"tuner_fn\\": \\"models.model.cloud_tuner_fn\\"}"]\n  },\n  "worker_config": {\n    "accelerator_config": {\n      "count": "1",\n      "type": "NVIDIA_TESLA_K80"\n    },\n    "image_uri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"\n  },\n  "service_account": "vizier@gcp-ml-172005.iam.gserviceaccount.com"\n}'}
I0813 04:24:20.709398 139812382340928 executor.py:292] Initializing cluster spec...
 
 
I0813 04:24:16.823370 139812382340928 executor.py:43] tensorflow_text is not available: No module named 'tensorflow_text'
I0813 04:24:16.796857 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.796637 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.795921 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.795674 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.784358 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.783092 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.782919 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.781281 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.780695 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.780493 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.779530 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.779021 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.415254 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.415065 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.414037 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.413864 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.412977 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.412753 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.412003 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.411813 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.410855 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.410645 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.409852 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.409629 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.408295 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.408134 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.407290 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.407044 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.277025 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.276487 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.275387 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.274608 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.274452 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.211828 139812382340928 model_util.py:68] struct2tensor is not available: No module named 'struct2tensor'
I0813 04:24:16.211477 139812382340928 model_util.py:63] tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
I0813 04:24:16.211113 139812382340928 model_util.py:58] tensorflow_text is not available: No module named 'tensorflow_text'
I0813 04:24:16.210595 139812382340928 model_util.py:53] tensorflow_ranking is not available: No module named 'tensorflow_ranking'
I0813 04:24:16.210203 139812382340928 model_util.py:44] imported tensorflow_io
I0813 04:24:15.824912 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:15.824463 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
)]}, exec_properties: {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'}
, artifact_type: name: "HyperParameters"
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Tuner_-7035895302661865472/best_hyperparameters"
)]}, outputs: {'best_hyperparameters': [Artifact(artifact: id: 3312416091851715625
base_type: DATASET
}
value: INT
key: "version"
properties {
}
value: STRING
key: "split_names"
properties {
}
value: INT
key: "span"
properties {
, artifact_type: name: "Examples"
}
}
}
}
}
string_value: "1.9.1"
value {
key: "__value__"
fields {
struct_value {
value {
key: "tfx_version"
custom_properties {
}
}
string_value: "[\"eval\", \"train\"]"
value {
key: "split_names"
properties {
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transformed_examples"
)], 'examples': [Artifact(artifact: id: 6958007971664455536
, artifact_type: name: "TransformGraph"
}
}
}
}
}
string_value: "1.9.1"
value {
key: "__value__"
fields {
struct_value {
value {
key: "tfx_version"
custom_properties {
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transform_graph"
I0813 04:24:15.720419 139812382340928 run_executor.py:141] Executor tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor do: inputs: {'transform_graph': [Artifact(artifact: id: 3040439057790690801
2022-08-13 04:24:15.695750: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:15.694653: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:15.510332: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022/08/13 04:24:10 No id provided.
File system has been successfully mounted.
Mounting file system "gcsfuse"...
Opening GCS connection...
File system has been successfully mounted.
Mounting file system "gcsfuse"...
Opening GCS connection...
File system has been successfully mounted.
Mounting file system "gcsfuse"...
Opening GCS connection...
 
 
 
Job tfx_tuner_20220813041519 is queued.
 
Job creation request has been successfully validated.

jsonPayload.message
Job failed.
Finished tearing down training program.
2022/08/13 04:25:56 No id provided.
2022/08/13 04:25:46 No id provided.
2022/08/13 04:25:01 No id provided.
2022/08/13 04:24:54 No id provided.
2022/08/13 04:24:30 No id provided.
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.917382 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.917122 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.783987 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.783725 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.573800 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.573541 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
. Setting to DenseTensor.
}
size: 1
I0813 04:24:27.098982 139812382340928 tensor_representation_util.py:347] Feature label_xf has a shape dim {
. Setting to DenseTensor.
}
size: 3
dim {
}
size: 224
dim {
}
size: 224
I0813 04:24:27.098610 139812382340928 tensor_representation_util.py:347] Feature image_xf has a shape dim {
I0813 04:24:26.905468 139812382340928 model.py:33] _________________________________________________________________
I0813 04:24:26.905363 139812382340928 model.py:33] Non-trainable params: 23,587,712
I0813 04:24:26.905245 139812382340928 model.py:33] Trainable params: 20,490
I0813 04:24:26.905140 139812382340928 model.py:33] Total params: 23,608,202
I0813 04:24:26.900732 139812382340928 model.py:33] =================================================================
I0813 04:24:26.900615 139812382340928 model.py:33]
I0813 04:24:26.900457 139812382340928 model.py:33]  dense (Dense)               (None, 10)                20490
I0813 04:24:26.900074 139812382340928 model.py:33]
I0813 04:24:26.899939 139812382340928 model.py:33]  dropout (Dropout)           (None, 2048)              0
I0813 04:24:26.899660 139812382340928 model.py:33]
I0813 04:24:26.899552 139812382340928 model.py:33]  resnet50 (Functional)       (None, 2048)              23587712
I0813 04:24:26.895043 139812382340928 model.py:33] =================================================================
I0813 04:24:26.894924 139812382340928 model.py:33]  Layer (type)                Output Shape              Param #
I0813 04:24:26.894763 139812382340928 model.py:33] _________________________________________________________________
I0813 04:24:26.894547 139812382340928 model.py:33] Model: "sequential"
8192/94765736 [..............................] - ETA: 0s
 5955584/94765736 [>.............................] - ETA: 0s
14000128/94765736 [===>..........................] - ETA: 0s
20971520/94765736 [=====>........................] - ETA: 0s
28442624/94765736 [========>.....................] - ETA: 0s
36356096/94765736 [==========>...................] - ETA: 0s
44326912/94765736 [=============>................] - ETA: 0s
52133888/94765736 [===============>..............] - ETA: 0s
60121088/94765736 [==================>...........] - ETA: 0s
67960832/94765736 [====================>.........] - ETA: 0s
75710464/94765736 [======================>.......] - ETA: 0s
83501056/94765736 [=========================>....] - ETA: 0s
91258880/94765736 [===========================>..] - ETA: 0s
94765736/94765736 [==============================] - 1s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5

tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of

2022-08-13 04:24:23.295037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10807 MB memory:  -> device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7
2022-08-13 04:24:23.294182: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-13 04:24:22.732728: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
I0813 04:24:21.739562 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.
Load existing study...
I0813 04:24:21.737804 139812382340928 tuner.py:197] Study already exists: projects/gcp-ml-172005/locations/us-central1/studies/CloudTuner_study_20220813_042421.
Load existing study...
INFO:tensorflow:Study already exists: projects/gcp-ml-172005/locations/us-central1/studies/CloudTuner_study_20220813_042421.
I0813 04:24:21.696875 139812382340928 tuner.py:197] {'name': 'projects/874401645461/locations/us-central1/studies/CloudTuner_study_20220813_042421', 'studyConfig': {'metrics': [{'goal': 'MAXIMIZE', 'metric': 'val_sparse_categorical_accuracy'}], 'parameters': [{'parameter': 'learning_rate', 'type': 'DISCRETE', 'discreteValueSpec': {'values': [0.001, 0.01]}}], 'automatedStoppingConfig': {'decayCurveStoppingConfig': {'useElapsedTime': True}}}, 'state': 'ACTIVE', 'createTime': '2022-08-13T04:24:21Z'}
INFO:tensorflow:{'name': 'projects/874401645461/locations/us-central1/studies/CloudTuner_study_20220813_042421', 'studyConfig': {'metrics': [{'goal': 'MAXIMIZE', 'metric': 'val_sparse_categorical_accuracy'}], 'parameters': [{'parameter': 'learning_rate', 'type': 'DISCRETE', 'discreteValueSpec': {'values': [0.001, 0.01]}}], 'automatedStoppingConfig': {'decayCurveStoppingConfig': {'useElapsedTime': True}}}, 'state': 'ACTIVE', 'createTime': '2022-08-13T04:24:21Z'}
I0813 04:24:21.171569 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.
I0813 04:24:21.171575 139812382340928 google_api_client.py:132] Detected running in DL_CONTAINER environment.

tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of
I0813 04:24:21.170827 139812382340928 google_api_client.py:185]

tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
to opt-out, you may do so by running
please refer to https://policies.google.com/privacy. If you wish
Cloud Services in accordance with Google privacy policy, for more information
This application reports technical and operational details of your usage of
I0813 04:24:21.170828 139812382340928 google_api_client.py:185]
W0813 04:24:21.157318 139812382340928 examples_utils.py:50] Examples artifact does not have payload_format custom property. Falling back to FORMAT_TF_EXAMPLE
I0813 04:24:21.156335 139812382340928 fn_args_utils.py:138] Evaluate on the 'eval' split when eval_args.splits is not set.
I0813 04:24:21.156317 139812382340928 fn_args_utils.py:138] Evaluate on the 'eval' split when eval_args.splits is not set.
I0813 04:24:21.156160 139812382340928 fn_args_utils.py:134] Train on the 'train' split when train_args.splits is not set.
I0813 04:24:21.156160 139812382340928 fn_args_utils.py:134] Train on the 'train' split when train_args.splits is not set.
I0813 04:24:20.723299 139812382340928 udf_utils.py:48] udf_utils.get_fn {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'} 'tuner_fn'
I0813 04:24:20.723111 139812382340928 executor.py:212] Binding chief oracle server at: 0.0.0.0:2222
I0813 04:24:20.722659 139812382340928 executor.py:200] chief_oracle() starting...
I0813 04:24:20.722256 139812382340928 udf_utils.py:48] udf_utils.get_fn {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'} 'tuner_fn'
I0813 04:24:20.722024 139812382340928 executor.py:275] Setting KERASTUNER_TUNER_ID with tfx-tuner-master-0
I0813 04:24:20.721865 139812382340928 executor.py:267] Oracle chief is known to be at: cmle-training-master-afa651e2fc-0:2222
I0813 04:24:20.720906 139812382340928 executor.py:233] Chief oracle started at PID: 16
I0813 04:24:20.710414 139812382340928 run_executor.py:155] Starting executor
I0813 04:24:20.709932 139812382340928 executor.py:332] Tuner ID is: tfx-tuner-master-0
I0813 04:24:20.709692 139812382340928 executor.py:300] Cluster spec initalized with: {'cluster': {'master': ['cmle-training-master-afa651e2fc-0:2222'], 'worker': ['cmle-training-worker-afa651e2fc-0:2222', 'cmle-training-worker-afa651e2fc-1:2222']}, 'environment': 'cloud', 'task': {'type': 'master', 'index': 0}, 'job': '{\n  "scale_tier": "CUSTOM",\n  "master_type": "n1-standard-4",\n  "worker_type": "n1-standard-4",\n  "worker_count": "2",\n  "region": "us-central1",\n  "master_config": {\n    "accelerator_config": {\n      "count": "1",\n      "type": "NVIDIA_TESLA_K80"\n    },\n    "image_uri": "[gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test](http://gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test)",\n    "container_command": ["python", "-m", "tfx.scripts.run_executor", "--executor_class_path", "tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor", "--inputs", "{\\"transform_graph\\": [{\\"artifact\\": {\\"id\\": \\"3040439057790690801\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transform_graph\\", \\"custom_properties\\": {\\"tfx_version\\": {\\"struct_value\\": {\\"__value__\\": \\"1.9.1\\"}}}}, \\"artifact_type\\": {\\"name\\": \\"TransformGraph\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"TransformGraph\\"}], \\"examples\\": [{\\"artifact\\": {\\"id\\": \\"6958007971664455536\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transformed_examples\\", \\"properties\\": {\\"split_names\\": {\\"string_value\\": \\"[\\\\\\"eval\\\\\\", \\\\\\"train\\\\\\"]\\"}}, \\"custom_properties\\": {\\"tfx_version\\": {\\"struct_value\\": {\\"__value__\\": \\"1.9.1\\"}}}}, 
\\"artifact_type\\": {\\"name\\": \\"Examples\\", \\"properties\\": {\\"span\\": \\"INT\\", \\"split_names\\": \\"STRING\\", \\"version\\": \\"INT\\"}, \\"base_type\\": \\"DATASET\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"Examples\\"}]}", "--outputs", "{\\"best_hyperparameters\\": [{\\"artifact\\": {\\"id\\": \\"3312416091851715625\\", \\"uri\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Tuner_-7035895302661865472/best_hyperparameters\\"}, \\"artifact_type\\": {\\"name\\": \\"HyperParameters\\"}, \\"__artifact_class_module__\\": \\"tfx.types.standard_artifacts\\", \\"__artifact_class_name__\\": \\"HyperParameters\\"}]}", "--exec-properties", "{\\"custom_config\\": \\"{\\\\\\"ai_platform_tuning_args\\\\\\": {\\\\\\"masterConfig\\\\\\": {\\\\\\"acceleratorConfig\\\\\\": {\\\\\\"count\\\\\\": 1, \\\\\\"type\\\\\\": \\\\\\"NVIDIA_TESLA_K80\\\\\\"}, \\\\\\"imageUri\\\\\\": \\\\\\"[gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\\\\\](http://gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test%5C%5C%5C%5C%5C%5C)"}, \\\\\\"masterType\\\\\\": \\\\\\"n1-standard-4\\\\\\", \\\\\\"project\\\\\\": \\\\\\"gcp-ml-172005\\\\\\", \\\\\\"region\\\\\\": \\\\\\"us-central1\\\\\\", \\\\\\"scaleTier\\\\\\": \\\\\\"CUSTOM\\\\\\", \\\\\\"serviceAccount\\\\\\": \\\\\\"[vizier@gcp-ml-172005.iam.gserviceaccount.com](mailto:vizier@gcp-ml-172005.iam.gserviceaccount.com)\\\\\\", \\\\\\"workerConfig\\\\\\": {\\\\\\"acceleratorConfig\\\\\\": {\\\\\\"count\\\\\\": 1, \\\\\\"type\\\\\\": \\\\\\"NVIDIA_TESLA_K80\\\\\\"}, \\\\\\"imageUri\\\\\\": \\\\\\"[gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\\\\\](http://gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test%5C%5C%5C%5C%5C%5C)"}, \\\\\\"workerCount\\\\\\": 3, \\\\\\"workerType\\\\\\": \\\\\\"n1-standard-4\\\\\\"}, \\\\\\"remote_trials_working_dir\\\\\\": 
\\\\\\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials\\\\\\"}\\", \\"eval_args\\": \\"{\\\\n  \\\\\\"num_steps\\\\\\": 4\\\\n}\\", \\"train_args\\": \\"{\\\\n  \\\\\\"num_steps\\\\\\": 160\\\\n}\\", \\"tune_args\\": \\"{\\\\n  \\\\\\"num_parallel_trials\\\\\\": 3\\\\n}\\", \\"tuner_fn\\": \\"models.model.cloud_tuner_fn\\"}"]\n  },\n  "worker_config": {\n    "accelerator_config": {\n      "count": "1",\n      "type": "NVIDIA_TESLA_K80"\n    },\n    "image_uri": "[gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test](http://gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test)"\n  },\n  "service_account": "[vizier@gcp-ml-172005.iam.gserviceaccount.com](mailto:vizier@gcp-ml-172005.iam.gserviceaccount.com)"\n}'}
I0813 04:24:20.709398 139812382340928 executor.py:292] Initializing cluster spec...

I0813 04:24:16.823370 139812382340928 executor.py:43] tensorflow_text is not available: No module named 'tensorflow_text'
I0813 04:24:16.796857 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:16.211828 139812382340928 model_util.py:68] struct2tensor is not available: No module named 'struct2tensor'
I0813 04:24:16.211477 139812382340928 model_util.py:63] tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
I0813 04:24:16.211113 139812382340928 model_util.py:58] tensorflow_text is not available: No module named 'tensorflow_text'
I0813 04:24:16.210595 139812382340928 model_util.py:53] tensorflow_ranking is not available: No module named 'tensorflow_ranking'
I0813 04:24:16.210203 139812382340928 model_util.py:44] imported tensorflow_io
I0813 04:24:15.824912 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
I0813 04:24:15.824463 139812382340928 native_type_compatibility.py:250] Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
)]}, exec_properties: {'custom_config': '{"ai_platform_tuning_args": {"masterConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "masterType": "n1-standard-4", "project": "gcp-ml-172005", "region": "us-central1", "scaleTier": "CUSTOM", "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com", "workerConfig": {"acceleratorConfig": {"count": 1, "type": "NVIDIA_TESLA_K80"}, "imageUri": "gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test"}, "workerCount": 3, "workerType": "n1-standard-4"}, "remote_trials_working_dir": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials"}', 'eval_args': '{\n  "num_steps": 4\n}', 'train_args': '{\n  "num_steps": 160\n}', 'tune_args': '{\n  "num_parallel_trials": 3\n}', 'tuner_fn': 'models.model.cloud_tuner_fn'}
, artifact_type: name: "HyperParameters"
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Tuner_-7035895302661865472/best_hyperparameters"
)]}, outputs: {'best_hyperparameters': [Artifact(artifact: id: 3312416091851715625
base_type: DATASET
}
  value: INT
  key: "version"
properties {
}
  value: STRING
  key: "split_names"
properties {
}
  value: INT
  key: "span"
properties {
, artifact_type: name: "Examples"
}
  }
    }
      }
        }
string_value: "1.9.1"
        value {
        key: "__value__"
      fields {
    struct_value {
  value {
  key: "tfx_version"
custom_properties {
}
  }
    string_value: "[\"eval\", \"train\"]"
  value {
  key: "split_names"
properties {
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transformed_examples"
)], 'examples': [Artifact(artifact: id: 6958007971664455536
, artifact_type: name: "TransformGraph"
}
  }
    }
      }
        }
string_value: "1.9.1"
        value {
        key: "__value__"
      fields {
    struct_value {
  value {
  key: "tfx_version"
custom_properties {
uri: "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220813040932/Transform_2187476734192910336/transform_graph"
I0813 04:24:15.720419 139812382340928 run_executor.py:141] Executor tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor do: inputs: {'transform_graph': [Artifact(artifact: id: 3040439057790690801
2022-08-13 04:24:15.695750: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:15.694653: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-08-13 04:24:15.510332: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022/08/13 04:24:10 No id provided.
"File system has been successfully mounted."
"Mounting file system "gcsfuse"..."
"Opening GCS connection..."
"File system has been successfully mounted."
"Mounting file system "gcsfuse"..."
"Opening GCS connection..."
"File system has been successfully mounted."
"Mounting file system "gcsfuse"..."
"Opening GCS connection..."

Job tfx_tuner_20220813041519 is queued.

Job creation request has been successfully validated.

1025KB commented 2 years ago

Hi, are you using CAIP or Vertex? CloudTuner currently only supports CAIP.

And how many parallel tuning trials do you have?

deep-diver commented 2 years ago

@1025KB

I am using CAIP, not Vertex; that is, I call CloudTuner from a Vertex Pipeline. max_trials and workerCount are set to 6 and 3, respectively.

Are you saying that I can't integrate CloudTuner within a Vertex Pipeline?

1025KB commented 2 years ago

If you run a standalone CloudTuner, does it work? Or if you run the Cloud tuning component with a RandomSearch tuner, does that work?

deep-diver commented 2 years ago

Thank you!

I have not tried either the standalone or the RandomSearch version. Could you please drop links that I can take a look at?

Here is the repo that I am working on by the way: https://github.com/deep-diver/complete-mlops-system-workflow/tree/fix/cloud-tuner/training_pipeline/pipeline

deep-diver commented 2 years ago

@1025KB

Oh, you mean using KerasTuner instead of CloudTuner? If so, yes, it worked fine with tfx run create --engine=local.

1025KB commented 2 years ago

On Cloud (KubeflowDagRunner + extension.Tuner) you can also just use KerasTuner, e.g., RandomSearch in your tuner_fn. I want to know whether your workflow had an issue with CloudTuner or with some other part of the workflow.

deep-diver commented 2 years ago

@1025KB

OK, I just tried out KerasTuner with the extension.Tuner component, and it didn't work out.

It seems like the situation is the same. I see the following in the AI Platform Job logs dashboard:

Best val_sparse_categorical_accuracy So Far: 0.140625
Total elapsed time: 00h 00m 22s
Results summary
Results in /tmp/img_classification_tuning

and I get the following message from the pod in the Vertex Pipeline:

File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/__init__.py", line 44, in autodetect
  from . import file_cache
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
  "file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Error I0815 04:39:08.493447 140048071423808 training_clients.py:262] TrainingJob={'job_id': 'tfx_tuner_20220815043908', 'training_input': {'masterConfig': {'acceleratorConfig': {'count': 1, 'type': 'NVIDIA_TESLA_K80'}, 'imageUri': 'gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test', 'containerCommand': ['python', '-m', 'tfx.scripts.run_executor', '--executor_class_path', 'tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor', '--inputs', '{"examples": [{"artifact": {"id": "1729733612679284927", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Transform_3489170958130872320/transformed_examples", "properties": {"split_names": {"string_value": "[\\"eval\\", \\"train\\"]"}}, "custom_properties": {"tfx_version": {"struct_value": {"__value__": "1.9.1"}}}}, "artifact_type": {"name": "Examples", "properties": {"span": "INT", "version": "INT", "split_names": "STRING"}, "base_type": "DATASET"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "Examples"}], "transform_graph": [{"artifact": {"id": "1437506892152175304", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Transform_3489170958130872320/transform_graph", "custom_properties": {"tfx_version": {"struct_value": {"__value__": "1.9.1"}}}}, "artifact_type": {"name": "TransformGraph"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "TransformGraph"}]}', '--outputs', '{"best_hyperparameters": [{"artifact": {"id": "8185408808084326621", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Tuner_-5734201078723903488/best_hyperparameters"}, "artifact_type": {"name": "HyperParameters"}, "__artifact_class_module__": 
"tfx.types.standard_artifacts", "__artifact_class_name__": "HyperParameters"}]}', '--exec-properties', '{"custom_config": "{\\"ai_platform_tuning_args\\": {\\"masterConfig\\": {\\"acceleratorConfig\\": {\\"count\\": 1, \\"type\\": \\"NVIDIA_TESLA_K80\\"}, \\"imageUri\\": \\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\"}, \\"masterType\\": \\"n1-standard-4\\", \\"project\\": \\"gcp-ml-172005\\", \\"region\\": \\"us-central1\\", \\"scaleTier\\": \\"CUSTOM\\", \\"serviceAccount\\": \\"vizier@gcp-ml-172005.iam.gserviceaccount.com\\", \\"workerConfig\\": {\\"acceleratorConfig\\": {\\"count\\": 1, \\"type\\": \\"NVIDIA_TESLA_K80\\"}, \\"imageUri\\": \\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\"}, \\"workerCount\\": 3, \\"workerType\\": \\"n1-standard-4\\"}, \\"remote_trials_working_dir\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials\\"}", "eval_args": "{\\n \\"num_steps\\": 4\\n}", "train_args": "{\\n \\"num_steps\\": 160\\n}", "tune_args": "{\\n \\"num_parallel_trials\\": 3\\n}", "tuner_fn": "models.model.tuner_fn"}']}, 'masterType': 'n1-standard-4', 'region': 'us-central1', 'scaleTier': 'CUSTOM', 'serviceAccount': 'vizier@gcp-ml-172005.iam.gserviceaccount.com', 'workerConfig': {'acceleratorConfig': {'count': 1, 'type': 'NVIDIA_TESLA_K80'}, 'imageUri': 'gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test'}, 'workerCount': 2, 'workerType': 'n1-standard-4'}, 'labels': {'tfx_version': '1-9-1', 'tfx_py_version': '3-7', 'tfx_executor': 'tfx-extensions-google_cloud_ai_platform-tuner-executor-_workere'}}
Error I0815 04:39:08.493797 140048071423808 training_clients.py:264] Submitting job='tfx_tuner_20220815043908', project='projects/gcp-ml-172005' to AI Platform.
Info Finished tearing down training program.
Info Job failed.

I see an ImportError related to oauth2client. I am using the tfx 1.9.1 Docker image as the base image. If this were the root cause, I would expect the extensions.Trainer component to fail too, but it didn't.

1025KB commented 2 years ago

Are you able to run the trainer in CAIP? Here is the tutorial.

For Vertex, this is the tutorial.

deep-diver commented 2 years ago

@1025KB

I see. My bad for describing the problem incorrectly.

I used the Trainer component with Vertex AI, and the Tuner component with CAIP. The Trainer component works without any failure, but the Tuner component fails.

Is this because I tried to hook up Vertex AI and CAIP together? If so, does TFX not support the Tuner component within a Vertex Pipeline?

1025KB commented 2 years ago

Are you able to run the pipeline & Trainer component with CAIP? I'm wondering if it's your CAIP setup.

deep-diver commented 2 years ago

Not sure. I am not using CAIP directly, but I included the CloudTuner component in Vertex, which uses CAIP.

1025KB commented 2 years ago

You can run extension.Tuner with KerasTuner in your tuner_fn on Vertex.

Only CloudTuner in tuner_fn requires CAIP.

deep-diver commented 2 years ago

Yeah.

But KerasTuner failed with the logs that I shared here: https://github.com/tensorflow/tfx/issues/5141#issuecomment-1214618023

It worked successfully with the local engine, though.

1025KB commented 2 years ago

Are you using CAIP or Vertex? If it's KerasTuner, you can use Vertex (with a custom_config similar to the trainer's).

deep-diver commented 2 years ago

Let me clarify,

I am currently using Vertex Pipeline.

My initial attempt was to include CloudTuner in the Vertex Pipeline, and it failed.

My second attempt was to include KerasTuner in the Vertex Pipeline, and it also failed (logs). Here is the source code for using KerasTuner, and the full source code is here:

### Tuner Args
GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-4",
        "masterConfig": {
            "imageUri": PIPELINE_IMAGE,
            "acceleratorConfig": {
                "count": 1,
                "type": "NVIDIA_TESLA_K80",
            },
        },
        "workerType": "n1-standard-4",
        "workerCount": 3,
        "workerConfig": {
            "imageUri": PIPELINE_IMAGE,
            "acceleratorConfig": {
                "count": 1,
                "type": "NVIDIA_TESLA_K80",
            },
        },
        "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com",
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
}

### Pipeline
from tfx.extensions.google_cloud_ai_platform.tuner.component import Tuner

tuner = Tuner(
    tuner_fn=modules["cloud_tuner_fn"],
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=train_args,
    eval_args=eval_args,
    tune_args=tuner_args,
    custom_config=ai_platform_tuner_args,
)

### modules["cloud_tuner_fn"] which is "tuner_fn"

def tuner_fn(fn_args: FnArgs) -> TunerFnResult:
    steps_per_epoch = int(_TRAIN_DATA_SIZE / _TRAIN_BATCH_SIZE)

    tuner = keras_tuner.RandomSearch(
        _build_keras_model,
        max_trials=6,
        hyperparameters=_get_hyperparameters(),
        allow_new_entries=False,
        objective=keras_tuner.Objective("val_sparse_categorical_accuracy", "max"),
        directory=fn_args.working_dir,
        project_name="img_classification_tuning",
    )

    tf_transform_output = tft.TFTransformOutput(fn_args.transform_graph_path)

    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=True,
        batch_size=_TRAIN_BATCH_SIZE,
    )

    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=False,
        batch_size=_EVAL_BATCH_SIZE,
    )

    return TunerFnResult(
        tuner=tuner,
        fit_kwargs={
            "x": train_dataset,
            "validation_data": eval_dataset,
            "steps_per_epoch": steps_per_epoch,
            "validation_steps": fn_args.eval_steps,
        },
    )

deep-diver commented 2 years ago

Maybe I should configure GCP_AI_PLATFORM_TUNER_ARGS differently so it doesn't use CAIP. When using KerasTuner, where does the job actually run? CAIP or Vertex?

1025KB commented 2 years ago

If you set up the custom_config to use Vertex, it will run KerasTuner on Vertex.

deep-diver commented 2 years ago

So the custom_config should be set similarly to what is applied to Vertex Training?

1025KB commented 2 years ago

Yep, you need to add ENABLE_VERTEX_KEY & VERTEX_REGION_KEY in addition to TUNING_ARGS_KEY and REMOTE_TRIALS_WORKING_DIR_KEY.
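As a rough illustration of that answer, here is a minimal sketch of the resulting custom_config shape. The UPPER_CASE names below are placeholder stand-ins for the real string constants exported by tfx.extensions.google_cloud_ai_platform and its tuner executor module, and the project/region/bucket values are made up; import the real constants in an actual pipeline.

```python
# Placeholders standing in for the real TFX constants (assumption: their
# actual string values differ; use the imported constants in real code).
ENABLE_VERTEX_KEY = "ENABLE_VERTEX_KEY"
VERTEX_REGION_KEY = "VERTEX_REGION_KEY"
TUNING_ARGS_KEY = "TUNING_ARGS_KEY"
REMOTE_TRIALS_WORKING_DIR_KEY = "REMOTE_TRIALS_WORKING_DIR_KEY"

custom_config = {
    ENABLE_VERTEX_KEY: True,                      # route the job to Vertex
    VERTEX_REGION_KEY: "us-central1",             # placeholder region
    TUNING_ARGS_KEY: {"project": "my-project"},   # Vertex job args (placeholder)
    REMOTE_TRIALS_WORKING_DIR_KEY: "gs://my-bucket/trials",  # placeholder GCS path
}

# All four keys must be present for the Vertex tuner path.
required = {ENABLE_VERTEX_KEY, VERTEX_REGION_KEY,
            TUNING_ARGS_KEY, REMOTE_TRIALS_WORKING_DIR_KEY}
print(required <= custom_config.keys())  # True
```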

deep-diver commented 2 years ago

Great, thanks!

I will let you know how it goes.

deep-diver commented 2 years ago

@1025KB

This config works for Vertex Training (if TUNING_ARGS_KEY is replaced with TRAINING_ARGS_KEY), but it failed for KerasTuner on Vertex. Could you please take a look?

In particular, it complains about KeyError: 'job_spec':

File "/opt/conda/lib/python3.7/site-packages/tfx/extensions/google_cloud_ai_platform/tuner/executor.py", line 121, in Do
  worker_pool_specs = training_inputs['job_spec'].get('worker_pool_specs')
KeyError: 'job_spec'
import tfx.extensions.google_cloud_ai_platform.constants as vertex_const
import tfx.extensions.google_cloud_ai_platform.tuner.executor as vertex_tuner_const

GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "n1-standard-4",
                    "accelerator_type": "NVIDIA_TESLA_K80",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": PIPELINE_IMAGE,
                },
            }
        ],
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
    "use_gpu": True,
}
deep-diver commented 2 years ago

Never mind, I figured it out! :)

GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        # "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com",
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": "NVIDIA_TESLA_K80",
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": PIPELINE_IMAGE,
                    },
                }
            ],
        },
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
    "use_gpu": True,
}
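The only structural difference from the failing config is that worker_pool_specs is nested under "job_spec", which is exactly what the executor line that raised the KeyError reads (training_inputs['job_spec'].get('worker_pool_specs')). A tiny, hypothetical pre-flight check (not part of TFX) sketching that shape requirement:

```python
def has_vertex_job_spec(tuning_args: dict) -> bool:
    """Return True if worker_pool_specs are nested under 'job_spec',
    the shape the Vertex tuner executor expects."""
    job_spec = tuning_args.get("job_spec")
    return isinstance(job_spec, dict) and "worker_pool_specs" in job_spec

# Failing shape: worker_pool_specs at the top level of the tuning args.
flat = {"project": "my-project", "worker_pool_specs": [{"replica_count": 1}]}
# Working shape: worker_pool_specs nested under job_spec.
nested = {"project": "my-project",
          "job_spec": {"worker_pool_specs": [{"replica_count": 1}]}}

print(has_vertex_job_spec(flat))    # False -> executor raises KeyError
print(has_vertex_job_spec(nested))  # True
```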

Thanks for the great support @1025KB

google-ml-butler[bot] commented 2 years ago

Are you satisfied with the resolution of your issue?

1025KB commented 2 years ago

Cool!