Hi, are you using CAIP or Vertex? This CloudTuner currently only supports CAIP.
And how many parallel tuning trials do you have?
@1025KB
I am using CAIP, not Vertex. I mean I call CloudTuner from a Vertex Pipeline, and max_trials and workerCount are set to 6 and 3, respectively.
Are you saying that I can't integrate CloudTuner within a Vertex Pipeline?
If you run a standalone CloudTuner, does it work? Or if you run the Cloud tuning component with a RandomSearch tuner, does it work?
Thank you!
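For reference, a standalone check (outside TFX entirely) can look like the sketch below; the tiny model and random data are throwaway placeholders rather than anything from an actual pipeline, but if this fails too, the problem is in the tuner setup rather than in the pipeline wiring.

```python
# Minimal standalone KerasTuner sanity check, independent of TFX.
# The model and data here are placeholders.
import keras_tuner
import tensorflow as tf


def build_model(hp):
    # Stand-in hypermodel with a single tunable hyperparameter.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["sparse_categorical_accuracy"],
    )
    return model


tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_sparse_categorical_accuracy",
    max_trials=2,
    directory="/tmp/standalone_tuning",
    project_name="sanity_check",
)

# Random inputs: 64 examples with 8 features, integer labels in [0, 10).
x = tf.random.uniform((64, 8))
y = tf.random.uniform((64,), maxval=10, dtype=tf.int32)
tuner.search(x, y, epochs=1, validation_split=0.2)
```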
I have not tried either the standalone or the RandomSearch version. Could you please drop links that I can take a look at?
Here is the repo that I am working on by the way: https://github.com/deep-diver/complete-mlops-system-workflow/tree/fix/cloud-tuner/training_pipeline/pipeline
@1025KB
Oh, you mean using KerasTuner instead of CloudTuner? If so, yes, it worked fine with tfx run create --engine=local.
On Cloud (KubeflowDagRunner + extension.Tuner) you can also just use KerasTuner, e.g., RandomSearch, in your tuner_fn. I want to know whether your workflow had an issue with CloudTuner or with another part of the workflow.
@1025KB
OK, I just tried out KerasTuner with the extension.Tuner component, and it didn't work out.
The situation seems to be the same. I see the following in the AI Platform Job logs dashboard:
```
Best val_sparse_categorical_accuracy So Far: 0.140625
Total elapsed time: 00h 00m 22s
Results summary
Results in /tmp/img_classification_tuning
```
and I get the following messages from the pod in the Vertex Pipeline:
```
Error File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/__init__.py", line 44, in autodetect
Error from . import file_cache
Error File "/opt/conda/lib/python3.7/site-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
Error "file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth"
Error ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Error I0815 04:39:08.493447 140048071423808 training_clients.py:262] TrainingJob={'job_id': 'tfx_tuner_20220815043908', 'training_input': {'masterConfig': {'acceleratorConfig': {'count': 1, 'type': 'NVIDIA_TESLA_K80'}, 'imageUri': 'gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test', 'containerCommand': ['python', '-m', 'tfx.scripts.run_executor', '--executor_class_path', 'tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor', '--inputs', '{"examples": [{"artifact": {"id": "1729733612679284927", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Transform_3489170958130872320/transformed_examples", "properties": {"split_names": {"string_value": "[\\"eval\\", \\"train\\"]"}}, "custom_properties": {"tfx_version": {"struct_value": {"__value__": "1.9.1"}}}}, "artifact_type": {"name": "Examples", "properties": {"span": "INT", "version": "INT", "split_names": "STRING"}, "base_type": "DATASET"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "Examples"}], "transform_graph": [{"artifact": {"id": "1437506892152175304", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Transform_3489170958130872320/transform_graph", "custom_properties": {"tfx_version": {"struct_value": {"__value__": "1.9.1"}}}}, "artifact_type": {"name": "TransformGraph"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "TransformGraph"}]}', '--outputs', '{"best_hyperparameters": [{"artifact": {"id": "8185408808084326621", "uri": "gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/874401645461/resnet50-tfx-pipeline-tuner-test-20220815041740/Tuner_-5734201078723903488/best_hyperparameters"}, "artifact_type": {"name": "HyperParameters"}, "__artifact_class_module__": "tfx.types.standard_artifacts", "__artifact_class_name__": "HyperParameters"}]}', '--exec-properties', '{"custom_config": "{\\"ai_platform_tuning_args\\": {\\"masterConfig\\": {\\"acceleratorConfig\\": {\\"count\\": 1, \\"type\\": \\"NVIDIA_TESLA_K80\\"}, \\"imageUri\\": \\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\"}, \\"masterType\\": \\"n1-standard-4\\", \\"project\\": \\"gcp-ml-172005\\", \\"region\\": \\"us-central1\\", \\"scaleTier\\": \\"CUSTOM\\", \\"serviceAccount\\": \\"vizier@gcp-ml-172005.iam.gserviceaccount.com\\", \\"workerConfig\\": {\\"acceleratorConfig\\": {\\"count\\": 1, \\"type\\": \\"NVIDIA_TESLA_K80\\"}, \\"imageUri\\": \\"gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test\\"}, \\"workerCount\\": 3, \\"workerType\\": \\"n1-standard-4\\"}, \\"remote_trials_working_dir\\": \\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/resnet50-tfx-pipeline-tuner-test/trials\\"}", "eval_args": "{\\n \\"num_steps\\": 4\\n}", "train_args": "{\\n \\"num_steps\\": 160\\n}", "tune_args": "{\\n \\"num_parallel_trials\\": 3\\n}", "tuner_fn": "models.model.tuner_fn"}']}, 'masterType': 'n1-standard-4', 'region': 'us-central1', 'scaleTier': 'CUSTOM', 'serviceAccount': 'vizier@gcp-ml-172005.iam.gserviceaccount.com', 'workerConfig': {'acceleratorConfig': {'count': 1, 'type': 'NVIDIA_TESLA_K80'}, 'imageUri': 'gcr.io/gcp-ml-172005/resnet50-tfx-pipeline-tuner-test'}, 'workerCount': 2, 'workerType': 'n1-standard-4'}, 'labels': {'tfx_version': '1-9-1', 'tfx_py_version': '3-7', 'tfx_executor': 
'tfx-extensions-google_cloud_ai_platform-tuner-executor-_workere'}}
Error I0815 04:39:08.493797 140048071423808 training_clients.py:264] Submitting job='tfx_tuner_20220815043908', project='projects/gcp-ml-172005' to AI Platform.
Info Finished tearing down training program.
Info Job failed.
```
I see an ImportError on oauth2client. I am using the tfx 1.9.1 Docker image as a base. If this were the root cause, I would assume it should make the extensions.Trainer component fail too, but it didn't.
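As an aside, that file_cache ImportError is a well-known message from googleapiclient: its discovery-cache autodetection tries to import file_cache, which is unavailable whenever oauth2client >= 4.0.0 or google-auth is installed, and the failure gets logged even though it is usually harmless. If you build API clients in your own code, it can be silenced by disabling discovery caching; a minimal sketch (the "ml" service name and version are just examples):

```python
from googleapiclient import discovery

# cache_discovery=False skips the file-based discovery cache, which avoids
# the "file_cache is unavailable" import attempt and its logged ImportError.
ml_service = discovery.build("ml", "v1", cache_discovery=False)
```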
@1025KB
I see. My bad for describing the problem wrong.
I used the Trainer component with Vertex AI and the Tuner component with CAIP. The Trainer component works well without any failure, but the Tuner component fails.
Is this because I tried to hook Vertex AI and CAIP up together? If so, does TFX not support the Tuner component within a Vertex Pipeline?
Are you able to run the pipeline & Trainer component with CAIP? I'm wondering if it's your CAIP setup.
Not sure; I am not using CAIP, but I included the CloudTuner component in Vertex, which uses CAIP.
You can run extension.Tuner with a KerasTuner in tuner_fn on Vertex; only CloudTuner in tuner_fn requires CAIP.
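For contrast, a CloudTuner-based tuner_fn would differ mainly in how the tuner object is constructed; roughly like the sketch below, assuming the CloudTuner class from the tensorflow_cloud package and the same _build_keras_model / _get_hyperparameters helpers that appear later in this thread:

```python
# Sketch only: replace the tuner construction inside tuner_fn with CloudTuner,
# which delegates trial suggestions to the Vizier service on CAIP.
import keras_tuner
from tensorflow_cloud import CloudTuner

tuner = CloudTuner(
    _build_keras_model,
    project_id=GOOGLE_CLOUD_PROJECT,  # placeholder, as elsewhere in this thread
    region="us-central1",
    objective=keras_tuner.Objective("val_sparse_categorical_accuracy", "max"),
    hyperparameters=_get_hyperparameters(),
    max_trials=6,
    directory=fn_args.working_dir,
)
```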
Yeah, but KerasTuner failed with the logs that I shared here: https://github.com/tensorflow/tfx/issues/5141#issuecomment-1214618023. It worked successfully with the local engine, though.
Are you using CAIP or Vertex? If it's KerasTuner, you can use Vertex (with a custom_config similar to the Trainer's).
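For reference, "similar to the Trainer's" means something along the lines of the sketch below, a Vertex-style Trainer custom_config. The key constants come from the same tfx.extensions.google_cloud_ai_platform modules used later in this thread, and GOOGLE_CLOUD_REGION / GOOGLE_CLOUD_PROJECT / PIPELINE_IMAGE are placeholders; as later messages confirm, the Tuner version swaps TRAINING_ARGS_KEY for TUNING_ARGS_KEY:

```python
# Sketch of a Vertex-style Trainer custom_config for comparison.
import tfx.extensions.google_cloud_ai_platform.constants as vertex_const
import tfx.extensions.google_cloud_ai_platform.trainer.executor as vertex_trainer_const

GCP_AI_PLATFORM_TRAINING_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_trainer_const.TRAINING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        "worker_pool_specs": [
            {
                "machine_spec": {"machine_type": "n1-standard-4"},
                "replica_count": 1,
                "container_spec": {"image_uri": PIPELINE_IMAGE},
            }
        ],
    },
}
```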
Let me clarify:

- I am currently using Vertex Pipelines.
- My initial attempt was to include CloudTuner in the Vertex Pipeline, and it turned out to be a failure.
- My second attempt was to include KerasTuner in the Vertex Pipeline, and it also turned out to be a failure (logs). Here is the source code for using KerasTuner, and the full source code is here:
### Tuner Args
```python
GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        "region": "us-central1",
        "scaleTier": "CUSTOM",
        "masterType": "n1-standard-4",
        "masterConfig": {
            "imageUri": PIPELINE_IMAGE,
            "acceleratorConfig": {
                "count": 1,
                "type": "NVIDIA_TESLA_K80",
            },
        },
        "workerType": "n1-standard-4",
        "workerCount": 3,
        "workerConfig": {
            "imageUri": PIPELINE_IMAGE,
            "acceleratorConfig": {
                "count": 1,
                "type": "NVIDIA_TESLA_K80",
            },
        },
        "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com",
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
}
```
### Pipeline
```python
from tfx.extensions.google_cloud_ai_platform.tuner.component import Tuner

tuner = Tuner(
    tuner_fn=modules["cloud_tuner_fn"],
    examples=transform.outputs["transformed_examples"],
    transform_graph=transform.outputs["transform_graph"],
    train_args=train_args,
    eval_args=eval_args,
    tune_args=tuner_args,
    custom_config=ai_platform_tuner_args,
)
```
### modules["cloud_tuner_fn"] which is "tuner_fn"
```python
def tuner_fn(fn_args: FnArgs) -> TunerFnResult:
    steps_per_epoch = int(_TRAIN_DATA_SIZE / _TRAIN_BATCH_SIZE)

    tuner = keras_tuner.RandomSearch(
        _build_keras_model,
        max_trials=6,
        hyperparameters=_get_hyperparameters(),
        allow_new_entries=False,
        objective=keras_tuner.Objective("val_sparse_categorical_accuracy", "max"),
        directory=fn_args.working_dir,
        project_name="img_classification_tuning",
    )

    tf_transform_output = tft.TFTransformOutput(fn_args.transform_graph_path)

    train_dataset = _input_fn(
        fn_args.train_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=True,
        batch_size=_TRAIN_BATCH_SIZE,
    )

    eval_dataset = _input_fn(
        fn_args.eval_files,
        fn_args.data_accessor,
        tf_transform_output,
        is_train=False,
        batch_size=_EVAL_BATCH_SIZE,
    )

    return TunerFnResult(
        tuner=tuner,
        fit_kwargs={
            "x": train_dataset,
            "validation_data": eval_dataset,
            "steps_per_epoch": steps_per_epoch,
            "validation_steps": fn_args.eval_steps,
        },
    )
```
Maybe I should modify GCP_AI_PLATFORM_TUNER_ARGS differently so that it doesn't use CAIP. When using KerasTuner, where does it perform the job? CAIP? Vertex?
If you set up the custom_config to use Vertex, it will run KerasTuner on Vertex.
So the custom_config should be set up similarly to what is applied to Vertex Training?
Yep, you need to add ENABLE_VERTEX_KEY & VERTEX_REGION_KEY in addition to TUNING_ARGS_KEY and REMOTE_TRIALS_WORKING_DIR_KEY.
Great, thanks!
I will let you know how it goes.
@1025KB
This config works for Vertex Training (if TUNING_ARGS_KEY is replaced with TRAINING_ARGS_KEY), but it failed for KerasTuner on Vertex. Could you please take a look?
In particular, it complains about KeyError: 'job_spec':
```
Error File "/opt/conda/lib/python3.7/site-packages/tfx/extensions/google_cloud_ai_platform/tuner/executor.py", line 121, in Do
Error   worker_pool_specs = training_inputs['job_spec'].get('worker_pool_specs')
KeyError: 'job_spec'
```
```python
import tfx.extensions.google_cloud_ai_platform.constants as vertex_const
import tfx.extensions.google_cloud_ai_platform.tuner.executor as vertex_tuner_const

GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "n1-standard-4",
                    "accelerator_type": "NVIDIA_TESLA_K80",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {
                    "image_uri": PIPELINE_IMAGE,
                },
            }
        ],
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
    "use_gpu": True,
}
```
Never mind, I figured it out! :)
```python
GCP_AI_PLATFORM_TUNER_ARGS = {
    vertex_const.ENABLE_VERTEX_KEY: True,
    vertex_const.VERTEX_REGION_KEY: GOOGLE_CLOUD_REGION,
    vertex_tuner_const.TUNING_ARGS_KEY: {
        "project": GOOGLE_CLOUD_PROJECT,
        # "serviceAccount": "vizier@gcp-ml-172005.iam.gserviceaccount.com",
        "job_spec": {
            "worker_pool_specs": [
                {
                    "machine_spec": {
                        "machine_type": "n1-standard-4",
                        "accelerator_type": "NVIDIA_TESLA_K80",
                        "accelerator_count": 1,
                    },
                    "replica_count": 1,
                    "container_spec": {
                        "image_uri": PIPELINE_IMAGE,
                    },
                }
            ],
        },
    },
    vertex_tuner_const.REMOTE_TRIALS_WORKING_DIR_KEY: os.path.join(
        PIPELINE_ROOT, "trials"
    ),
    "use_gpu": True,
}
```
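That lines up with the executor line in the traceback: the Vertex-style tuning args appear to be treated as the body of a Vertex AI CustomJob, which nests the worker pools under job_spec. A rough sketch of the equivalent raw CustomJob body (field names per the Vertex AI CustomJob resource; the display name and image are illustrative placeholders):

```python
# Hypothetical raw CustomJob body that the nested custom_config above maps onto.
custom_job_body = {
    "display_name": "tfx-tuner-job",  # illustrative
    "job_spec": {
        "worker_pool_specs": [
            {
                "machine_spec": {
                    "machine_type": "n1-standard-4",
                    "accelerator_type": "NVIDIA_TESLA_K80",
                    "accelerator_count": 1,
                },
                "replica_count": 1,
                "container_spec": {"image_uri": "gcr.io/your-project/your-tuner-image"},
            }
        ],
    },
}
```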
Thanks for the great support @1025KB
Cool!
I have successfully initiated a Vizier Job via CloudTuner, but it failed.
I have looked into the logs, but no errors occurred and the training finished successfully. Could you take a look at what happened? The logs should be read from bottom to top.