tensorflow / cloud

The TensorFlow Cloud repository provides APIs that make it easy to go from debugging and training your Keras and TensorFlow code in a local environment to distributed training in the cloud.
https://github.com/tensorflow/cloud
Apache License 2.0
372 stars, 85 forks

Running into "Internal error occurred for the current attempt" problem #387

Open deep-diver opened 2 years ago

deep-diver commented 2 years ago

I am using CloudTuner in a TFX project, but I keep getting an "Internal error occurred for the current attempt" error, and it doesn't show what the actual underlying problem is.

Below is the JSON passed to the CloudTuner, and this is my repository.

For imageUri, I passed the TFX Docker image.

{
  "scaleTier": "CUSTOM",
  "masterType": "standard",
  "workerType": "standard",
  "workerCount": "2",
  "region": "us-central1",
  "masterConfig": {
    "imageUri": "gcr.io/gcp-ml-172005/img-classification",
    "containerCommand": [
      "python",
      "-m",
      "tfx.scripts.run_executor",
      "--executor_class_path",
      "tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor",
      "--inputs",
      "{\"examples\": [{\"artifact\": {\"id\": \"302652664909979029\", \"uri\": \"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/img-classification/874401645461/img-classification-20220725145617/Transform_-7372794461505454080/transformed_examples\", \"properties\": {\"split_names\": {\"string_value\": \"[\\\"train\\\", \\\"eval\\\"]\"}}, \"custom_properties\": {\"tfx_version\": {\"struct_value\": {\"__value__\": \"1.9.0\"}}}}, \"artifact_type\": {\"name\": \"Examples\", \"properties\": {\"span\": \"INT\", \"version\": \"INT\", \"split_names\": \"STRING\"}, \"base_type\": \"DATASET\"}, \"__artifact_class_module__\": \"tfx.types.standard_artifacts\", \"__artifact_class_name__\": \"Examples\"}], \"transform_graph\": [{\"artifact\": {\"id\": \"7122557137885461129\", \"uri\": \"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/img-classification/874401645461/img-classification-20220725145617/Transform_-7372794461505454080/transform_graph\", \"custom_properties\": {\"tfx_version\": {\"struct_value\": {\"__value__\": \"1.9.0\"}}}}, \"artifact_type\": {\"name\": \"TransformGraph\"}, \"__artifact_class_module__\": \"tfx.types.standard_artifacts\", \"__artifact_class_name__\": \"TransformGraph\"}]}",
      "--outputs",
      "{\"best_hyperparameters\": [{\"artifact\": {\"id\": \"6837211415839241726\", \"uri\": \"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/img-classification/874401645461/img-classification-20220725145617/Tuner_6462263593776709632/best_hyperparameters\"}, \"artifact_type\": {\"name\": \"HyperParameters\"}, \"__artifact_class_module__\": \"tfx.types.standard_artifacts\", \"__artifact_class_name__\": \"HyperParameters\"}]}",
      "--exec-properties",
      "{\"custom_config\": \"{\\\"ai_platform_tuning_args\\\": {\\\"masterConfig\\\": {\\\"imageUri\\\": \\\"gcr.io/gcp-ml-172005/img-classification\\\"}, \\\"project\\\": \\\"gcp-ml-172005\\\", \\\"region\\\": \\\"us-central1\\\", \\\"scaleTier\\\": \\\"STANDARD_1\\\"}, \\\"masterConfig\\\": {\\\"imageUri\\\": \\\"gcr.io/gcp-ml-172005/img-classification\\\"}, \\\"project\\\": \\\"gcp-ml-172005\\\", \\\"region\\\": \\\"us-central1\\\", \\\"remote_trials_working_dir\\\": \\\"gs://gcp-ml-172005-complete-mlops/tfx_pipeline_output/img-classification/trials\\\", \\\"scaleTier\\\": \\\"STANDARD_1\\\"}\", \"eval_args\": \"{\\n  \\\"num_steps\\\": 4\\n}\", \"train_args\": \"{\\n  \\\"num_steps\\\": 160\\n}\", \"tune_args\": \"{\\n  \\\"num_parallel_trials\\\": 3\\n}\", \"tuner_fn\": \"models.model.cloud_tuner_fn\"}"
    ]
  }
}
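The `--inputs`, `--outputs`, and `--exec-properties` values in the job spec above are JSON strings that themselves contain escaped JSON (for example `custom_config`), which makes the configuration hard to read when debugging. The following is a minimal sketch, using only the Python standard library, of one way to decode such a `containerCommand` into nested dictionaries. The helper names `decode_container_command` and `decode_nested_strings` are hypothetical, and the sample command is a shortened stand-in for the one posted above, not the full original.

```python
import json


def decode_container_command(command):
    """Pair each --flag with the token after it; JSON-decode the value if possible."""
    decoded = {}
    i = 0
    while i < len(command):
        token = command[i]
        if token.startswith("--") and i + 1 < len(command):
            raw = command[i + 1]
            try:
                decoded[token] = json.loads(raw)
            except json.JSONDecodeError:
                # Plain strings such as a class path are kept unchanged.
                decoded[token] = raw
            i += 2
        else:
            i += 1
    return decoded


def decode_nested_strings(props):
    """Decode string values that hold a serialized JSON object or list."""
    out = {}
    for key, value in props.items():
        parsed = value
        if isinstance(value, str):
            try:
                candidate = json.loads(value)
            except json.JSONDecodeError:
                candidate = value
            # Only unwrap containers; leave ordinary strings as they are.
            if isinstance(candidate, (dict, list)):
                parsed = candidate
        out[key] = parsed
    return out


# Shortened stand-in for the containerCommand in the issue (values abbreviated).
command = [
    "python", "-m", "tfx.scripts.run_executor",
    "--executor_class_path",
    "tfx.extensions.google_cloud_ai_platform.tuner.executor._WorkerExecutor",
    "--exec-properties",
    json.dumps({
        "custom_config": json.dumps(
            {"region": "us-central1", "scaleTier": "STANDARD_1"}
        ),
        "tune_args": json.dumps({"num_parallel_trials": 3}),
        "tuner_fn": "models.model.cloud_tuner_fn",
    }),
]

flags = decode_container_command(command)
exec_props = decode_nested_strings(flags["--exec-properties"])
print(json.dumps(exec_props, indent=2))
```

Decoding the spec this way makes it easier to compare the top-level fields (`"scaleTier": "CUSTOM"`) with the nested `custom_config` values (`"scaleTier": "STANDARD_1"`). Whether that difference is related to the reported error is not established in this thread.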
ravitejasssihl commented 2 years ago

Hello, I am Ravi. As part of a college assignment, I am interested in solving this issue, for which I need your approval and guidance. Can you accept me as a contributor?