Open yongzhe2160 opened 4 years ago
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CMLE
TensorFlow installed from (source or binary): CMLE
TensorFlow version (use command below): 1.14
Python version: 2.7
Please provide the entire URL of the model you are using?
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
Describe the current behavior: The job failed after ~20 min. Retried a few times, same error.
Describe the expected behavior: Training succeeds.
Code to reproduce the issue: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
Other info / logs
{
  insertId: "gng8v7fh35gc1"
  jsonPayload: {
    created: 1587446734.441076
    levelname: "ERROR"
    lineno: 328
    message: "RuntimeError: There was no new checkpoint after the training. Eval status: missing checkpoint"
    pathname: "/runcloudml.py"
  }
  labels: {
    compute.googleapis.com/resource_id: "3602514279639793991"
    compute.googleapis.com/resource_name: "gke-cml-0421-050704--n1-standard-8-30-48740b2f-bhw1"
    compute.googleapis.com/zone: "us-central1-c"
    ml.googleapis.com/job_id/log_area: "root"
    ml.googleapis.com/trial_id: ""
  }
  logName: "projects/yongzhe-test/logs/master-replica-0"
  receiveTimestamp: "2020-04-21T05:25:37.672555504Z"
  resource: {
    labels: {
      job_id: "yongzhe_object_detection_pets_04_20_2020_22_07_01"
      project_id: "yongzhe-test"
      task_name: "master-replica-0"
    }
    type: "ml_job"
  }
  severity: "ERROR"
  timestamp: "2020-04-21T05:25:34.441076039Z"
}
Any solutions to this issue?