google-deepmind / xmanager

A platform for managing machine learning experiments
Apache License 2.0
816 stars 45 forks

JOB_STATE_FAILED for cifar10_tensorflow #19

Open nayakanuj opened 2 years ago

nayakanuj commented 2 years ago

I am unable to launch an example script. The command and part of the console output, including the error, are below. I am running the command from the PyCharm terminal. The job launches but fails immediately with the state "JOB_STATE_FAILED".

% sudo xmanager launch ./examples/cifar10_tensorflow/launcher.py

Console output + Error (a part of it):

[+] Building 0.5s (16/16) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 694B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for gcr.io/deeplearning-platform-release/tf2-gpu.2-6:latest 0.4s
=> [ 1/11] FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-6@sha256:<"a bunch of HEX digits"> 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 8.07kB 0.0s
=> CACHED [ 2/11] RUN if ! id 1000; then useradd -m -u 1000 clouduser; fi 0.0s
=> CACHED [ 3/11] RUN apt-get update && apt-get install -y git netcat 0.0s
=> CACHED [ 4/11] RUN python -m pip install --upgrade pip 0.0s
=> CACHED [ 5/11] COPY cifar10_tensorflow/requirements.txt /cifar10_tensorflow/requirements.txt 0.0s
=> CACHED [ 6/11] RUN python -m pip install -r cifar10_tensorflow/requirements.txt 0.0s
=> CACHED [ 7/11] COPY cifar10_tensorflow/ /cifar10_tensorflow 0.0s
=> CACHED [ 8/11] RUN chown -R 1000:root /cifar10_tensorflow && chmod -R 775 /cifar10_tensorflow 0.0s
=> CACHED [ 9/11] WORKDIR cifar10_tensorflow 0.0s
=> CACHED [10/11] COPY entrypoint.sh ./entrypoint.sh 0.0s
=> CACHED [11/11] RUN chown -R 1000:root ./entrypoint.sh && chmod -R 775 ./entrypoint.sh 0.0s
=> exporting to image 0.0s
=> => exporting layers
...
{"status":"Waiting","progressDetail":{},"id": ....
{"status":"Layer already exists","progressDetail":{},"id": ....
Your image URI is:
Job launched at: https://console.cloud.google.com/ai/platform/locations//training/
current state: JobState.JOB_STATE_QUEUED
current state: JobState.JOB_STATE_PENDING
current state: JobState.JOB_STATE_FAILED
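
In case it helps with triage, here is a rough sketch of how the state transitions could be pulled out of a captured log like the one above. It only assumes the "current state: JobState.<NAME>" lines shown in my output; the helper names (`job_states`, `job_failed`) are my own and are not part of XManager.

```python
import re
from typing import List

# Matches the "current state: JobState.<NAME>" lines printed in the output above.
# Assumption: state names consist of upper-case letters and underscores only.
STATE_PATTERN = re.compile(r"current state: JobState\.(JOB_STATE_[A-Z_]+)")


def job_states(console_output: str) -> List[str]:
    """Return the job states in the order they appear in the log (hypothetical helper)."""
    return STATE_PATTERN.findall(console_output)


def job_failed(console_output: str) -> bool:
    """Return True if the last reported state is JOB_STATE_FAILED (hypothetical helper)."""
    states = job_states(console_output)
    return bool(states) and states[-1] == "JOB_STATE_FAILED"


if __name__ == "__main__":
    # Sample text copied from the tail of the console output in this report.
    sample = (
        "current state: JobState.JOB_STATE_QUEUED\n"
        "current state: JobState.JOB_STATE_PENDING\n"
        "current state: JobState.JOB_STATE_FAILED\n"
    )
    print(job_states(sample))
    print(job_failed(sample))
```

The pattern works whether the states are on separate lines or run together on one line, so it can be applied to the raw terminal capture. It only reports which states were printed; it does not explain why the job failed.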