tensorflow / tpu

Reference models and tools for Cloud TPUs.
https://cloud.google.com/tpu/
Apache License 2.0

mask_rcnn model hangs in train_and_eval mode in cycle 0 when run in a GKE pod with cloud TPU #288

Open ajayvohra2005 opened 5 years ago

ajayvohra2005 commented 5 years ago

TensorFlow version: 1.12
Platform: GKE pod with Cloud TPU v3-8
tensorflow/tpu github hash: 64f3b5f9582bc484daa1c13c3d9edc0fc7127b05

Here is the Dockerfile used to build container image:

FROM gcr.io/tensorflow/tpu-models:r1.12

RUN sudo apt-get update -y
RUN sudo apt-get install dialog apt-utils -y
RUN echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
RUN sudo apt-get install -y python-dev
RUN sudo apt-get install -y python-tk
RUN sudo apt-get install -y libglib2.0-0
RUN sudo apt-get install dnsutils -y
RUN pip install --upgrade pip
RUN pip install Cython matplotlib
RUN pip install 'git+https://github.com/cocodataset/cocoapi#egg=pycocotools&subdirectory=PythonAPI'
RUN pip install opencv-python-headless
RUN pip install pyyaml

RUN git clone https://github.com/tensorflow/tpu.git /tpu
RUN cd /tpu && git fetch origin 64f3b5f9582bc484daa1c13c3d9edc0fc7127b05
RUN cd /tpu && git reset --hard 64f3b5f9582bc484daa1c13c3d9edc0fc7127b05

Here is the job spec:

apiVersion: batch/v1
kind: Job
metadata:
  name: mask-rcnn-tpu
spec:
  template:
    metadata:
      annotations:
        # The Cloud TPUs that will be created for this Job must support
        # TensorFlow 1.12. This version MUST match
        # the TensorFlow version that your model is built on.
        tf-version.cloud-tpus.google.com: "1.12"
    spec:
      restartPolicy: Never
      containers:
      - name: mask-rcnn-tpu
        image: gcr.io/mask-rcnn-tutorial/tpu-models:r1.12-64f3b5f9582bc484daa1c13c3d9edc0fc7127b05
        imagePullPolicy: Always
        workingDir: /tpu/models/experimental/mask_rcnn
        command:
        - python
        - mask_rcnn_main.py
        - --use_tpu=True
        - --num_cores=8
        - --model_dir=gs://my-bucket/mask-rcnn-model-24-eval-6-gke
        - --iterations_per_loop=1875
        - --mode=train_and_eval
        - --config=resnet_checkpoint=gs://cloud-tpu-artifacts/resnet/resnet-nhwc-2018-10-14/model.ckpt-112602,resnet_depth=50,use_bfloat16=true,train_batch_size=64,eval_batch_size=8,training_file_pattern=gs://my-bucket/coco/train-*,validation_file_pattern=gs://my-bucket/coco/val-*,val_json_file=gs://my-bucket/coco/instances_val2017.json,total_steps=45000,num_steps_per_eval=11250
        resources:
          limits:
            cloud-tpus.google.com/v3: 8
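
For reference, here is how I read the scheduling flags in the spec above; the numbers are just arithmetic on the flag values:

```python
# Plain arithmetic on the flag values from the job spec above.
total_steps = 45000
num_steps_per_eval = 11250
iterations_per_loop = 1875
train_batch_size = 64

eval_cycles = total_steps // num_steps_per_eval               # 4 train/eval cycles in total
loops_per_cycle = num_steps_per_eval // iterations_per_loop   # 6 infeed loops per cycle
images_per_cycle = num_steps_per_eval * train_batch_size      # 720000 images per training cycle
print(eval_cycles, loops_per_cycle, images_per_cycle)
```

So cycle 0 alone should enqueue 6 loops of 1875 batches before the first evaluation; the log below shows only the first enqueue before TPU utilization drops to 0.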

The training runs for about 30 minutes but never completes cycle 0. Below are the last few lines of the stdout log from the pod at the point where it gets stuck. TPU utilization falls to 0, memory stays roughly constant at around 50 GB, and there is no network activity. I logged onto the pod and could not find anything that explains the stuck python mask_rcnn_main.py process (see the stack-dump sketch after the log).

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:TPU job name worker
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into gs://ajayvohra-mrcnn-central1/mask-rcnn-model-24-eval-6-gke/model.ckpt.
INFO:tensorflow:Installing graceful shutdown hook.
2019-02-20 16:10:43.635366: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
INFO:tensorflow:Creating heartbeat manager for ['/job:tpu_worker/replica:0/task:0/device:CPU:0', '/job:tpu_worker/replica:0/task:0/device:XLA_CPU:0']
WARNING:tensorflow:Worker heartbeats not supported by all workers. No failure handling will be enabled.
INFO:tensorflow:Init TPU system
INFO:tensorflow:Starting infeed thread controller.
INFO:tensorflow:Starting outfeed thread controller.
INFO:tensorflow:Enqueue next (1875) batch(es) of data to infeed.
INFO:tensorflow:Dequeue next (1875) batch(es) of data from outfeed.
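
For the next run I am adding a stack-dump hook to mask_rcnn_main.py so the hung process can be inspected from a shell in the pod. This is just a standard-library sketch, not anything from the TPU codebase:

```python
# Debugging aid (my own addition, not part of mask_rcnn_main.py): dump the
# stack of every Python thread when the process receives SIGUSR1, so a hung
# run can be inspected with `kill -USR1 <pid>` from inside the pod.
import signal
import sys
import threading
import traceback

def dump_stacks(signum, frame):
    names = {t.ident: t.name for t in threading.enumerate()}
    for thread_id, stack in sys._current_frames().items():
        print("--- thread %s (%s) ---" % (thread_id, names.get(thread_id, "?")))
        traceback.print_stack(stack)

signal.signal(signal.SIGUSR1, dump_stacks)
```

If the infeed thread is blocked inside the input pipeline, the dump should show where.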

gitosaurus commented 5 years ago

You may need a longer-lived connection to the Compute Engine instance. Try updating the keep-alive parameters.
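
I don't have exact values for this setup; as a rough sketch (the sysctl values below are assumptions, adjust for your environment), tightening kernel TCP keep-alive inside the container keeps long-idle connections to the VM from being dropped silently:

```python
# Hypothetical sketch: tighten kernel TCP keep-alive inside the container so
# long-idle connections (e.g. to the Compute Engine / TPU gRPC endpoints)
# are probed instead of timing out silently. Values are examples only.
import subprocess

keepalive = {
    "net.ipv4.tcp_keepalive_time": "60",    # start probing after 60 s idle
    "net.ipv4.tcp_keepalive_intvl": "60",   # probe every 60 s thereafter
    "net.ipv4.tcp_keepalive_probes": "5",   # declare the peer dead after 5 misses
}
for key, value in keepalive.items():
    subprocess.check_call(["sysctl", "-w", "%s=%s" % (key, value)])
```

The same values could equally be set with plain sysctl commands in the Dockerfile; the Python wrapper is only for illustration.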

rowanz commented 5 years ago

I ran into the same issue during evaluation. Following the tutorial here and running train_and_eval mode on a v3-8 TPU, training finishes successfully, but evaluation hangs silently.

Here are the logging messages I get, if it helps:

W0814 16:34:55.394186 140354355455744 preempted_hook.py:89] TPUPollingThread found TPU vlm4 in state READY, and health HEALTHY.
I0814 16:35:09.865813 140356188689856 basic_session_run_hooks.py:606] Saving checkpoints for 2500 into gs://vlmodel/mask-rcnn/stock-try1/model.ckpt.
I0814 16:35:21.922967 140356188689856 basic_session_run_hooks.py:262] loss = 1.322376, step = 2500
I0814 16:35:23.072418 140356188689856 tpu_estimator.py:598] Stop infeed thread controller
I0814 16:35:23.072729 140356188689856 tpu_estimator.py:430] Shutting down InfeedController thread.
I0814 16:35:23.074306 140354575648512 tpu_estimator.py:425] InfeedController received shutdown signal, stopping.
I0814 16:35:23.074624 140354575648512 tpu_estimator.py:530] Infeed thread finished, shutting down.
I0814 16:35:23.074912 140356188689856 error_handling.py:96] infeed marked as finished
I0814 16:35:23.075222 140356188689856 tpu_estimator.py:602] Stop output thread controller
I0814 16:35:23.075303 140356188689856 tpu_estimator.py:430] Shutting down OutfeedController thread.
I0814 16:35:23.075495 140354363848448 tpu_estimator.py:425] OutfeedController received shutdown signal, stopping.
I0814 16:35:23.075690 140354363848448 tpu_estimator.py:541] Outfeed thread finished, shutting down.
I0814 16:35:23.075850 140356188689856 error_handling.py:96] outfeed marked as finished
I0814 16:35:23.075973 140356188689856 tpu_estimator.py:606] Shutdown TPU system.
I0814 16:35:26.706342 140356188689856 estimator.py:368] Loss for final step: 1.322376.
I0814 16:35:26.707267 140356188689856 error_handling.py:96] training_loop marked as finished
I0814 16:35:26.707381 140356188689856 distributed_executer.py:224] Start evaluation cycle 0.

It hangs after that, though CPU usage from the Python process stays at 100%, so it's possible the local process is still trying to talk to a TPU that has already shut down.

Any ideas on how to debug this? My suspicion is that there's something weird in the dataloader, but I'm not sure how to debug that because it is run on the TPU.
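
One thing I can check without touching the TPU is whether the validation TFRecords contain as many examples as one evaluation pass will request, since a too-short eval dataset is one way the infeed could stall. A minimal CPU-side sketch (the GCS pattern and eval settings are placeholders for my own values):

```python
# Hypothetical sanity check, run on CPU: count records in the validation
# TFRecords and compare against what one evaluation pass will request.
import tensorflow as tf  # TF 1.x

validation_file_pattern = "gs://my-bucket/coco/val-*"
eval_batch_size = 8
eval_samples = 5000  # whatever eval_samples the config resolves to

files = tf.gfile.Glob(validation_file_pattern)
num_records = 0
for f in files:
    for _ in tf.python_io.tf_record_iterator(f):
        num_records += 1

print("shards:", len(files))
print("records found:", num_records)
print("records requested per eval:", eval_samples)
print("eval steps at batch size %d: %d" % (eval_batch_size, eval_samples // eval_batch_size))
```

If the record count comes up short of eval_samples, that would at least point at the data rather than the model.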