Open Haris-Ali007 opened 2 years ago
Sorry for the late reply
2022-09-07 07:32:12.549446: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = "{"created":"@1662535932.549287619","description":"Error received from peer ipv4:10.115.174.50:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
This warning means that there was a crash on the server side, and given the log below it seems to be happening at graph compilation time. Since you are running on Colab, it is using a TPU Node, which is an older architecture, so the error message is really vague. The best way I can see to debug this issue is to run it on a TPU VM (see https://cloud.google.com/tpu/docs/pytorch-xla-ug-tpu-vm).
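As a rough sketch of how to read that warning: the `grpc_error_string` payload is JSON, so it can be extracted and inspected. The helper names `parse_grpc_error` and `summarize` are made up for illustration and are not part of TensorFlow or PyTorch/XLA. The log text is copied from the warning above, with the timestamp and source-file prefix left out.

```python
import json
import re

# Warning text copied from the log above (timestamp/source-file prefix omitted).
LOG_LINE = (
    'RPC failed with status = "UNAVAILABLE: Socket closed" and grpc_error_string = '
    '"{"created":"@1662535932.549287619",'
    '"description":"Error received from peer ipv4:10.115.174.50:8470",'
    '"file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc",'
    '"file_line":1056,"grpc_message":"Socket closed","grpc_status":14}"'
    ', maybe retrying the RPC'
)

# Only the one gRPC status code needed here; 14 is UNAVAILABLE in the gRPC spec.
GRPC_STATUS_NAMES = {14: "UNAVAILABLE"}


def parse_grpc_error(line):
    """Hypothetical helper: return the embedded JSON payload as a dict, or None."""
    match = re.search(r'grpc_error_string = "(\{.*\})"', line)
    if match is None:
        return None
    return json.loads(match.group(1))


def summarize(line):
    """Hypothetical helper: pull out the status, message and peer address."""
    payload = parse_grpc_error(line)
    if payload is None:
        return None
    peer = re.search(r"ipv4:([\d.]+):(\d+)", payload.get("description", ""))
    status = payload.get("grpc_status")
    return {
        "status_code": status,
        "status_name": GRPC_STATUS_NAMES.get(status, "unknown"),
        "message": payload.get("grpc_message"),
        "peer_host": peer.group(1) if peer else None,
        "peer_port": int(peer.group(2)) if peer else None,
    }


if __name__ == "__main__":
    print(summarize(LOG_LINE))
```

The payload only says that the connection to the peer was closed (status UNAVAILABLE). It does not say why the remote side went away, which is why running on a TPU VM, where the process and its logs are local, is likely to be more informative.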
RPC failed while running stylegan3 code on tpu-pytorch
Hello everyone. I was trying to port the stylegan3 code by NVLabs to TPU to speed up the processing. The code is being executed on Colab. I mostly commented out the parts of the original script that were causing problems and were not required in the initial stage. However, I am stuck on this issue and can't figure out the solution.
To Reproduce
Clone https://github.com/NVlabs/stylegan3 and paste the code into training/training_loop.py
Error
System Info