google-research / t5x

T5x fine-tuning on GPU #864

Open · Sarimuko opened this issue 1 year ago

Sarimuko commented 1 year ago

Hi,

I am trying to run T5X fine-tuning on an A6000 GPU server. I followed the instructions I found, but training fails with the following error on both nodes:

2022-10-23 01:07:26.771566: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 1 failed: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_all_reduce_thunk.cc:70: NCCL operation ncclAllReduce(send_buffer, recv_buffer, element_count, dtype, reduce_op, comm, gpu_stream) failed: unhandled cuda error
Traceback (most recent call last):
  File "/home/yihan/data/token/t5x/t5x/train.py", line 757, in <module>
    gin_utils.run(main)
  File "/nfs/data/yihan/token/t5x/t5x/gin_utils.py", line 107, in run
    app.run(
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/absl_py-1.3.0-py3.8.egg/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/absl_py-1.3.0-py3.8.egg/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/yihan/data/token/t5x/t5x/train.py", line 717, in main
    _main(argv)
  File "/home/yihan/data/token/t5x/t5x/train.py", line 753, in _main
    train_using_gin()
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/gin_config-0.5.0-py3.8.egg/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/gin_config-0.5.0-py3.8.egg/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/gin_config-0.5.0-py3.8.egg/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/yihan/data/token/t5x/t5x/train.py", line 220, in train
    random_seed = multihost_utils.broadcast_one_to_all(np.int32(time.time()))
  File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/jax-0.3.23-py3.8.egg/jax/experimental/multihost_utils.py", line 75, in broadcast_one_to_all
    in_tree = jax.device_get(_psum(in_tree))
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_all_reduce_thunk.cc:70: NCCL operation ncclAllReduce(send_buffer, recv_buffer, element_count, dtype, reduce_op, comm, gpu_stream) failed: unhandled cuda error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
  In call to configurable 'train' (<function train at 0x7fadc2021670>)

I am using Python 3.8, TensorFlow 2.10.0, CUDA 11.5 and JAX 0.3.23.

Could you help me figure out the problem? Thanks!
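For reference, the failure happens in the very first collective, before any training steps run. Below is a minimal sketch (not from the T5X repo) that reproduces just that broadcast outside of T5X, to check whether the CUDA/cuDNN/NCCL stack is healthy on its own; the coordinator address, process count and process id are placeholders for my two-node setup.

```python
# Minimal sketch (not T5X code): reproduce the startup broadcast that fails in
# t5x/train.py, to isolate NCCL/CUDA problems from the full training run.
import time

import numpy as np

import jax
from jax.experimental import multihost_utils

jax.distributed.initialize(
    coordinator_address="10.0.0.1:12345",  # placeholder: host:port of node 0
    num_processes=2,                        # one process per node in this sketch
    process_id=0,                           # 0 on the first node, 1 on the second
)

# Same call that t5x.train makes at startup: broadcast a seed from host 0 to
# all hosts via an NCCL all-reduce. If this alone raises the same
# "unhandled cuda error", the problem is in the GPU software stack, not T5X.
seed = multihost_utils.broadcast_one_to_all(np.int32(time.time()))
print("process", jax.process_index(), "got seed", seed)
```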

Sarimuko commented 1 year ago

This problem was solved by updating the cuDNN version to 8.2. However, I ran into another error when saving the first checkpoint to Google Cloud Storage:

File "/nfs/data/yihan/token/t5x/t5x/checkpoints.py", line 738, in save subprocess.run(['gsutil', '-m', 'mv', tmp_dir, final_dir], File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/subprocess.py", line 516, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['gsutil', '-m', 'mv', 'gs://t5x_pretrained_models/models/checkpoint_1000000.tmp-1666556700', 'gs://t5x_pretrained_models/models/checkpoint_1000000']' returned non-zero exit status 1. In call to configurable 'train' (<function train at 0x7f9fd9ecf670>) 2022-10-23 13:26:13.669308: E external/org_tensorflow/tensorflow/core/distributed_runtime/coordination/coordination_service_agent.cc:486] Failed to disconnect from coordination service with status: DEADLINE_EXCEEDED: Deadline Exceeded Additional GRPC error information from remote target unknown_target_for_coordination_leader: :{"created":"@1666556773.669059414","description":"Error received from peer ipv4:131.179.88.219:1456","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4}. Proceeding with agent shutdown anyway. Error in atexit._run_exitfuncs: Traceback (most recent call last): File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/jax-0.3.23-py3.8.egg/jax/_src/distributed.py", line 167, in shutdown global_state.shutdown() File "/home/yihan/miniconda3/envs/t5x/lib/python3.8/site-packages/jax-0.3.23-py3.8.egg/jax/_src/distributed.py", line 86, in shutdown self.client.shutdown() jaxlib.xla_extension.XlaRuntimeError: DEADLINE_EXCEEDED: Deadline Exceeded Additional GRPC error information from remote target unknown_target_for_coordination_leader: :{"created":"@1666556773.669059414","description":"Error received from peer ipv4:131.179.88.219:1456","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Deadline Exceeded","grpc_status":4} 2022-10-23 13:26:14.132046: E external/org_tensorflow/tensorflow/core/distributed_runtime/coordination/coordination_service.cc:1127] Shutdown barrier in coordination service has failed: DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::11785556176882848577 [type.googleapis.com/tensorflow.CoordinationServiceError='']. This suggests that at least one worker did not complete its job, or was too slow/hanging in its execution. 2022-10-23 13:26:14.132127: E external/org_tensorflow/tensorflow/core/distributed_runtime/coordination/coordination_service.cc:729] INTERNAL: Shutdown barrier has been passed with status: 'DEADLINE_EXCEEDED: Barrier timed out. Barrier_id: Shutdown::11785556176882848577 [type.googleapis.com/tensorflow.CoordinationServiceError='']', but this task is not at the barrier yet. [type.googleapis.com/tensorflow.CoordinationServiceError=''] 2022-10-23 13:26:14.132267: E external/org_tensorflow/tensorflow/core/distributed_runtime/coordination/coordination_service.cc:476] Stopping coordination service as shutdown barrier timed out and there is no service-to-client connection.

Sarimuko commented 1 year ago

Problem solved by running `gcloud init`.
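In case it helps others, here is a small, hypothetical pre-flight check (not part of T5X) to run on every node before launching training, so a missing credential fails fast instead of at the first checkpoint save. The bucket path matches the one in the traceback above.

```python
# Hypothetical pre-flight check (not part of T5X): confirm that gsutil is
# authenticated and can list the checkpoint bucket from this node.
import subprocess

result = subprocess.run(
    ["gsutil", "ls", "gs://t5x_pretrained_models/models/"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(
        "gsutil cannot access the checkpoint bucket; run `gcloud init` or "
        "`gcloud auth login` on this node first:\n" + result.stderr
    )
```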