google / paxml

Pax is a JAX-based machine learning framework for training large-scale models. Pax allows for advanced and fully configurable experimentation and parallelization, and has demonstrated industry-leading model FLOP utilization rates.
Apache License 2.0

DEADLINE_EXCEEDED on 1024 GPUs. #77

Open mhugues opened 7 months ago

mhugues commented 7 months ago

Additional GRPC error information from remote target unknown_target_for_coordination_leader while calling /tensorflow.CoordinationService/RegisterTask: :{"created":"@1712965181.656280441","description":"Deadline Exceeded","file":"external/com_github_grpc_grpc/src/core/ext/filters/deadline/deadline_filter.cc","file_line":69,"grpc_status":4}
2024-04-12 23:39:41.656900: E external/xla/xla/pjrt/distributed/client.cc:96] Coordination service agent in error status: DEADLINE_EXCEEDED: Deadline Exceeded

Has anyone else seen this issue?
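
For anyone reading the log above, the JSON payload in the gRPC error can be decoded with a few lines of Python. This is only a minimal sketch for interpreting the message, not a fix: the helper name `decode_grpc_error` and the `GRPC_STATUS_NAMES` table are made up for illustration, and the table lists only a subset of the standard gRPC status codes.

```python
import json
from datetime import datetime, timezone

# Payload copied verbatim from the error message in this issue.
PAYLOAD = (
    '{"created":"@1712965181.656280441","description":"Deadline Exceeded",'
    '"file":"external/com_github_grpc_grpc/src/core/ext/filters/deadline/deadline_filter.cc",'
    '"file_line":69,"grpc_status":4}'
)

# Subset of the standard gRPC status codes (illustrative, not exhaustive).
GRPC_STATUS_NAMES = {
    0: "OK",
    1: "CANCELLED",
    2: "UNKNOWN",
    4: "DEADLINE_EXCEEDED",
    14: "UNAVAILABLE",
}


def decode_grpc_error(raw):
    """Hypothetical helper: turn the raw JSON payload into readable fields."""
    info = json.loads(raw)
    # "created" is a Unix timestamp prefixed with "@"; drop the fraction.
    seconds = int(float(info["created"].lstrip("@")))
    created = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "status": GRPC_STATUS_NAMES.get(info["grpc_status"], "UNRECOGNIZED"),
        "created_utc": created.isoformat(),
        "description": info["description"],
        "source": "{}:{}".format(info["file"], info["file_line"]),
    }


decoded = decode_grpc_error(PAYLOAD)
print(decoded["status"], decoded["created_utc"])
```

The status code 4 maps to DEADLINE_EXCEEDED, and the `created` field decodes to 2024-04-12T23:39:41 UTC, the same second as the coordination service agent error on the next log line (assuming that log clock is in UTC). In other words, the RegisterTask call timed out in gRPC's deadline filter and the agent reported the error immediately afterwards.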