Open YeBin2018 opened 8 months ago
I tried using the default value of cross_device_ops, and now it gets stuck, repeatedly printing the log "Local rendezvous recv item cancelled. Key hash: 15504120126296904051". Does anyone know anything about this?
Hi @YeBin2018,
Could you please share reproducible code or a Colab notebook, along with your environment details, so that we can fully understand the issue you are facing? Meanwhile, for support-related issues, consider seeking assistance from the dedicated research models forum on the TensorFlow Forum and on Stack Overflow. These forums benefit from a large user base, which increases the chance of a swift resolution to your technical inquiry.
Thanks
Sorry, I can't share the source code because it may involve company secrets. Our environment: an H800 machine with eight cards per machine, using an all-reduce architecture. The TensorFlow version is 2.14, running in the Docker image provided by NVIDIA. What does it mean when this log is printed repeatedly? I looked at the TensorFlow source code, but it is difficult to trace the cause of this log: "I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash:"
When I use the default value of cross_device_ops, it core dumps in jemalloc, as shown below. When I choose cross_device_ops=tf.distribute.ReductionToOneDevice(), it still doesn't work; it gets stuck. Does anyone know how to solve it?
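
Since the original code can't be shared, here is a minimal sketch of how the cross_device_ops setting could be swapped to narrow the problem down. It assumes tf.distribute.MirroredStrategy, which the thread does not actually name; the helper names build_strategy and smoke_test are hypothetical and only for illustration. It is not the poster's training code, just a small all-reduce check that may help separate a communication problem from a model problem.

```python
import importlib

# Cross-device ops classes available under tf.distribute.
CROSS_DEVICE_OPS = (
    "NcclAllReduce",
    "HierarchicalCopyAllReduce",
    "ReductionToOneDevice",
)


def build_strategy(ops_name=None):
    """Return a MirroredStrategy (assumed; not confirmed by the thread).

    ops_name=None keeps the default cross_device_ops.
    """
    # Validate before importing TensorFlow so a typo fails fast.
    if ops_name is not None and ops_name not in CROSS_DEVICE_OPS:
        raise ValueError(f"unknown cross_device_ops: {ops_name!r}")
    tf = importlib.import_module("tensorflow")
    if ops_name is None:
        return tf.distribute.MirroredStrategy()
    ops = getattr(tf.distribute, ops_name)()
    return tf.distribute.MirroredStrategy(cross_device_ops=ops)


def smoke_test(strategy):
    """Run one all-reduce of the constant 1.0 across all replicas."""
    tf = importlib.import_module("tensorflow")

    @tf.function
    def step():
        def replica_fn():
            ctx = tf.distribute.get_replica_context()
            # Sum a scalar 1.0 contributed by every replica.
            return ctx.all_reduce(tf.distribute.ReduceOp.SUM, tf.constant(1.0))

        return strategy.run(replica_fn)

    # One result per local replica.
    return strategy.experimental_local_results(step())
```

Calling smoke_test(build_strategy("HierarchicalCopyAllReduce")) and then repeating with "NcclAllReduce", "ReductionToOneDevice", and None would show whether a bare all-reduce already hangs or crashes with a given setting, independent of the model.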