tensorflow / models

Models and examples built with TensorFlow

tf1 upgrade to tf2, tf.distribute.MirroredStrategy core dump #11154

Open YeBin2018 opened 8 months ago

YeBin2018 commented 8 months ago

When I use the default value of cross_device_ops, it core dumps in jemalloc as shown below. When I choose cross_device_ops=tf.distribute.ReductionToOneDevice(), it still doesn't work; it gets stuck. Does anyone know how to solve this?

02-01 21:57:21.779 E0201 21:57:21.779611 2689 log.cpp:10] @ 0x7fa3d0809ec6 _ZN4brpc5PrintERSoP6ssl_stPKc.cold  
  02-01 21:57:21.770 E0201 21:57:21.769979 2689 log.cpp:10] @ 0x7fa3d0809c37 _ZN4brpc19NamingServiceThread7Actions12ResetServersERKSt6vectorINS_10ServerNodeESaIS3_EE.cold  
  02-01 21:57:21.757 E0201 21:57:21.757550 2689 log.cpp:10] @ 0x7fa3d08098de _ZSt16introsort_loopIN9__gnu_cxx17normal_iteratorIPN4brpc10ServerNodeESt6vectorIS3_SaIS3_EEEElNS0_5__ops15_Iter_less_iterEEvT_SB_T0T1.isra.0.cold  
  02-01 21:57:21.722 E0201 21:57:21.722803 2689 log.cpp:10] @ 0x7fa3d072f46c (unknown)  
  02-01 21:57:21.698 E0201 21:57:21.697870 2689 log.cpp:10] @ 0x7faa6f5b5768 do_rallocx  
  02-01 21:57:21.660 E0201 21:57:21.660003 2689 log.cpp:10] @ 0x7faa6f642703 prof_recent_alloc_restore_locked.isra.0  
  02-01 21:57:21.626 E0201 21:57:21.626035 2689 log.cpp:10] @ 0x7faa6f5c3b12 realloc  
  02-01 21:57:21.600 E0201 21:57:21.600236 2689 log.cpp:10] @ 0x7faa6f62ab0b hpa_try_alloc_batch_no_grow  
  02-01 21:57:21.578 E0201 21:57:21.577888 2689 log.cpp:10] @ 0x7faa6f62a381 hpa_shard_maybe_do_deferred_work  
  02-01 21:57:21.558 E0201 21:57:21.558209 2689 log.cpp:10] @ 0x7faa6f61cec1 je_edata_avail_remove_first  
  02-01 21:57:21.530 E0201 21:57:21.530472 2689 log.cpp:10] @ 0x7faa6f62b3ac hpa_alloc  
  02-01 21:57:21.496 E0201 21:57:21.496141 2689 log.cpp:10] @ 0x7faa6f33fdac (unknown)  
  02-01 21:57:21.439 E0201 21:57:21.439801 2689 log.cpp:10] @ 0x7faa6f263520 (unknown)  
  02-01 21:57:21.411 E0201 21:57:21.411706 2689 log.cpp:10] SIGSEGV (@0x0) received by PID 117 (TID 0x7f8630c6c640) from PID 0; stack trace:  
  02-01 21:57:21.411 E0201 21:57:21.411311 2689 log.cpp:10] PC: @ 0x0 (unknown)  
  02-01 21:57:21.377 E0201 21:57:21.377116 2689 log.cpp:10] Aborted at 1706795841 (unix time) try "date -d @1706795841" if you are using GNU date
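
Since the actual training code is not shared in this thread, the following is only a minimal sketch of the kind of MirroredStrategy setup being described; the model and dataset are placeholders, not the real workload, and the default cross_device_ops typically resolves to NCCL all-reduce on a multi-GPU host.

import tensorflow as tf

# Placeholder model and data; the real workload is not public.
def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(64)

# Default cross_device_ops: on a multi-GPU host this typically resolves to
# NCCL all-reduce. This is the configuration that crashes inside jemalloc
# in the trace above.
strategy = tf.distribute.MirroredStrategy()

# Alternative tried in the report: reduce on a single device instead of
# all-reducing across GPUs.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())

with strategy.scope():
    model = build_model()
    model.compile(optimizer="adam", loss="mse")

model.fit(dataset, epochs=1)
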
YeBin2018 commented 8 months ago

I tried using the default value of cross_device_ops, and now it gets stuck, repeatedly printing the log "Local rendezvous recv item cancelled. Key hash: 15504120126296904051". Does anyone know anything about this?

laxmareddyp commented 8 months ago

Hi @YeBin2018 ,

Could you please provide reproducible code or a Colab notebook, along with the details of your environment, so we can get a complete understanding of the issue you are facing? Meanwhile, for support-related issues, consider seeking assistance from the dedicated research models forum on the TensorFlow Forum or on Stack Overflow. These forums benefit from a large user base, increasing the potential for a swift resolution to your technical inquiry.

Thanks

YeBin2018 commented 8 months ago

Sorry, it is not convenient to provide the source code because it may involve company secrets. Our environment is an H800 machine with eight cards (GPUs), using an all-reduce architecture. The TensorFlow version is 2.14, running in the Docker image provided by NVIDIA. I want to know what it means when this log is printed repeatedly. I looked at the TensorFlow source code, but it is difficult to trace the cause of this log: "I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash:"
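
Not an answer, but a sketch of how one might gather more detail around that message, under the assumption that the hang is somewhere in the cross-device reduction path: TF_CPP_MIN_LOG_LEVEL and TF_CPP_VMODULE control the C++ runtime's log verbosity (the local_rendezvous module name comes from the file shown in the log line), and pinning cross_device_ops to a non-NCCL implementation such as tf.distribute.HierarchicalCopyAllReduce can help tell whether the problem is specific to the NCCL path.

import os

# These must be set before TensorFlow is imported so the C++ runtime picks
# them up. The verbosity levels here are a guess, not a known fix.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"             # keep INFO messages
os.environ["TF_CPP_VMODULE"] = "local_rendezvous=2"  # extra VLOG detail for the file in the log

import tensorflow as tf

# Log which device each op is placed on, to see where execution stalls.
tf.debugging.set_log_device_placement(True)

# Pin the reduction implementation explicitly instead of relying on the
# default NCCL all-reduce; if this configuration runs, the hang is likely
# NCCL-specific.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
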