tensorflow / models

Models and examples built with TensorFlow

tf1 upgrade to tf2, tf.distribute.MirroredStrategy core dump #11154

Open YeBin2018 opened 8 months ago

YeBin2018 commented 8 months ago

When I use the default value of cross_device_ops, it core dumps in jemalloc as shown below. When I choose cross_device_ops=tf.distribute.ReductionToOneDevice(), it still doesn't work; it gets stuck. Does anyone know how to solve this?

02-01 21:57:21.779 E0201 21:57:21.779611 2689 log.cpp:10] @ 0x7fa3d0809ec6 _ZN4brpc5PrintERSoP6ssl_stPKc.cold  
  02-01 21:57:21.770 E0201 21:57:21.769979 2689 log.cpp:10] @ 0x7fa3d0809c37 _ZN4brpc19NamingServiceThread7Actions12ResetServersERKSt6vectorINS_10ServerNodeESaIS3_EE.cold  
  02-01 21:57:21.757 E0201 21:57:21.757550 2689 log.cpp:10] @ 0x7fa3d08098de _ZSt16introsort_loopIN9__gnu_cxx17normal_iteratorIPN4brpc10ServerNodeESt6vectorIS3_SaIS3_EEEElNS0_5__ops15_Iter_less_iterEEvT_SB_T0T1.isra.0.cold  
  02-01 21:57:21.722 E0201 21:57:21.722803 2689 log.cpp:10] @ 0x7fa3d072f46c (unknown)  
  02-01 21:57:21.698 E0201 21:57:21.697870 2689 log.cpp:10] @ 0x7faa6f5b5768 do_rallocx  
  02-01 21:57:21.660 E0201 21:57:21.660003 2689 log.cpp:10] @ 0x7faa6f642703 prof_recent_alloc_restore_locked.isra.0  
  02-01 21:57:21.626 E0201 21:57:21.626035 2689 log.cpp:10] @ 0x7faa6f5c3b12 realloc  
  02-01 21:57:21.600 E0201 21:57:21.600236 2689 log.cpp:10] @ 0x7faa6f62ab0b hpa_try_alloc_batch_no_grow  
  02-01 21:57:21.578 E0201 21:57:21.577888 2689 log.cpp:10] @ 0x7faa6f62a381 hpa_shard_maybe_do_deferred_work  
  02-01 21:57:21.558 E0201 21:57:21.558209 2689 log.cpp:10] @ 0x7faa6f61cec1 je_edata_avail_remove_first  
  02-01 21:57:21.530 E0201 21:57:21.530472 2689 log.cpp:10] @ 0x7faa6f62b3ac hpa_alloc  
  02-01 21:57:21.496 E0201 21:57:21.496141 2689 log.cpp:10] @ 0x7faa6f33fdac (unknown)  
  02-01 21:57:21.439 E0201 21:57:21.439801 2689 log.cpp:10] @ 0x7faa6f263520 (unknown)  
  02-01 21:57:21.411 E0201 21:57:21.411706 2689 log.cpp:10] SIGSEGV (@0x0) received by PID 117 (TID 0x7f8630c6c640) from PID 0; stack trace:  
  02-01 21:57:21.411 E0201 21:57:21.411311 2689 log.cpp:10] PC: @ 0x0 (unknown)  
  02-01 21:57:21.377 E0201 21:57:21.377116 2689 log.cpp:10] Aborted at 1706795841 (unix time) try "date -d @1706795841" if you are using GNU date
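
Since the actual training code is not shared in this thread, the following is only a minimal sketch of the kind of MirroredStrategy setup being described; the model and dataset are placeholders, not the real workload, and the default cross_device_ops typically resolves to NCCL all-reduce on a multi-GPU host.

import tensorflow as tf

# Placeholder model and data; the real workload is not public.
def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 32]), tf.random.normal([1024, 1]))
).batch(64)

# Default cross_device_ops: on a multi-GPU host this typically resolves to
# NCCL all-reduce. This is the configuration that crashes inside jemalloc
# in the trace above.
strategy = tf.distribute.MirroredStrategy()

# Alternative tried in the report: reduce on a single device instead of
# all-reducing across GPUs.
# strategy = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())

with strategy.scope():
    model = build_model()
    model.compile(optimizer="adam", loss="mse")

model.fit(dataset, epochs=1)
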
YeBin2018 commented 8 months ago

I tried using the default value of cross_device_ops, and now it gets stuck, repeatedly printing the log "Local rendezvous recv item cancelled. Key hash: 15504120126296904051". Does anyone know anything about this?

laxmareddyp commented 8 months ago

Hi @YeBin2018 ,

Could you please provide reproducible code or a Colab notebook, along with the details of your environment, so we can get a complete understanding of the issue you are facing? Meanwhile, for support-related issues, consider seeking assistance from the dedicated research models forum on the TensorFlow Forum or on Stack Overflow. These forums benefit from a large user base, increasing the potential for a swift resolution to your technical inquiry.

Thanks

YeBin2018 commented 8 months ago

Sorry, it is not convenient to provide the source code because it may involve company secrets. Our environment is an H800 machine with eight cards (GPUs), using an all-reduce architecture. The TensorFlow version is 2.14, running in the Docker image provided by NVIDIA. I want to know what it means when this log is printed repeatedly. I looked at the TensorFlow source code, but it is difficult to trace the cause of this log: "I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash:"
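
Not an answer, but a sketch of how one might gather more detail around that message, under the assumption that the hang is somewhere in the cross-device reduction path: TF_CPP_MIN_LOG_LEVEL and TF_CPP_VMODULE control the C++ runtime's log verbosity (the local_rendezvous module name comes from the file shown in the log line), and pinning cross_device_ops to a non-NCCL implementation such as tf.distribute.HierarchicalCopyAllReduce can help tell whether the problem is specific to the NCCL path.

import os

# These must be set before TensorFlow is imported so the C++ runtime picks
# them up. The verbosity levels here are a guess, not a known fix.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"             # keep INFO messages
os.environ["TF_CPP_VMODULE"] = "local_rendezvous=2"  # extra VLOG detail for the file in the log

import tensorflow as tf

# Log which device each op is placed on, to see where execution stalls.
tf.debugging.set_log_device_placement(True)

# Pin the reduction implementation explicitly instead of relying on the
# default NCCL all-reduce; if this configuration runs, the hang is likely
# NCCL-specific.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
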