tensorflow / models

Models and examples built with TensorFlow

How to fine-tune BERT with MultiWorkerMirroredStrategy? #7560

Closed frankinwi closed 2 years ago

frankinwi commented 4 years ago

System information
- What is the top-level directory of the model you are using: bert
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.0.0-rc0
- Bazel version (if compiling from source): No
- CUDA/cuDNN version: 10
- GPU model and memory: TITAN V

Describe the problem
I'm trying to fine-tune BERT with MultiWorkerMirroredStrategy. I have 2 machines, each with a TITAN V GPU. I followed this guide and set the TF_CONFIG environment variable as follows:

Machine1: { "task": { "type": "worker", "index": 0 }, "cluster": { "worker": ["172.17.99.2:2222","172.17.99.4:2222"] } }

Machine2: { "task": { "type": "worker", "index": 1 }, "cluster": { "worker": ["172.17.99.2:2222","172.17.99.4:2222"] } }
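Equivalently, the same TF_CONFIG can be set from Python before the strategy is created (a minimal sketch of the standard approach; on Machine2 the task index becomes 1):

import json
import os

# Same two-worker cluster as above; change "index" to 1 on Machine2.
os.environ["TF_CONFIG"] = json.dumps({
    "task": {"type": "worker", "index": 0},
    "cluster": {"worker": ["172.17.99.2:2222", "172.17.99.4:2222"]},
})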

The script is as follows:

python run_squad.py \
  --do_lower_case=False \
  --mode='train' \
  --input_meta_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_meta_data \
  --train_data_path=${SQUAD_DIR}/squad_${SQUAD_VERSION}_train.tf_record \
  --predict_file=$SQUAD_DIR/dev-v1.1.json \
  --vocab_file=${BERT_BASE_DIR}/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --train_batch_size=4 \
  --predict_batch_size=4 \
  --learning_rate=8e-5 \
  --num_train_epochs=2 \
  --model_dir=${RESULT} \
  --strategy_type=multi_worker_mirror

Source code / logs The log on Machine1: 2019-09-16 00:50:18.112676: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2019-09-16 00:50:18.172982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b3:00.0 2019-09-16 00:50:18.173271: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:18.174912: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:18.176467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:18.177222: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:18.179420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:18.180971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:18.185364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:18.187935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2019-09-16 00:50:18.188359: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2019-09-16 00:50:18.196610: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz 2019-09-16 00:50:18.196801: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x57397f0 executing computations on platform Host. Devices: 2019-09-16 00:50:18.196817: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version 2019-09-16 00:50:18.540526: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x573c010 executing computations on platform CUDA. 
Devices: 2019-09-16 00:50:18.540557: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): TITAN V, Compute Capability 7.0 2019-09-16 00:50:18.541956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b3:00.0 2019-09-16 00:50:18.541998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:18.542012: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:18.542024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:18.542037: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:18.542049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:18.542061: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:18.542074: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:18.544425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2019-09-16 00:50:18.544460: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:18.546390: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-16 00:50:18.546406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2019-09-16 00:50:18.546413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2019-09-16 00:50:18.549092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11036 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:b3:00.0, compute capability: 7.0) 2019-09-16 00:50:18.551454: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b3:00.0 2019-09-16 00:50:18.551495: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:18.551509: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:18.551522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:18.551534: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:18.551546: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:18.551558: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:18.551570: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:18.554730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu 
devices: 0 2019-09-16 00:50:18.554757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-16 00:50:18.554766: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2019-09-16 00:50:18.554774: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2019-09-16 00:50:18.557328: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:worker/replica:0/task:0/device:GPU:0 with 11036 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:b3:00.0, compute capability: 7.0) 2019-09-16 00:50:18.559001: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222, 1 -> 172.17.99.4:2222} 2019-09-16 00:50:18.559526: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:2222 I0916 00:50:18.560058 139774642997056 collective_all_reduce_strategy.py:269] Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:0/device:CPU:0', '/job:worker/replica:0/task:0/device:XLA_CPU:0', '/job:worker/replica:0/task:0/device:XLA_GPU:0', '/job:worker/replica:0/task:0/device:GPU:0'] I0916 00:50:18.561196 139774642997056 collective_all_reduce_strategy.py:310] Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['172.17.99.2:2222', '172.17.99.4:2222']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ('/job:worker/task:0/device:GPU:0',), communication = CollectiveCommunication.AUTO I0916 00:50:18.561356 139774642997056 run_squad.py:191] Training using customized training loop with distribution strategy. I0916 00:50:30.909670 139774642997056 model_training_utils.py:209] Checkpoint file /home/xjzhang/xjtu/bert-chinese-ner/model/tf2.0/cased_L-12_H-768_A-12/bert_model.ckpt found and restoring from initial checkpoint for core model. I0916 00:50:32.015406 139774642997056 model_training_utils.py:212] Loading from checkpoint file completed W0916 00:50:37.113903 139753596905216 deprecation.py:323] From /home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0916 00:50:37.360016 139753596905216 optimizer_v2.py:1029] Gradients do not exist for variables ['bert_model/pooler_transform/kernel:0', 'bert_model/pooler_transform/bias:0'] when minimizing the loss.

Because of the error that occurred on Machine2, the process on Machine1 hangs.

The log on Machine2: 2019-09-16 00:50:13.026187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1 2019-09-16 00:50:13.090962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b4:00.0 2019-09-16 00:50:13.091224: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:13.092815: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:13.094366: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:13.094706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:13.096718: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:13.098325: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:13.102747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:13.105282: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2019-09-16 00:50:13.105720: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA 2019-09-16 00:50:13.114003: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz 2019-09-16 00:50:13.114196: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4802470 executing computations on platform Host. Devices: 2019-09-16 00:50:13.114213: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version 2019-09-16 00:50:13.432372: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4804c90 executing computations on platform CUDA. 
Devices: 2019-09-16 00:50:13.432404: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): TITAN V, Compute Capability 7.0 2019-09-16 00:50:13.433817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b4:00.0 2019-09-16 00:50:13.433858: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:13.433873: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:13.433885: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:13.433898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:13.433910: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:13.433923: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:13.433935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:13.436333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0 2019-09-16 00:50:13.436364: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:13.438332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-16 00:50:13.438348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2019-09-16 00:50:13.438356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2019-09-16 00:50:13.441026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11036 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:b4:00.0, compute capability: 7.0) 2019-09-16 00:50:13.444723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: name: TITAN V major: 7 minor: 0 memoryClockRate(GHz): 1.455 pciBusID: 0000:b4:00.0 2019-09-16 00:50:13.444770: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2019-09-16 00:50:13.444784: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2019-09-16 00:50:13.444797: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2019-09-16 00:50:13.444810: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2019-09-16 00:50:13.444822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2019-09-16 00:50:13.444834: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2019-09-16 00:50:13.444847: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2019-09-16 00:50:13.447230: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu 
devices: 0 2019-09-16 00:50:13.447254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-09-16 00:50:13.447263: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0 2019-09-16 00:50:13.447270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N 2019-09-16 00:50:13.450763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:worker/replica:0/task:1/device:GPU:0 with 11036 MB memory) -> physical GPU (device: 0, name: TITAN V, pci bus id: 0000:b4:00.0, compute capability: 7.0) 2019-09-16 00:50:13.452625: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:258] Initialize GrpcChannelCache for job worker -> {0 -> 172.17.99.2:2222, 1 -> localhost:2222} 2019-09-16 00:50:13.454023: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:365] Started server with target: grpc://localhost:2222 I0916 00:50:13.454599 139817975859008 collective_all_reduce_strategy.py:269] Enabled multi-worker collective ops with available devices: ['/job:worker/replica:0/task:1/device:CPU:0', '/job:worker/replica:0/task:1/device:XLA_CPU:0', '/job:worker/replica:0/task:1/device:XLA_GPU:0', '/job:worker/replica:0/task:1/device:GPU:0'] I0916 00:50:13.455738 139817975859008 collective_all_reduce_strategy.py:310] Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['172.17.99.2:2222', '172.17.99.4:2222']}, task_type = 'worker', task_id = 1, num_workers = 2, local_devices = ('/job:worker/task:1/device:GPU:0',), communication = CollectiveCommunication.AUTO I0916 00:50:13.455904 139817975859008 run_squad.py:191] Training using customized training loop with distribution strategy. I0916 00:50:30.909629 139817975859008 model_training_utils.py:209] Checkpoint file /home/xjzhang/xjtu/bert-chinese-ner/model/tf2.0/cased_L-12_H-768_A-12/bert_model.ckpt found and restoring from initial checkpoint for core model. I0916 00:50:32.015478 139817975859008 model_training_utils.py:212] Loading from checkpoint file completed W0916 00:50:37.128744 139797024732928 deprecation.py:323] From /home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/clip_ops.py:301: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0916 00:50:37.377812 139797024732928 optimizer_v2.py:1029] Gradients do not exist for variables ['bert_model/pooler_transform/kernel:0', 'bert_model/pooler_transform/bias:0'] when minimizing the loss. 2019-09-16 00:50:37.473745: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at collective_ops.cc:365 : Internal: RecvBufResponse returned 2359296 bytes where to_tensor expected 3072 I0916 00:50:37.480995 139797024732928 coordinator.py:219] Error reported to Coordinator: in converted code:

/home/xjzhang/xjtu/models/official/bert/model_training_utils.py:253 _replicated_step  *
    optimizer.apply_gradients(zip(grads, tvars))
/home/xjzhang/xjtu/models/official/bert/optimization.py:143 apply_gradients  *
    return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars))
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:435 apply_gradients
    self._create_slots(var_list)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/adam.py:146 _create_slots
    self.add_slot(var, 'm')
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:587 add_slot
    initial_value=initial_value)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:259 __call__
    return cls._variable_v2_call(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:253 _variable_v2_call
    shape=shape)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1410 create_colocated_variable
    return next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/shared_variable_creator.py:69 create_new_variable
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1322 creator_with_resource_vars
    return self._create_variable(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:526 _create_variable
    values.SyncOnReadVariable, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:2288 create_mirrored_variable
    value_list = real_mirrored_creator(devices, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:518 _real_mirrored_creator
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:358 variable_capturing_scope
    lifted_initializer_graph=lifted_initializer_graph, **kwds)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:261 __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:141 __init__
    initial_value() if init_from_fn else initial_value,
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/collective_all_reduce_strategy.py:347 initial_value_fn
    collective_instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/collective_ops.py:161 broadcast_recv
    instance_key=instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_collective_ops.py:66 collective_bcast_recv
    _six.raise_from(_core._status_to_exception(e.code, message), None)
<string>:3 raise_from

InternalError: RecvBufResponse returned 2359296 bytes where to_tensor expected 3072 [Op:CollectiveBcastRecv]

Traceback (most recent call last):
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/training/coordinator.py", line 297, in stop_on_exception
    yield
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py", line 879, in run
    self.main_result = self.main_fn(*self.main_args, **self.main_kwargs)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 237, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.InternalError: in converted code:

/home/xjzhang/xjtu/models/official/bert/model_training_utils.py:253 _replicated_step  *
    optimizer.apply_gradients(zip(grads, tvars))
/home/xjzhang/xjtu/models/official/bert/optimization.py:143 apply_gradients  *
    return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars))
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:435 apply_gradients
    self._create_slots(var_list)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/adam.py:146 _create_slots
    self.add_slot(var, 'm')
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:587 add_slot
    initial_value=initial_value)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:259 __call__
    return cls._variable_v2_call(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:253 _variable_v2_call
    shape=shape)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1410 create_colocated_variable
    return next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/shared_variable_creator.py:69 create_new_variable
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1322 creator_with_resource_vars
    return self._create_variable(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:526 _create_variable
    values.SyncOnReadVariable, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:2288 create_mirrored_variable
    value_list = real_mirrored_creator(devices, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:518 _real_mirrored_creator
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:358 variable_capturing_scope
    lifted_initializer_graph=lifted_initializer_graph, **kwds)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:261 __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:141 __init__
    initial_value() if init_from_fn else initial_value,
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/collective_all_reduce_strategy.py:347 initial_value_fn
    collective_instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/collective_ops.py:161 broadcast_recv
    instance_key=instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_collective_ops.py:66 collective_bcast_recv
    _six.raise_from(_core._status_to_exception(e.code, message), None)
<string>:3 raise_from

InternalError: RecvBufResponse returned 2359296 bytes where to_tensor expected 3072 [Op:CollectiveBcastRecv]

Traceback (most recent call last):
  File "/home/xjzhang/xjtu/models/official/bert/run_squad.py", line 372, in <module>
    app.run(main)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/xjzhang/xjtu/models/official/bert/run_squad.py", line 364, in main
    train_squad(strategy, input_meta_data)
  File "/home/xjzhang/xjtu/models/official/bert/run_squad.py", line 250, in train_squad
    custom_callbacks=custom_callbacks)
  File "/home/xjzhang/xjtu/models/official/bert/model_training_utils.py", line 365, in run_customized_training_loop
    tf.convert_to_tensor(steps, dtype=tf.int32))
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 427, in __call__
    self._initialize(args, kwds, add_initializers_to=initializer_map)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 370, in _initialize
    *args, **kwds))
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1847, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2147, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 2038, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 915, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 320, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/framework/func_graph.py", line 905, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.InternalError: in converted code:

/home/xjzhang/xjtu/models/official/bert/model_training_utils.py:276 train_steps  *
    strategy.experimental_run_v2(_replicated_step, args=(next(iterator),))
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:760 experimental_run_v2
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
/home/xjzhang/xjtu/models/official/bert/model_training_utils.py:253 _replicated_step  *
    optimizer.apply_gradients(zip(grads, tvars))
/home/xjzhang/xjtu/models/official/bert/optimization.py:143 apply_gradients  *
    return super(AdamWeightDecay, self).apply_gradients(zip(grads, tvars))
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:435 apply_gradients
    self._create_slots(var_list)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/adam.py:146 _create_slots
    self.add_slot(var, 'm')
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py:587 add_slot
    initial_value=initial_value)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:259 __call__
    return cls._variable_v2_call(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:253 _variable_v2_call
    shape=shape)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1410 create_colocated_variable
    return next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/shared_variable_creator.py:69 create_new_variable
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:1322 creator_with_resource_vars
    return self._create_variable(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:526 _create_variable
    values.SyncOnReadVariable, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/distribute_lib.py:2288 create_mirrored_variable
    value_list = real_mirrored_creator(devices, *args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/mirrored_strategy.py:518 _real_mirrored_creator
    v = next_creator(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:64 getter
    return captured_getter(captured_previous, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:358 variable_capturing_scope
    lifted_initializer_graph=lifted_initializer_graph, **kwds)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/variables.py:261 __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py:141 __init__
    initial_value() if init_from_fn else initial_value,
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/distribute/collective_all_reduce_strategy.py:347 initial_value_fn
    collective_instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/collective_ops.py:161 broadcast_recv
    instance_key=instance_key)
/home/xjzhang/python_env/xenv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_collective_ops.py:66 collective_bcast_recv
    _six.raise_from(_core._status_to_exception(e.code, message), None)
<string>:3 raise_from
 InternalError: RecvBufResponse returned 2359296 bytes where to_tensor expected 3072 [Op:CollectiveBcastRecv]

2019-09-16 00:50:38.332468: W tensorflow/core/common_runtime/eager/context.cc:290] Unable to destroy server object, so releasing instead. Servers don't support clean shutdown.

saberkun commented 4 years ago

MirroredStrategy is not expected to work in a cross-machine setup. If you just run with a single GPU, does it work for you?

frankinwi commented 4 years ago

@saberkun I have tried a single machine with 2 GPUs using MirroredStrategy and it works fine. But now I want to try distributed training with MultiWorkerMirroredStrategy. The code in run_squad.py (line 355), strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(), suggests that the cross-machine case is supported.
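For reference, here is a minimal sketch of how I understand MultiWorkerMirroredStrategy is meant to be used with a custom training loop in TF 2.0 (placeholder model, optimizer, and dataset for illustration only, not the actual run_squad.py code):

import tensorflow as tf

# TF_CONFIG (see above) must already be set on every worker at this point.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Placeholder model/optimizer; run_squad.py builds BERT and AdamWeightDecay here.
    model = tf.keras.Sequential([tf.keras.layers.Dense(2)])
    optimizer = tf.keras.optimizers.Adam(learning_rate=8e-5)

# Toy dataset standing in for the SQuAD tf_record input.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([64, 10]),
     tf.random.uniform([64], maxval=2, dtype=tf.int32))).batch(4)
dist_iterator = iter(strategy.experimental_distribute_dataset(dataset))

@tf.function
def train_step(iterator):
    def replicated_step(inputs):
        features, labels = inputs
        with tf.GradientTape() as tape:
            logits = model(features, training=True)
            loss = tf.reduce_mean(
                tf.keras.losses.sparse_categorical_crossentropy(
                    labels, logits, from_logits=True))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss
    # TF 2.0 API; renamed to strategy.run in later releases.
    return strategy.experimental_run_v2(replicated_step, args=(next(iterator),))

# Must be started on both workers at roughly the same time.
loss = train_step(dist_iterator)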

saberkun commented 4 years ago

Oh, I see. Let me assign this to my coworker to take a look at MultiWorkerMirroredStrategy. At least for TF 2.0, we have daily regression checks for MirroredStrategy and are working on TPUStrategy.

jingli0118 commented 4 years ago

Hi, is there any progress on this issue?

jvishnuvardhan commented 2 years ago

@frankinwi This is a stale issue. We are checking to see if you still need help on this issue. Can you please test with the latest TensorFlow (TF 2.9.1 and tf-nightly) and let us know if the issue still persists with the recent versions of TF? Thanks!
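If it helps when retesting, note that in recent TF releases the strategy no longer lives under the experimental namespace; a minimal sketch (each worker still reads its cluster/task definition from TF_CONFIG as in the original report):

import tensorflow as tf

# Stable API in TF 2.9: tf.distribute.MultiWorkerMirroredStrategy.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)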