Closed by W-O-W 1 week ago
I tried replacing HvdAllToAllEmbedding with BasicEmbedding, but when I look up the same mocked id in BasicEmbedding and print the result with tf.print, the embeddings differ across workers during training. The Dense kernels I printed are identical on all workers. My guess is that the gradients of HvdAllToAllEmbedding are not being broadcast by Horovod.
Setting os.environ['TF_XLA_FLAGS'] = "" fixes it.
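A minimal sketch of that workaround follows. It assumes the demo script enables XLA auto-clustering through the TF_XLA_FLAGS environment variable, and that the variable should be cleared before TensorFlow is imported so the empty value is what gets read. The helper name `disable_xla_flags` is hypothetical.

```python
import os


def disable_xla_flags() -> None:
    """Clear TF_XLA_FLAGS so no XLA auto-clustering flag reaches TensorFlow.

    Assumption: the script previously put a flag such as
    --tf_xla_auto_jit=2 in this variable.
    """
    os.environ["TF_XLA_FLAGS"] = ""


# Call this at the very top of the training script.
disable_xla_flags()

# import tensorflow as tf  # import TensorFlow only after the variable is set
```

Under horovodrun every worker runs the same script, so each process clears the variable for itself.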
Try XLA JIT level 1 for now. TFRA with XLA support will be available soon.
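One way to read "XLA JIT level 1" is the `--tf_xla_auto_jit=1` auto-clustering flag; that reading is an assumption, as is the starting flag value used below. The sketch rewrites an existing TF_XLA_FLAGS string so that only the auto-JIT level changes and any other flags are kept. `with_auto_jit_level` is a hypothetical helper, not part of TensorFlow or TFRA.

```python
import os


def with_auto_jit_level(flags: str, level: int) -> str:
    """Return `flags` with any --tf_xla_auto_jit entry replaced by `level`."""
    kept = [f for f in flags.split() if not f.startswith("--tf_xla_auto_jit")]
    return " ".join(kept + [f"--tf_xla_auto_jit={level}"])


# Hypothetical starting value; the demo's real flags may differ.
current = os.environ.get("TF_XLA_FLAGS", "--tf_xla_auto_jit=2")
os.environ["TF_XLA_FLAGS"] = with_auto_jit_level(current, 1)

# import tensorflow as tf  # import TensorFlow only after the variable is set
```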
System information
Describe the bug
The demo movielens-1m-keras-with-horovod fails to run:
[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
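To make the shape rule in that message concrete, the sketch below evaluates it in plain Python with the values from the log. It only reproduces the arithmetic of the check; it does not call TensorFlow, and `expected_updates_shape` is a hypothetical helper named for illustration.

```python
def expected_updates_shape(indices_shape, buffer_shape, batch_dim, num_index_dims):
    """Shape the error message requires for `updates`:
    indices.shape[:batch_dim] + buffer_shape[num_index_dims:]."""
    return list(indices_shape[:batch_dim]) + list(buffer_shape[num_index_dims:])


# Values copied from the log line above.
indices_shape = [2, 1]
buffer_shape = [1, 32]
updates_shape = [1, 32]

expected = expected_updates_shape(
    indices_shape, buffer_shape, batch_dim=1, num_index_dims=1
)
print(expected)                   # [2, 32]
print(updates_shape == expected)  # False
```

So the ScatterNd node receives two indices but only one row of updates (shape [1, 32] where [2, 32] is required), which is why the tf2xla conversion rejects it.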
Code to reproduce the issue
I removed the LayerNormalization op, then executed this command:
horovodrun -np 2 python movielens-1m-keras-with-horovod.py --mode="train" --model_dir="./model_dir" --export_dir="./export_dir" \
    --steps_per_epoch=${1:-20000} --shuffle=${2:-True}
Other info / logs
[1,1]:2024-09-03 17:36:47.334739: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f830800cb80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,1]:2024-09-03 17:36:47.334775: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,1]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,1]:I0000 00:00:1725356207.355120 11294 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,1]:2024-09-03 17:36:47.355318: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:47.355394: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.398339: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f803000a400 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,0]:2024-09-03 17:36:49.398366: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,0]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,0]:I0000 00:00:1725356209.419932 11286 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,0]:2024-09-03 17:36:49.420144: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.423370: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.475450: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:49.479112: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.479265: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node forward_call_1818}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.479510: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node forward_call_1823}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:2024-09-03 17:36:49.482065: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.482107: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.482270: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node forward_call_1818}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.482339: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node forward_call_1823}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:Traceback (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]: app.run(main)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]: _run_main(main, args)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]: sys.exit(main(argv))
[1,1]: ^^^^^^^^^^
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]: train()
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]: model.fit(dataset,
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
[1,1]: raise e.with_traceback(filtered_tb) from None
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
[1,1]: tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
[1,1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1,1]:tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
[1,1]:
[1,1]:Detected at node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd defined at (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1807, in fit
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1401, in train_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1384, in step_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1373, in run_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1150, in train_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 590, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 450, in call
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 318, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py", line 564, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 312, in embedding_lookup_unique_base
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 441, in alltoall_embedding_lookup
[1,1]:
[1,1]:Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]: [[cluster_5_1/xla_compile]] [Op:__inference_train_function_5638]