tensorflow / recommenders-addons

Additional utils and helpers to extend TensorFlow when building recommendation systems, contributed and maintained by SIG Recommenders.
Apache License 2.0

Fail to run demo: movielens-1m-keras-with-horovod #458

Closed: W-O-W closed this issue 1 week ago

W-O-W commented 2 months ago

System information

Describe the bug

The demo movielens-1m-keras-with-horovod fails to run:

[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1

Code to reproduce the issue

I removed the LayerNormalization op, then executed this command:

horovodrun -np 2 python movielens-1m-keras-with-horovod.py --mode="train" --model_dir="./model_dir" --export_dir="./export_dir" \
    --steps_per_epoch=${1:-20000} --shuffle=${2:-True}

Other info / logs

[1,1]:2024-09-03 17:36:47.334739: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f830800cb80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,1]:2024-09-03 17:36:47.334775: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,1]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,1]:I0000 00:00:1725356207.355120 11294 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,1]:2024-09-03 17:36:47.355318: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:47.355394: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.398339: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f803000a400 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,0]:2024-09-03 17:36:49.398366: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,0]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,0]:I0000 00:00:1725356209.419932 11286 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,0]:2024-09-03 17:36:49.420144: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.423370: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.475450: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:49.479112: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.479265: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node forward_call_1818}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.479510: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node forward_call_1823}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:2024-09-03 17:36:49.482065: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.482107: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.482270: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node forward_call_1818}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.482339: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node forward_call_1823}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:Traceback (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]: app.run(main)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]: _run_main(main, args)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]: sys.exit(main(argv))
[1,1]: ^^^^^^^^^^
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]: train()
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]: model.fit(dataset,
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
[1,1]: raise e.with_traceback(filtered_tb) from None
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
[1,1]: tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
[1,1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1,1]:tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
[1,1]:
[1,1]:Detected at node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd defined at (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1807, in fit
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1401, in train_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1384, in step_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1373, in run_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1150, in train_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 590, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 450, in call
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 318, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py", line 564, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 312, in embedding_lookup_unique_base
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 441, in alltoall_embedding_lookup
[1,1]:
[1,1]:Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]: [[cluster_5_1/xla_compile]] [Op:__inference_train_function_5638]
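For readers decoding the ScatterNd message in the logs, the shape rule it quotes can be worked through by hand. The sketch below is plain Python (no TensorFlow); the helper name `expected_updates_shape` is made up for illustration, and the numbers are copied from the error line. It only reproduces the arithmetic of the check and does not explain why the shapes disagree under XLA.

```python
def expected_updates_shape(indices_shape, buffer_shape, num_index_dims, batch_dim):
    """Apply the rule quoted in the error message:
    updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:]
    """
    return list(indices_shape[:batch_dim]) + list(buffer_shape[num_index_dims:])


# Values copied from the log line.
indices_shape = [2, 1]
buffer_shape = [1, 32]
updates_shape = [1, 32]

expected = expected_updates_shape(
    indices_shape, buffer_shape, num_index_dims=1, batch_dim=1
)

print("expected updates.shape:", expected)       # [2, 32]
print("reported updates.shape:", updates_shape)  # [1, 32]
print("shapes match:", expected == updates_shape)
```

In other words, the check wants one update row per index (two rows of width 32), while the op received a single row.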

W-O-W commented 2 months ago

I tried replacing HvdAllToAllEmbedding with BasicEmbedding, but when I look up the same mock id in BasicEmbedding and print the result with tf.print, the values differ across workers during training. The Dense kernels are identical (I printed them). My guess is that the gradients of HvdAllToAllEmbedding are not broadcast by Horovod.
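The guess above can be illustrated with a toy model. This is a minimal plain-Python sketch, not Horovod or TFRA code: `sgd_step`, `allreduce_mean`, and the gradient values are all hypothetical. It only shows that two workers starting from the same embedding row drift apart if each applies its own gradient, and stay identical if the gradients are averaged first.

```python
def sgd_step(weights, grads, lr=0.1):
    """One plain SGD update on a single embedding row."""
    return [w - lr * g for w, g in zip(weights, grads)]


def allreduce_mean(per_worker_grads):
    """Stand-in for a gradient all-reduce: element-wise mean across workers."""
    n = len(per_worker_grads)
    return [sum(column) / n for column in zip(*per_worker_grads)]


# Same initial row on both workers; hypothetical per-worker gradients.
init_row = [0.5, -0.2]
worker_grads = [[1.0, 0.0], [0.0, 2.0]]

# Case 1: no synchronisation, each worker applies only its own gradient.
unsynced = [sgd_step(init_row, g) for g in worker_grads]

# Case 2: every worker applies the averaged gradient.
mean_grad = allreduce_mean(worker_grads)
synced = [sgd_step(init_row, mean_grad) for _ in worker_grads]

print("unsynced rows equal:", unsynced[0] == unsynced[1])
print("synced rows equal:  ", synced[0] == synced[1])
```

Whether this is what actually happens inside BasicEmbedding or HvdAllToAllEmbedding is not confirmed anywhere in this thread.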

W-O-W commented 2 months ago

Setting os.environ['TF_XLA_FLAGS'] = "" fixes it.
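A minimal sketch of that workaround, assuming the assignment is placed at the very top of the training script and that the script (or the shell) had previously put XLA auto-clustering flags in `TF_XLA_FLAGS`. Placing it before the TensorFlow import is a precaution, not something stated in the thread.

```python
import os

# Clear any XLA flags inherited from the shell or set earlier in the script.
# Assumption: this runs before TensorFlow is imported, so the cleared value
# is what TensorFlow sees when it reads the variable.
os.environ["TF_XLA_FLAGS"] = ""

# import tensorflow as tf  # import only after the variable has been set

print("TF_XLA_FLAGS =", repr(os.environ["TF_XLA_FLAGS"]))
```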

MoFHeka commented 1 week ago

Try XLA JIT level 1 for now. TFRA with XLA support will be available soon.
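One way to apply this suggestion, assuming "XLA JIT level 1" refers to the `--tf_xla_auto_jit=1` setting of `TF_XLA_FLAGS` (the comment does not name the flag). The helper `with_auto_jit_level` and the starting flag string are hypothetical; the helper rewrites only the auto-JIT level and keeps any other flags.

```python
import os


def with_auto_jit_level(flags: str, level: int) -> str:
    """Return `flags` with --tf_xla_auto_jit set to `level`, keeping other flags."""
    kept = [f for f in flags.split() if f.split("=")[0] != "--tf_xla_auto_jit"]
    return " ".join([f"--tf_xla_auto_jit={level}"] + kept)


# Hypothetical starting value; the demo's real flags are not shown in this thread.
current = "--tf_xla_auto_jit=2 --tf_xla_cpu_global_jit"

# As with the workaround above, set this before importing TensorFlow.
os.environ["TF_XLA_FLAGS"] = with_auto_jit_level(current, 1)

print(os.environ["TF_XLA_FLAGS"])
```

Compared with clearing the variable entirely, this keeps XLA enabled at a more conservative clustering level.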