guyulongcs / CIKM2020_DMT

Deep Multifaceted Transformers for Multi-objective Ranking in Large-Scale E-commerce Recommender Systems, CIKM 2020

Running the model on Linux produces an error #5

Closed. takeawayls closed this issue 3 years ago.

takeawayls commented 3 years ago

When I ran the model, it showed the following message:

nohup: redirecting stderr to stdout

guyulongcs commented 3 years ago

It is not an error but a notification.

takeawayls commented 3 years ago

[2021-01-15 06:29:38] start training
Current epoch num: 1
2021-01-15 06:30:05.498867: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Traceback (most recent call last):
  File "./run_dnn.py", line 911, in <module>
    train(wnd_conf, args['model_ckpt'])
  File "./run_dnn.py", line 325, in train
    train_order_recall_op, train_order_auc_op])
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [2048,783] vs. shape[1] = [0,32]
     [[node DnnModel_3/embedding_trans/concat_18 (defined at /root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Invalid argument: ConcatOp : Dimensions of inputs should match: shape[0] = [2048,783] vs. shape[1] = [0,32]
     [[node DnnModel_3/embedding_trans/concat_18 (defined at /root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[hash_table_Lookup_29/SelectV2/_2089]]
0 successful operations.
3 derived errors ignored.

Original stack trace for u'DnnModel_3/embedding_trans/concat_18':
  File "./run_dnn.py", line 911, in <module>
    train(wnd_conf, args['model_ckpt'])
  File "./run_dnn.py", line 154, in train
    tower_train_logits = inf.inference(tower_batch_features, is_train=True)
  File "/CIKM2020_DMT-master/DMT_code/model/inference_mlp.py", line 118, in inference
    return self.model.inference(inputs,is_train,is_predict)
  File "/CIKM2020_DMT-master/DMT_code/model/net/mmoe_transformer_unbias.py", line 294, in inference
    features = self.embedding_trans(inputs, is_train=is_train)
  File "/CIKM2020_DMT-master/DMT_code/model/net/mmoe_transformer_unbias.py", line 231, in embedding_trans
    features = self.embedding_combiner(inputs)
  File "/CIKM2020_DMT-master/DMT_code/model/net/base.py", line 124, in embedding_combiner
    features = tf.concat(values = [features, avg_embedding], axis=1)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/ops/array_ops.py", line 1420, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/ops/gen_array_ops.py", line 1257, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/root/anaconda3/envs/myconda/lib/python2.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

How can I fix this error?

LinaZhang commented 2 years ago

> It is not an error but a notification.

I get the same error. It is an error: the training process stops and exits.

LinaZhang commented 2 years ago

[2021-10-13 00:29:33] start training
Current epoch num: 1
Traceback (most recent call last):
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [204,783] vs. shape[1] = [0,32]
     [[{{node DnnModel/embedding_trans/concat_18}} = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](DnnModel/embedding_trans/concat_17, DnnModel/embedding_trans/embedding_lookup_sparse_11, DnnModel/gradients/DnnModel/concat_2_grad/mod)]]
     [[{{node Mean_107/_1559}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7123_Mean_107", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./run_dnn.py", line 912, in <module>
    train(wnd_conf, args['model_ckpt'])
  File "./run_dnn.py", line 326, in train
    train_order_recall_op, train_order_auc_op])
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [204,783] vs. shape[1] = [0,32]
     [[node DnnModel/embedding_trans/concat_18 (defined at /notebook/dmtfq/CIKM2020_DMT/DMT_code/model/net/base.py:124) = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](DnnModel/embedding_trans/concat_17, DnnModel/embedding_trans/embedding_lookup_sparse_11, DnnModel/gradients/DnnModel/concat_2_grad/mod)]]
     [[{{node Mean_107/_1559}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7123_Mean_107", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'DnnModel/embedding_trans/concat_18', defined at:
  File "./run_dnn.py", line 912, in <module>
    train(wnd_conf, args['model_ckpt'])
  File "./run_dnn.py", line 154, in train
    tower_train_logits = inf.inference(tower_batch_features, is_train=True)
  File "/notebook/dmtfq/CIKM2020_DMT/DMT_code/model/inference_mlp.py", line 118, in inference
    return self.model.inference(inputs,is_train,is_predict)
  File "/notebook/dmtfq/CIKM2020_DMT/DMT_code/model/net/mmoe_transformer_unbias.py", line 294, in inference
    features = self.embedding_trans(inputs, is_train=is_train)
  File "/notebook/dmtfq/CIKM2020_DMT/DMT_code/model/net/mmoe_transformer_unbias.py", line 231, in embedding_trans
    features = self.embedding_combiner(inputs)
  File "/notebook/dmtfq/CIKM2020_DMT/DMT_code/model/net/base.py", line 124, in embedding_combiner
    features = tf.concat(values = [features, avg_embedding], axis=1)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1124, in concat
    return gen_array_ops.concat_v2(values=values, axis=axis, name=name)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1033, in concat_v2
    "ConcatV2", values=values, axis=axis, name=name)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/opt/conda/envs/Python3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [204,783] vs. shape[1] = [0,32]
     [[node DnnModel/embedding_trans/concat_18 (defined at /notebook/dmtfq/CIKM2020_DMT/DMT_code/model/net/base.py:124) = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](DnnModel/embedding_trans/concat_17, DnnModel/embedding_trans/embedding_lookup_sparse_11, DnnModel/gradients/DnnModel/concat_2_grad/mod)]]
     [[{{node Mean_107/_1559}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7123_Mean_107", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

run duration 193 s
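
Note on reading the error: both tracebacks fail at the same call, tf.concat(values = [features, avg_embedding], axis=1) in base.py line 124. A concat along axis 1 needs every input to have the same size on axis 0, but the second input here has 0 rows while the first has one row per example in the batch. A plausible reading, which nobody in this thread confirms, is that the sparse feature feeding embedding_lookup_sparse_11 yielded no rows for that batch, for example because a column is missing or empty in the input data. The sketch below is plain Python, not code from the repository; concat_shape_problems is a hypothetical helper that only restates the shape rule, applied to the shapes copied from the messages above.

```python
def concat_shape_problems(shapes, axis):
    """List the reasons a concat of inputs with these shapes along `axis` would fail.

    Hypothetical helper for illustration only; it is not part of TensorFlow
    or of the CIKM2020_DMT code. Every dimension except `axis` must match
    the first input.
    """
    problems = []
    reference = shapes[0]
    for i, shape in enumerate(shapes[1:], start=1):
        if len(shape) != len(reference):
            problems.append(
                "input %d has rank %d, expected %d" % (i, len(shape), len(reference))
            )
            continue
        for dim, (got, want) in enumerate(zip(shape, reference)):
            # The concat axis itself may differ between inputs.
            if dim != axis and got != want:
                problems.append(
                    "input %d: dim %d is %d, expected %d" % (i, dim, got, want)
                )
    return problems


# Shapes copied from the first error message in this thread.
problems = concat_shape_problems([(2048, 783), (0, 32)], axis=1)
print(problems)

# For comparison: the same widths with matching row counts would be accepted.
ok = concat_shape_problems([(2048, 783), (2048, 32)], axis=1)
print(ok)
```

If this reading is right, the place to look would be the input feature behind that lookup rather than the concat itself, but that remains an assumption until checked against the actual data and feature configuration.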