keroro824 / HashingDeepLearning

Codebase for "SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems"
MIT License

error in docker: shape requires a multiple of 100 #20

Open lefromage opened 4 years ago

lefromage commented 4 years ago

Steps to reproduce:

docker run -it ottovonxu/slide:v3 bash



WARNING: You are running this container as root, which can cause new files in mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@0506217073cd:/# cd /slide/src/HashingDeepLearning/python_examples

root@0506217073cd:/slide/src/HashingDeepLearning/python_examples# python example_sampled_softmax.py
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_impl.py:1344: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

2020-03-08 20:17:57.007308: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Finished 0 steps. Time elapsed for last 500 batches = 0.00030684471130371094
test_acc: 0.0
#######################
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 85771648 values, but the requested shape requires a multiple of 100
  [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_2, sampled_softmax_loss/concat_1/values_0)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example_sampled_softmax.py", line 116, in <module>
    main()
  File "example_sampled_softmax.py", line 94, in main
    sess.run(train_step, feed_dict={x_idxs:idxs_batch, x_vals:vals_batch, y:labels_batch})
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 85771648 values, but the requested shape requires a multiple of 100
  [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_2, sampled_softmax_loss/concat_1/values_0)]]

Caused by op 'Reshape', defined at:
  File "example_sampled_softmax.py", line 116, in <module>
    main()
  File "example_sampled_softmax.py", line 55, in main
    loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(tf.transpose(W2),b2,tf.reshape(y,[-1,max_label]),layer_1,n_samples,n_classes,remove_accidental_hits=False, num_true=max_label,partition_strategy='div'))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 6113, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 85771648 values, but the requested shape requires a multiple of 100
  [[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_Placeholder_2_0_2, sampled_softmax_loss/concat_1/values_0)]]
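
The arithmetic behind the error message can be checked without TensorFlow. The sketch below is not taken from the repository. It only reuses the two numbers in the log (85771648 values, row width 100), and the helper name `rows_after_reshape` and the list of candidate batch sizes are assumptions made for illustration. It suggests that the fed tensor would be consistent with 128 examples of 670091 values each (670091 appears to match the label count of Amazon-670K), but that reading is an inference and is not confirmed by the log.

```python
def rows_after_reshape(total_values: int, row_width: int) -> int:
    """Return the row count that a reshape to [-1, row_width] would infer.

    Hypothetical helper: it mirrors the divisibility rule that the
    TensorFlow error message describes, nothing more.
    """
    rows, remainder = divmod(total_values, row_width)
    if remainder:
        raise ValueError(
            f"{total_values} values cannot be reshaped to [-1, {row_width}]: "
            f"{remainder} values left over"
        )
    return rows


TOTAL_VALUES = 85771648  # "a tensor with 85771648 values" (from the log)
ROW_WIDTH = 100          # "requires a multiple of 100" (from the log)

# The reshape fails because the value count is not divisible by 100.
leftover = TOTAL_VALUES % ROW_WIDTH

# Assumption: these candidate batch sizes are guesses, not values read
# from example_sampled_softmax.py.
candidate_batch_sizes = [b for b in (100, 128, 256) if TOTAL_VALUES % b == 0]

# If the batch size were 128, each example would carry this many values.
values_per_example = TOTAL_VALUES // 128

if __name__ == "__main__":
    print("leftover values:", leftover)
    print("batch sizes that divide the total:", candidate_batch_sizes)
    print("values per example at batch size 128:", values_per_example)
    try:
        rows_after_reshape(TOTAL_VALUES, ROW_WIDTH)
    except ValueError as err:
        print("reshape would fail:", err)
```

If that reading is right, the placeholder `y` is being fed a dense label matrix while the graph expects at most `max_label` label indices per example, so the mismatch may lie in how the labels are batched and not in the reshape itself.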

xman commented 4 years ago

With batch size = 100 I'm able to run it further, but I get a segmentation fault shortly after that, along with multiple warnings about allocating more than 10% of system memory.
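
A rough size estimate shows why a batch size of 100 gets past the reshape and why memory warnings are plausible. This is a sketch under stated assumptions: the per-example width of 670091 is inferred from the original error (85771648 = 128 × 670091) and is not confirmed by the thread, the 4-byte element size follows from `T=DT_FLOAT` in the log, and the helper name `dense_tensor_bytes` is hypothetical. It does not establish the cause of the segmentation fault.

```python
FLOAT32_BYTES = 4  # DT_FLOAT in the log is a 32-bit float


def dense_tensor_bytes(num_values: int, bytes_per_value: int = FLOAT32_BYTES) -> int:
    """Size in bytes of a dense tensor holding num_values elements."""
    return num_values * bytes_per_value


# Assumption: width inferred from the original error, not read from the code.
VALUES_PER_EXAMPLE = 670091

for batch_size in (100, 128):
    num_values = batch_size * VALUES_PER_EXAMPLE
    divisible = num_values % 100 == 0  # the condition the reshape needs
    size_mib = dense_tensor_bytes(num_values) / 2**20
    print(f"batch={batch_size}: {num_values} values, "
          f"divisible by 100: {divisible}, about {size_mib:.1f} MiB per feed")
```

Under these assumptions a single fed label tensor is several hundred MiB, and any copy of it made during a training step adds the same amount again.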

its-sandy commented 4 years ago

Make sure you have pulled master from the correct remote repo.