rossumai / keras-multi-gpu

Multi-GPU data-parallel training in Keras

Shape [-1] has negative dimensions #1

Open bzamecnik opened 7 years ago

bzamecnik commented 7 years ago

Running on 2 GPUs (GTX 1070):

CUDA_VISIBLE_DEVICES=0,1 python data_parallel_mnist_cnn.py
Train on 60000 samples, validate on 10000 samples
Epoch 1/10
2017-08-10 14:55:47.483599: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 14:55:47.483631: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-10 14:55:48.831409: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1,-1] has negative dimensions
2017-08-10 14:55:48.831460: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1,-1] has negative dimensions
     [[Node: replica_1_1/model_1_target = Placeholder[dtype=DT_FLOAT, shape=[?,?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
2017-08-10 14:55:48.849021: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1,-1] has negative dimensions
2017-08-10 14:55:48.849064: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1,-1] has negative dimensions
     [[Node: replica_0_1/model_1_target = Placeholder[dtype=DT_FLOAT, shape=[?,?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
2017-08-10 14:55:48.865190: W tensorflow/core/framework/op_kernel.cc:1148] Invalid argument: Shape [-1] has negative dimensions
2017-08-10 14:55:48.865233: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Invalid argument: Shape [-1] has negative dimensions
     [[Node: replica_0_1/model_1_sample_weights = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Traceback (most recent call last):
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/Users/bzamecnik/anaconda/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/Users/bzamecnik/anaconda/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape [-1] has negative dimensions
     [[Node: replica_0_1/model_1_sample_weights = Placeholder[dtype=DT_FLOAT, shape=[?], _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
bzamecnik commented 7 years ago

This appears for the placeholders model_1_sample_weights and model_1_target. It seems all other calls to K.placeholder have the shape fully specified, while in these two cases only ndim is given, which results in shape (None,) or (None, None).
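For illustration, a minimal TF 1.x sketch (placeholder names taken from the error messages above) of the two kinds of placeholders:

import tensorflow as tf

# Shape fully specified: only the batch dimension is dynamic.
x = tf.placeholder(tf.float32, shape=(None, 784), name='input')

# Only ndim given, i.e. what K.placeholder(ndim=2) / K.placeholder(ndim=1)
# boil down to: every dimension is dynamic, printed as [-1, -1] / [-1].
t = tf.placeholder(tf.float32, shape=(None, None), name='model_1_target')
w = tf.placeholder(tf.float32, shape=(None,), name='model_1_sample_weights')

print(t.get_shape())  # (?, ?)
print(w.get_shape())  # (?,)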

bzamecnik commented 7 years ago

A hypothesis is that it is caused by a mismatch between the sizes of predictions and targets within each replica (tower). Currently we provide inputs and targets of the full mini-batch size, but extract slices and compute tower predictions of sub-batch size. Since we compute the loss within each tower (in contrast to the baseline make_parallel() solution), the sizes of predictions and targets can differ.

We would also have to slice the targets/sample weights. Another solution would be to perform the sub-batch slicing in Keras and feed the slices to each tower separately. A rough sketch of the first option follows below.
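A rough sketch of the first option (slice_batch is a hypothetical helper for illustration, not from this repo), cutting targets and sample weights along the batch dimension the same way the inputs are cut:

import tensorflow as tf

def slice_batch(x, n_gpus, part):
    # Take the part-th of n_gpus slices along the batch (first) dimension.
    # Uses the dynamic shape, so it works even when the batch size is
    # unknown at graph-construction time.
    batch_size = tf.shape(x)[0]
    size = batch_size // n_gpus
    return x[part * size:(part + 1) * size]

# Within tower i, inputs *and* targets/sample weights would all be cut to
# the same sub-batch, so the per-tower loss compares tensors of equal size:
#   inputs_i  = slice_batch(inputs, n_gpus, i)
#   targets_i = slice_batch(targets, n_gpus, i)
#   weights_i = slice_batch(sample_weights, n_gpus, i)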

bzamecnik commented 7 years ago

It seems that some placeholders are not assigned values in session.run() via feed_dict.

Placeholders:

>>> import tensorflow as tf
>>> g = tf.get_default_graph()
>>> [op for op in g.get_operations() if op.type == 'Placeholder']
[<tf.Operation 'input_1' type=Placeholder>,
 <tf.Operation 'dropout_1/keras_learning_phase' type=Placeholder>,
 <tf.Operation 'replica_0_1/model_1_sample_weights' type=Placeholder>,
 <tf.Operation 'replica_0_1/model_1_target' type=Placeholder>,
 <tf.Operation 'replica_1_1/model_1_sample_weights' type=Placeholder>,
 <tf.Operation 'replica_1_1/model_1_target' type=Placeholder>,
 <tf.Operation 'concatenate_1_sample_weights' type=Placeholder>,
 <tf.Operation 'concatenate_1_target' type=Placeholder>]

The above error is raised when a placeholder with dynamic dimensions (marked as ? or None) is not fed a value. The error from incompatible shapes looks different (see the small experiment below).
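A hedged reconstruction of such a small experiment (TF 1.x; the exact wording of the first error differs across versions):

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None,), name='x')
y = x * 2.0

with tf.Session() as sess:
    # 1) Fetching an op that depends on an unfed dynamic placeholder.
    #    The TF version in the logs above reports "Shape [-1] has negative
    #    dimensions"; later 1.x versions say "You must feed a value for
    #    placeholder tensor 'x'".
    try:
        sess.run(y)
    except tf.errors.InvalidArgumentError as e:
        print('unfed placeholder:', str(e).splitlines()[0])

    # 2) Feeding a value of an incompatible shape fails differently: a
    #    ValueError raised at feed time, before the graph even runs.
    try:
        sess.run(y, feed_dict={x: np.zeros((2, 2))})
    except ValueError as e:
        print('incompatible shape:', e)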

In Model.compile(), placeholders for sample_weights and targets are created. Since we call compile() both for the replicas and for the wrapping model, we create several sets of these placeholders. However, during training we call fit() only on the wrapper model and thus do not feed values to the placeholders in the replica models.
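A minimal sketch (Keras 2 with the TF backend, matching the era of the logs) showing that each compile() call creates its own pair of these placeholders:

import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(784,))
out = Dense(10, activation='softmax')(inp)
model = Model(inp, out)

# compile() creates the '<output_name>_target' and
# '<output_name>_sample_weights' placeholders seen in the listing above.
model.compile(optimizer='sgd', loss='categorical_crossentropy')

g = tf.get_default_graph()
print([op.name for op in g.get_operations() if op.type == 'Placeholder'])

# Compiling the replicas and then the wrapper repeats this per model; fit()
# on the wrapper feeds only the wrapper's own placeholders, leaving the
# replicas' target/sample-weight placeholders unfed.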