pierluigiferrari / ssd_keras

A Keras port of Single Shot MultiBox Detector
Apache License 2.0

"Invalid argument: Index out of range using input dim 0; input has only 0 dims" during ssd300 model training #375

Open jessicametzger opened 3 years ago

jessicametzger commented 3 years ago

I am using ssd_keras with the TensorFlow 1.15 backend (I was originally using TensorFlow 2.2.0 but ran into this issue) and it throws an InvalidArgumentError the moment I start training. The error comes from very deep in the TensorFlow backend and is almost impossible to trace.

Full stack trace

As soon as I call model.fit(...) in the ssd300_training.ipynb tutorial, I get the following very long message:

Epoch 00001: LearningRateScheduler reducing learning rate to 0.001.
Epoch 1/120

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-9-1326232784a4> in <module>
     10                               validation_data=val_generator,
     11                               validation_steps=ceil(val_dataset_size/batch_size),
---> 12                               initial_epoch=initial_epoch)

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    725         max_queue_size=max_queue_size,
    726         workers=workers,
--> 727         use_multiprocessing=use_multiprocessing)
    728 
    729   def evaluate(self,

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_generator.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing)
    601         shuffle=shuffle,
    602         initial_epoch=initial_epoch,
--> 603         steps_name='steps_per_epoch')
    604 
    605   def evaluate(self,

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training_generator.py in model_iteration(model, data, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, validation_freq, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch, mode, batch_size, steps_name, **kwargs)
    263 
    264       is_deferred = not model._is_compiled
--> 265       batch_outs = batch_function(*batch_data)
    266       if not isinstance(batch_outs, list):
    267         batch_outs = [batch_outs]

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight, reset_metrics)
   1015       self._update_sample_weight_modes(sample_weights=sample_weights)
   1016       self._make_train_function()
-> 1017       outputs = self.train_function(ins)  # pylint: disable=not-callable
   1018 
   1019     if reset_metrics:

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py in __call__(self, inputs)
   3474 
   3475     fetched = self._callable_fn(*array_vals,
-> 3476                                 run_metadata=self.run_metadata)
   3477     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   3478     output_structure = nest.pack_sequence_as(

~/anaconda3/envs/tf1gpu/lib/python3.7/site-packages/tensorflow_core/python/client/session.py in __call__(self, *args, **kwargs)
   1470         ret = tf_session.TF_SessionRunCallable(self._session._session,
   1471                                                self._handle, args,
-> 1472                                                run_metadata_ptr)
   1473         if run_metadata:
   1474           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

InternalError: 2 root error(s) found.
  (0) Internal: Dst tensor is not initialized.
     [[{{node loss/conv7_2/kernel/Regularizer/Square/ReadVariableOp}}]]
     [[training/SGD/gradients/gradients/conv1_1/BiasAdd_grad/BiasAddGrad/_545]]
  (1) Internal: Dst tensor is not initialized.
     [[{{node loss/conv7_2/kernel/Regularizer/Square/ReadVariableOp}}]]
0 successful operations.
0 derived errors ignored.

System info

Reproducible example

The error happens whenever I call model.fit(...) or model.fit_generator(...), where model is an SSD300 model and the backend is TF1. It happens whether I am running on CPU or GPU. For example, when I run the ssd300_training.ipynb tutorial, I get that error.
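For reference, this is roughly the call from ssd300_training.ipynb that triggers it (a minimal sketch: model, the generators, the dataset sizes, callbacks, initial_epoch, and final_epoch are assumed to be set up as in the tutorial; batch_size here is a placeholder):

from math import ceil

# Sketch of the tutorial's training call; all names below except batch_size
# are assumed to come from the earlier cells of ssd300_training.ipynb.
batch_size = 32

history = model.fit_generator(generator=train_generator,
                              steps_per_epoch=ceil(train_dataset_size/batch_size),
                              epochs=final_epoch,
                              callbacks=callbacks,
                              validation_data=val_generator,
                              validation_steps=ceil(val_dataset_size/batch_size),
                              initial_epoch=initial_epoch)

The error is raised on the very first batch, before any epoch completes.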

Sorry to open two issues at once. I've been trying to work through both of these for a while but have found no solutions.