intel / unet

U-Net Biomedical Image Segmentation
Apache License 2.0

Training suspends in last batch of the epoch - HDF5 selections.py issue? #19

Closed sjain-stanford closed 4 years ago

sjain-stanford commented 4 years ago

Thanks for this repo. I followed the steps to download and preprocess the data. Training starts, but it fails on the last batch of the first epoch. I'm using TF 1.14 (not 1.15 as prescribed in the README), though I doubt that has anything to do with this error. Could this be an h5py version issue?

[UPDATE]: Before training, I set USE_KERAS_API to False, which forces the code to use tf.keras. Could this be the same issue referred to in settings.py?

Here's the complete error stack:

------------------------------
Fitting model with training data ...
------------------------------
Step 3, training the model started at 2020-06-16 12:51:03.365806
Train on 62930 samples, validate on 4960 samples
Epoch 1/40
62848/62930 [============================>.] - ETA: 1s - loss: 0.7630 - acc: 0.9642 - dice_coef: 0.6290 - soft_dice_coef: 0.2496
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/h5py/_hl/selections.py in select(shape, args, dsid)
     84             try:
---> 85                 int(a)
     86                 if isinstance(a, np.ndarray) and a.shape == (1,):

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
    345           else:
--> 346             ins_batch = slice_arrays(ins, batch_ids)
    347         except TypeError:

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/utils/generic_utils.py in slice_arrays(arrays, start, stop)
    530         start = start.tolist()
--> 531       return [None if x is None else x[start] for x in arrays]
    532     else:

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/utils/generic_utils.py in <listcomp>(.0)
    530         start = start.tolist()
--> 531       return [None if x is None else x[start] for x in arrays]
    532     else:

/data/unet-intelai/2D/data.py in __getitem__(self, key)
    158         """
--> 159         data = super().__getitem__(key)
    160         self.idx += 1

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/utils/io_utils.py in __getitem__(self, key)
    113     else:
--> 114       return self.data[idx]
    115 

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/h5py/_hl/dataset.py in __getitem__(self, args)
    552         # Perform the dataspace selection.
--> 553         selection = sel.select(self.shape, args, dsid=self.id)
    554 

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/h5py/_hl/selections.py in select(shape, args, dsid)
     89                 sel = FancySelection(shape)
---> 90                 sel[args]
     91                 return sel

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/h5py/_hl/selections.py in __getitem__(self, args)
    366                     if any(fst >= snd for fst, snd in adjacent):
--> 367                         raise TypeError("Indexing elements must be in increasing order")
    368 

TypeError: Indexing elements must be in increasing order

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-41921214fca5> in <module>
     23               validation_data=(imgs_validation, msks_validation),
     24               verbose=1, shuffle="batch",
---> 25               callbacks=model_callbacks)
     26 
     27 print("Total time elapsed for training = {} seconds".format(time.time() - start_time))

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    778           validation_steps=validation_steps,
    779           validation_freq=validation_freq,
--> 780           steps_name='steps_per_epoch')
    781 
    782   def evaluate(self,

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
    407           validation_in_fit=True,
    408           prepared_feed_values_from_dataset=(val_iterator is not None),
--> 409           steps_name='validation_steps')
    410       if not isinstance(val_results, list):
    411         val_results = [val_results]

/scratch/sambhavj/anaconda3/envs/tf1.14/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_arrays.py in model_iteration(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps, validation_freq, mode, validation_in_fit, prepared_feed_values_from_dataset, steps_name, **kwargs)
    346             ins_batch = slice_arrays(ins, batch_ids)
    347         except TypeError:
--> 348           raise TypeError('TypeError while preparing batch. '
    349                           'If using HDF5 input data, '
    350                           'pass shuffle="batch".')

TypeError: TypeError while preparing batch. If using HDF5 input data, pass shuffle="batch".

sjain-stanford commented 4 years ago

UPDATE: Switching to standalone Keras (the default setting, USE_KERAS_API=True) avoids this HDF5 issue, which I only hit when using tf.keras.
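
For anyone who has to stay on tf.keras: the innermost error in the stack above is h5py's "Indexing elements must be in increasing order", and the final exception is raised from the nested model_iteration call with steps_name='validation_steps'. That suggests the unsorted index list may come from the validation pass at the end of the epoch rather than from the training batches, though this is not confirmed here. Below is a minimal sketch of one possible workaround: sort the indices before the read, then put the rows back in the requested order. It is not the fix used in this repo. `StrictDataset` and `read_in_any_order` are hypothetical names, and `StrictDataset` is only a pure-Python stand-in that mimics the increasing-order check so the example runs without an HDF5 file.

```python
class StrictDataset:
    """Hypothetical stand-in for an HDF5 dataset.

    Like the h5py check in the traceback, it rejects an index list
    that is not strictly increasing.
    """

    def __init__(self, rows):
        self._rows = list(rows)

    def __getitem__(self, key):
        if isinstance(key, list):
            if any(a >= b for a, b in zip(key, key[1:])):
                raise TypeError("Indexing elements must be in increasing order")
            return [self._rows[i] for i in key]
        return self._rows[key]


def read_in_any_order(dataset, indices):
    """Read rows for unique `indices` given in any order.

    Performs a single lookup with sorted indices, then restores
    the order the caller asked for.
    """
    # Positions of `indices`, ordered by the index value they hold.
    order = sorted(range(len(indices)), key=lambda pos: indices[pos])
    sorted_rows = dataset[[indices[pos] for pos in order]]

    # Scatter the rows back to their requested positions.
    rows = [None] * len(indices)
    for pos, row in zip(order, sorted_rows):
        rows[pos] = row
    return rows


data = StrictDataset(["row%d" % i for i in range(10)])
shuffled = [7, 2, 9, 4]  # what a fully shuffled batch might look like
batch = read_in_any_order(data, shuffled)
print(batch)
```

With a real h5py dataset the same idea would live in the `__getitem__` override in data.py, but that is an assumption about where to hook it in. The sketch also assumes the indices are unique, since the check shown in the traceback (`fst >= snd`) rejects repeated indices as well.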

ravi9 commented 4 years ago

Glad you found a solution, @sjain-stanford. Closing this issue as resolved.