Open HGS-mbayer opened 6 days ago
Thanks for the report.
This issue appears to have been introduced in https://github.com/keras-team/keras/commit/fd8bbe2284f1ddfbc2578fce9cc5b2af35b7c927
@hertschuh can you take a look? I started debugging it, and here's my reading: the following code
    except queue.Empty:
        pass
is reached and leads to an infinite loop. That's because we never get to the exit condition:
    if i >= num_batches - 1:
        self.enqueuer.stop()
        return
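which in turn happens because num_batches returns a (correct) number that is larger than the number of batches that can actually be drawn during the first epoch. Once the queue is drained, every further read raises queue.Empty, i stops advancing, and the exit condition can never become true.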
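To make the failure mode above concrete, here is a minimal stand-alone sketch of the same loop shape, using only the standard library. It is not the Keras implementation: the function name `drain` and the `max_empty_polls` guard are hypothetical and exist only for illustration. With `max_empty_polls=None` the loop behaves like the one described above and spins forever when fewer than `num_batches` items ever arrive; the guard is just one possible way to bail out, and is not necessarily what the workaround at HEAD does.

```python
import queue


def drain(q, num_batches, timeout=0.01, max_empty_polls=None):
    """Yield items from `q` until `num_batches` items have been yielded.

    Simplified model of the consumer loop discussed in this thread.
    `max_empty_polls` is a hypothetical guard: if set, the generator
    gives up after that many consecutive empty reads.
    """
    i = 0
    empty_polls = 0
    while True:
        try:
            item = q.get(timeout=timeout)
        except queue.Empty:
            # Without the guard below, this branch is hit over and over
            # once the producer has run dry, and `i` never advances.
            empty_polls += 1
            if max_empty_polls is not None and empty_polls >= max_empty_polls:
                return
            continue
        empty_polls = 0
        yield item
        # Exit condition: only reachable if enough items were produced.
        if i >= num_batches - 1:
            return
        i += 1


if __name__ == "__main__":
    q = queue.Queue()
    for n in range(3):  # only 3 batches are available...
        q.put(n)
    # ...but the consumer expects 5; the guard lets it stop anyway.
    print(list(drain(q, num_batches=5, max_empty_polls=3)))
```

Calling `drain(q, num_batches=5)` on the same three-item queue without the guard never returns, which matches the hang at the end of the first epoch.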
I added a workaround at HEAD so that training continues when the issue occurs. It's not a definitive solution, but it should help.
Training using a PyDataset and workers > 1 will hang at the end of the first epoch with Keras 3.6. This issue does not seem to occur with Keras 3.5.
Example Code
Here is a slightly modified version of https://keras.io/examples/vision/mnist_convnet/ to reproduce the issue.
Traceback
Here is the traceback I receive when interrupting the process.