Open dryglicki opened 1 week ago
Please disable traceback filtering (keras.config.disable_traceback_filtering()) and rerun. I'd like to see where exactly this fails.
@fchollet Will do. My response time will be a little slow, since we've been affected by Milton and had to evacuate. Just submitted the job on the HPC.
@fchollet I finally have error output, after two days of sitting in the queue + hurricane travel. It is still failing with the same error: dict type conversion. I put K.config.disable_traceback_filtering() at the top after the imports but before def main(), if that matters at all.
Hello. I wonder if I've stumbled on another corner case. Reproducing this in code is going to be challenging for me, but I may as well give it a shot with enough pieces.
Versions
Keras: 3.5.0
TensorFlow: 2.17.0
Environment
Slurm HPC
Situation
I am trying to use MirroredStrategy and MultiWorkerMirroredStrategy for parallel runs on an HPC that uses Slurm as the scheduler. The if-block that decides the strategy looks like this:
I use a PyDataset class to get the data into the model. The return from that class is:
[inputs] is a dictionary, and this is where I'm running into trouble. I want to be clear about this: single node, multi-GPU with MirroredStrategy works just fine; serial works just fine. In the Slurm submission script, the job is being run like this:
This job will fail with this error (that I've truncated):
The function itself looks like this, if I'm tracing back correctly:
So what appears to be happening is that I have a nested dictionary here, and MultiWorkerMirroredStrategy is adding PerReplica as a container.
I know you've all said that you aren't supporting nested dictionaries or lists (I can't recall the specifics), but what am I supposed to do here?
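One possible workaround to try, sketched here as an assumption and not as a confirmed fix for the MultiWorkerMirroredStrategy failure, is to flatten the nested dictionary into a single-level dictionary before the PyDataset returns it. The helper name flatten_dict, the "/" separator, and the sample keys below are all hypothetical and only illustrate the idea:

```python
def flatten_dict(nested, parent_key="", sep="/"):
    """Collapse a nested dict into one level, joining keys with `sep`.

    Hypothetical helper for illustration; not part of Keras or TensorFlow.
    """
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, carrying the joined key prefix.
            flat.update(flatten_dict(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up stand-in for one batch of inputs (plain lists instead of arrays).
batch_inputs = {"images": {"ir": [1, 2], "vis": [3, 4]}, "scalars": [5.0]}
flat_inputs = flatten_dict(batch_inputs)
```

With a flat dictionary, the model's inputs would presumably need names that match the flattened keys. Whether this avoids the PerReplica conversion error has not been verified here.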
As an addendum, if this is a TF issue and not a Keras issue, please let me know.
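For readers trying to reproduce the setup described under Situation, here is a minimal sketch of how a strategy could be chosen under Slurm. It is not the original if-block. The helper names pick_strategy_name and build_strategy, the rule based on the SLURM_JOB_NUM_NODES environment variable, and the use of SlurmClusterResolver are assumptions made for illustration:

```python
import os


def pick_strategy_name(env=None, num_local_gpus=1):
    """Return the name of the tf.distribute strategy to use.

    Hypothetical rule (an assumption, not the original code):
    more than one Slurm node -> MultiWorkerMirroredStrategy,
    one node with several GPUs -> MirroredStrategy,
    otherwise the default (serial) strategy.
    """
    env = os.environ if env is None else env
    num_nodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    if num_nodes > 1:
        return "MultiWorkerMirroredStrategy"
    if num_local_gpus > 1:
        return "MirroredStrategy"
    return "default"


def build_strategy(name):
    """Instantiate the chosen strategy; TensorFlow is imported lazily."""
    import tensorflow as tf

    if name == "MultiWorkerMirroredStrategy":
        # Reads the cluster layout from Slurm environment variables.
        resolver = tf.distribute.cluster_resolver.SlurmClusterResolver()
        return tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)
    if name == "MirroredStrategy":
        return tf.distribute.MirroredStrategy()
    return tf.distribute.get_strategy()
```

Keeping the decision separate from the TensorFlow calls makes the branch that a given job takes easy to log and check before any strategy is constructed.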