PFNet with TF multi-GPU works with 2 and 5 GPUs, but not with 4

This works fine:

CUDA_VISIBLE_DEVICES=5,6,7,8,9 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train

...

Model: "pf_net"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_encoding (InputEncodin multiple                  0
_________________________________________________________________
sparse_hashed_nn_distance (S multiple                  17825
_________________________________________________________________
gnn_id (EncoderDecoderGNN)   multiple                  2375680
_________________________________________________________________
sequential_2 (Sequential)    (5, 6400, 8)              805384
_________________________________________________________________
sequential_3 (Sequential)    (5, 6400, 1)              801793
_________________________________________________________________
gnn_reg (EncoderDecoderGNN)  multiple                  2379776
_________________________________________________________________
sequential_4 (Sequential)    (5, 6400, 5)              807941
=================================================================
Total params: 7,188,399
Trainable params: 7,185,199
Non-trainable params: 3,200
________________________________
Epoch 1/500
 258/3200 [=>............................] - ETA: 19:18 - loss: 92.0839 - charge_loss: 15.1783 - cls_loss: 29.7701 - cos_phi_loss: 14.1853 - energy_loss: 26.3809 - eta_loss: 51.6222 - pt_loss: 3.7117 - sin_phi_loss: 11.3279 - cls_acc_unweighted: 0.7688

and so does this:

CUDA_VISIBLE_DEVICES=5,6 singularity exec --nv /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train
...
Epoch 1/500
  31/8000 [..............................] - ETA: 1:03:18 - loss: 124.7493 - charge_loss: 16.4039 - cls_loss: 42.0418 - cos_phi_loss: 17.1479 - energy_loss: 32.1395 - eta_loss: 130.0834 - pt_loss: 7.7385 - sin_phi_loss: 11.0048 - cls_acc_unweighted: 0.6926

while this doesn't:

CUDA_VISIBLE_DEVICES=5,6,7,8 singularity exec --nv -B /home /home/software/singularity/base.simg:latest python3 mlpf/launcher.py --model-spec parameters/cms-gnn-skipconn-v2.yaml --action train

...

Traceback (most recent call last):
  File "mlpf/launcher.py", line 26, in <module>
    main(args, yaml_path, config)
  File "/home/joosep/particleflow/mlpf/tfmodel/model_setup.py", line 755, in main
    fit_result = model.fit(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 5 root error(s) found.
  (0) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_2/pf_net/sparse_hashed_nn_distance/map/while/body/_1664/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/cond/_8452/replica_2/pf_net/sparse_hashed_nn_distance/map/while/map/while/Less_1/_823]]
  (1) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_864]]
  (2) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[replica_1/pf_net/sparse_hashed_nn_distance/SparseTensor/dense_shape/_863]]
  (3) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
         [[pf_net/gnn_reg/StatefulPartitionedCall/conv_reg1/map/while/body/_3801/conv_reg1/map/while/SparseReshape/_2974]]
  (4) Invalid argument:  Tried to stack elements of an empty list with non-fully-defined element_shape: [?,512]
         [[{{node replica_3/pf_net/gnn_id/StatefulPartitionedCall/conv_id0/map/TensorArrayV2Stack/TensorListStack}}]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_335632]

Function call stack:
train_function -> train_function -> train_function -> train_function -> train_function
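
The message mentions an empty list on `replica_3`, which would be consistent with one replica receiving a zero-sized batch, so that the `map` inside `conv_id0` has nothing to stack. Below is a minimal sketch of how that could happen. It assumes the global batch is cut into chunks of `ceil(global / num_replicas)` elements, which is not confirmed against the TensorFlow source or this repo's config. `split_batch` and `empty_replicas` are hypothetical helpers, not TensorFlow APIs, and the global batch size of 5 is only illustrative (it is the leading dimension printed in the 5-GPU model summary; the batch size actually used in the 4-GPU run is not visible in the logs).

```python
import math


def split_batch(global_batch_size, num_replicas):
    """Hypothetical helper: per-replica batch sizes for one global batch.

    Assumption: replicas are filled in order with up to
    ceil(global_batch_size / num_replicas) elements each, so trailing
    replicas can end up with fewer elements, or with none at all.
    """
    per_replica = math.ceil(global_batch_size / num_replicas)
    sizes = []
    remaining = global_batch_size
    for _ in range(num_replicas):
        take = min(per_replica, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def empty_replicas(global_batch_size, num_replicas):
    """Indices of replicas that would receive zero elements."""
    sizes = split_batch(global_batch_size, num_replicas)
    return [i for i, n in enumerate(sizes) if n == 0]


if __name__ == "__main__":
    # Illustrative global batch size of 5 (an assumption, see above).
    for n_gpus in (2, 4, 5):
        print(n_gpus, split_batch(5, n_gpus), empty_replicas(5, n_gpus))
    # 2 [3, 2] []
    # 4 [2, 2, 1, 0] [3]
    # 5 [1, 1, 1, 1, 1] []
```

If something like this is going on, the things to check would be whether the global batch size is a multiple of the number of visible GPUs, and whether incomplete batches are dropped (for example with `drop_remainder=True` when batching the `tf.data` dataset).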

jpata/particleflow, issue #68