tyagi-iiitv / PointPillars


Multi GPU support #20

Closed ma7555 closed 3 years ago

ma7555 commented 3 years ago

Using MirroredStrategy for distributed training results in the following error:

File "C:\Users\***\anaconda3\envs\tf\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1619, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: The outer 2 dimensions of indices.shape=[2,12000,3] must match the outer 2 dimensions of updates.shape=[1,12000,64]: Dimension 0 in both shapes must be equal, but are 2 and 1. Shapes are [2,12000] and [1,12000]. for 'pillars/scatter_nd/ScatterNd' (op: 'ScatterNd') with input shapes: [2,12000,3], [1,12000,64], [4].
ma7555 commented 3 years ago

Issue explained:

The file network.py hardcodes batch_size into the correct_batch_indices function: https://github.com/tyagi-iiitv/PointPillars/blob/cc0c4be0ca0bdd481c809673305a69ef116b02c4/network.py#L27

This results in the wrong dimensionality during distributed training, because batch_size is actually divided by the number of GPUs (replicas) during .fit(), as sketched below.
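
For illustration, a minimal sketch of that split (the two-GPU numbers are assumptions chosen to mirror the shapes in the error above; MirroredStrategy and num_replicas_in_sync are standard TensorFlow APIs):

    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()  # suppose two GPUs are visible

    # model.fit() splits the global batch across replicas, so each replica
    # receives tensors whose leading dimension is the per-replica batch.
    global_batch_size = 2
    per_replica_batch = global_batch_size // strategy.num_replicas_in_sync
    print(per_replica_batch)  # 1 on a two-GPU machine

    # correct_batch_indices, however, still builds indices for the full
    # global batch (leading dimension 2), hence the [2,12000,3] vs
    # [1,12000,64] ScatterNd mismatch in the traceback above.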

I have been thinking about changes to this function for a while, but nothing has worked. This is what I tried:

    def correct_batch_indices(tensor):
        # per-replica batch size, taken from the tensor instead of hardcoded
        seq = tf.range(tf.shape(tensor)[0])
        # offset tensor built via sliced assignment on a Variable
        array = tf.Variable(lambda: tf.zeros_like(tensor))
        array = array[seq, :, 0].assign(seq)
        return tf.math.add(tensor, array)

Using a tf.Variable inside a lambda is a bad idea; if you can suggest something better, let me know.
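
One variable-free alternative (a sketch of my own, not necessarily the code that landed in the eventual fix) is to derive the batch dimension dynamically with tf.shape and build the offset with tile/stack instead of sliced Variable assignment:

    import tensorflow as tf

    def correct_batch_indices(tensor):
        # tensor: [batch, max_pillars, 3] integer scatter indices whose
        # first channel should carry the (per-replica) batch index
        batch = tf.shape(tensor)[0]
        pillars = tf.shape(tensor)[1]
        # [batch, pillars] grid in which row i is filled with the value i
        batch_idx = tf.tile(
            tf.reshape(tf.range(batch, dtype=tensor.dtype), (-1, 1)),
            multiples=tf.stack([1, pillars]))
        zeros = tf.zeros_like(batch_idx)
        # offset only the first index channel; leave the other two as-is
        correction = tf.stack([batch_idx, zeros, zeros], axis=-1)
        return tensor + correction

Because tf.shape is evaluated per replica inside the strategy, the offsets track the per-replica batch automatically, with no Variable involved.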

ma7555 commented 3 years ago

Fixed for network.py; I will need to look at the generator tomorrow too.

ma7555 commented 3 years ago

PR with the fix: https://github.com/tyagi-iiitv/PointPillars/pull/25