fast vector operation for pillar scatter

This PR introduces fast vectorized operations for the pillar scatter module. The for loop in the existing module makes the model's forward pass very slow, especially with larger batch sizes.
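To illustrate the idea, here is a minimal sketch of a per-sample loop and a vectorized equivalent. It is not the code from this PR. The function names `scatter_loop` and `scatter_vectorized` are hypothetical, and the sketch assumes a `(N, 4)` coordinate tensor laid out as `[batch_idx, z, y, x]` with a single z layer and at most one pillar per cell.

```python
import torch


def scatter_loop(pillar_features, coords, batch_size, ny, nx):
    """Baseline sketch: fill one BEV canvas per sample inside a Python loop."""
    num_channels = pillar_features.shape[1]
    canvases = []
    for b in range(batch_size):
        canvas = torch.zeros(num_channels, ny * nx,
                             dtype=pillar_features.dtype,
                             device=pillar_features.device)
        mask = coords[:, 0] == b
        # Flat cell index within one sample: y * nx + x
        idx = (coords[mask, 2] * nx + coords[mask, 3]).long()
        canvas[:, idx] = pillar_features[mask].t()
        canvases.append(canvas)
    return torch.stack(canvases, dim=0).view(batch_size, num_channels, ny, nx)


def scatter_vectorized(pillar_features, coords, batch_size, ny, nx):
    """Vectorized sketch: one indexed write covers every sample in the batch."""
    num_channels = pillar_features.shape[1]
    canvas = torch.zeros(batch_size * ny * nx, num_channels,
                         dtype=pillar_features.dtype,
                         device=pillar_features.device)
    # Flat cell index across the whole batch: b * (ny * nx) + y * nx + x
    flat_idx = (coords[:, 0] * (ny * nx) + coords[:, 2] * nx + coords[:, 3]).long()
    canvas[flat_idx] = pillar_features
    # (B * ny * nx, C) -> (B, ny, nx, C) -> (B, C, ny, nx)
    return canvas.view(batch_size, ny, nx, num_channels).permute(0, 3, 1, 2).contiguous()
```

Both functions return a `(batch_size, C, ny, nx)` tensor, so an equality check such as `torch.equal(scatter_loop(...), scatter_vectorized(...))` on the same inputs is one way to compare outputs before and after a change like this.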

The new code has been tested to confirm that it produces the same outputs as the old code; the test is included in this PR.

Latency experiments:

Forward time as a function of batch_size
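A measurement of this kind can be sketched as follows. This is a hypothetical helper, not the script used for the numbers in this PR; `time_forward`, `make_inputs`, and `repeats` are illustrative names. It times any callable over a list of batch sizes and synchronizes CUDA, when available, so that asynchronous kernels are included in the measurement.

```python
import time

import torch


def time_forward(fn, make_inputs, batch_sizes, repeats=10):
    """Return {batch_size: mean seconds per call} for fn(*make_inputs(batch_size))."""
    results = {}
    for batch_size in batch_sizes:
        inputs = make_inputs(batch_size)
        fn(*inputs)  # warm-up call, excluded from the timing
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*inputs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        results[batch_size] = (time.perf_counter() - start) / repeats
    return results
```

Here `fn` would wrap the scatter module's forward pass and `make_inputs` would build pillar features and coordinates for the given batch size.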

Overall training time before and after the change:

Before -> average training time per iteration ~2.9 s

After -> average training time per iteration ~1.9 s

With this PR, training time per iteration drops by about 35% (for my set of parameters and dataset).

open-mmlab / OpenPCDet

fast vector operation for pillar scatter #1676