@deepakn94 Do you have any suggestions on how to fix the hang issue when using two data-parallel groups with 4 GPUs and 3 GPUs?
I checked the code, and it seems that all data from the 4 GPUs is sent to only one of the 3 GPUs (I guess this is because `self.tensor_tags` can only store one tag per input/output node), e.g., here
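To illustrate what I mean, here is a rough sketch of what I suspect is happening (this is not the actual code; the names and the round-robin fix are hypothetical):

```python
# Sketch of the suspected problem with an uneven 4-GPU -> 3-GPU configuration:
# a mapping keyed only on the tensor name collapses all transfers onto one
# destination rank. Names below are hypothetical, not taken from the repo.

tensor_tags = {"out": 0}          # single tag stored for the "out" tensor
dest_ranks  = {"out": 4}          # so every sender targets the same rank (4)

senders   = [0, 1, 2, 3]          # 4 GPUs in the first data-parallel group
receivers = [4, 5, 6]             # 3 GPUs in the second data-parallel group

print("current:", {s: dest_ranks["out"] for s in senders})
# current: {0: 4, 1: 4, 2: 4, 3: 4} -> ranks 5 and 6 never receive anything,
#                                      so they block on recv and the run hangs.

# What I imagine a fix would need: a per-(tensor, sender) mapping so the 4
# senders are spread over the 3 receivers (e.g. round-robin).
dest_ranks_fixed = {("out", s): receivers[i % len(receivers)]
                    for i, s in enumerate(senders)}
print("fixed:  ", {s: dest_ranks_fixed[("out", s)] for s in senders})
# fixed:   {0: 4, 1: 5, 2: 6, 3: 4} -> every receiver gets at least one sender.
```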
I also noticed a comment that says "TODO: don't current support uneven configurations." here