trpc.rpc_sync consumed most time

We used the ViT model from the examples to test performance.

Test env: V100-32G, batch size = 128
With either 1 GPU or TP=2, we broke down the time from receiving a request to returning the result, and found that rpc_sync from worker to master accounts for 95% of the whole process. First orange: master sends data to worker (13 ms). Second orange: worker sends result to master (~423 ms).

The 423 ms is measured from just before trpc.rpc_sync(self.dest, rpc_queue_put, args=(self.remote_queue, data)) in def send of pipe.py to the first line of def rpc_queue_put(q: trpc.RRef, data: Any). In other words, it is spent entirely inside trpc.rpc_sync.
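For anyone who wants to reproduce the measurement, here is a minimal sketch of the timing method described above: take a timestamp just before the RPC call and compare it with the time at the first line of the callee. It is not the EnergonAI code. fake_rpc_sync is a hypothetical stand-in for trpc.rpc_sync with a simulated 50 ms delay so the sketch runs without a cluster, the extra sent_at argument and the timings dict are assumptions added for illustration, and the remote queue is a plain list. Across machines, the wall-clock difference is only meaningful if the clocks are synchronized.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Collected measurements, in seconds (hypothetical helper, not part of EnergonAI).
timings: Dict[str, float] = {}


def fake_rpc_sync(dest: str, func: Callable[..., Any], args: Tuple = ()) -> Any:
    """Hypothetical stand-in for trpc.rpc_sync: waits, then runs func locally."""
    time.sleep(0.05)  # simulated transport/serialization delay
    return func(*args)


def rpc_queue_put(q: List[Any], data: Any, sent_at: float) -> None:
    # First line of the callee: time elapsed since the caller's timestamp.
    timings["call_to_callee_s"] = time.time() - sent_at
    q.append(data)


def send(dest: str, remote_queue: List[Any], data: Any) -> None:
    # Timestamp taken just before the RPC call, as in the measurement above.
    sent_at = time.time()
    fake_rpc_sync(dest, rpc_queue_put, args=(remote_queue, data, sent_at))
    # Full round trip, including the return of the (empty) result.
    timings["round_trip_s"] = time.time() - sent_at


if __name__ == "__main__":
    queue: List[Any] = []
    send("master", queue, b"x" * 1024)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

With the real trpc.rpc_sync in place of the stand-in, call_to_callee_s would correspond to the 423 ms span reported here.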
