We used the ViT model from the examples to test performance.
Test env: V100-32G, batch size = 128
With either 1 GPU or TP=2, we broke down the time from receiving a request to returning the result and found that the rpc_sync call from worker to master accounts for 95% of the whole process.
First orange: master sends data to worker (13 ms)
Second orange: worker sends result to master (~423 ms)
The 423 ms is measured from just before the trpc.rpc_sync(self.dest, rpc_queue_put, args=(self.remote_queue, data)) call in def send in pipe.py to the first line of def rpc_queue_put(q: trpc.RRef, data: Any), so it covers only trpc.rpc_sync itself.
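For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of the approach, not the actual code in pipe.py. The names timed_put, measure, and local_call are hypothetical. It assumes sender and receiver run on the same host, so their wall clocks are comparable. The transport is passed in as a callable so the sketch runs without any RPC setup.

```python
import time
from typing import Any, Callable, Tuple


def timed_put(sent_at: float, data: Any) -> float:
    """Hypothetical stand-in for rpc_queue_put: the first line records arrival."""
    arrived_at = time.time()  # wall clock, comparable across processes on one host
    # The real function would put `data` on the remote queue here.
    return arrived_at - sent_at


def measure(call: Callable[..., float], data: Any) -> Tuple[float, float]:
    """Return (one_way_seconds, round_trip_seconds) for a single call."""
    start = time.time()  # taken just before the call, as in the report
    one_way = call(timed_put, args=(start, data))
    round_trip = time.time() - start
    return one_way, round_trip


def local_call(fn: Callable[..., float], args: tuple = ()) -> float:
    """Local stand-in transport that simply invokes fn in-process."""
    return fn(*args)


one_way, round_trip = measure(local_call, b"x" * 1024)
print(f"one-way: {one_way * 1e3:.3f} ms, round trip: {round_trip * 1e3:.3f} ms")
```

To time the real path, local_call could be swapped for a wrapper such as lambda fn, args: trpc.rpc_sync(dest, fn, args=args), where dest is the destination worker name (an assumption about your setup). The one-way figure then corresponds to the span described above, from just before rpc_sync to the first line of the callee.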