I have deployed a model with mini-batch: 2, maxi-batch: 2, and cpu-threads: 8. When running a performance test, I observed that a single sentence of 60 tokens took 1.7 seconds to translate, while 2 sentences took 2.4 seconds; each additional sentence added roughly 0.1 seconds. Why is translating 2 sentences so much slower than translating 1? How are sentences distributed across the CPU cores? And is it reasonable to assume that translating a batch should take about the same time as translating the longest sentence in the batch?
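For reference, this is a sketch of the relevant decoder settings as I understand them (assuming a Marian-style YAML config; the surrounding options are omitted):

```yaml
# Assumed decoder configuration (sketch, other options omitted)
mini-batch: 2    # sentences translated together in one forward pass
maxi-batch: 2    # sentences read ahead and sorted before batching
cpu-threads: 8   # CPU worker threads available to the decoder
```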