microsoft / DeepSpeed-MII

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

How do GPT2/Bert models utilize the continuous batching feature in MII? #417

Open Jye-525 opened 7 months ago

Jye-525 commented 7 months ago

Hi,

I am new to DeepSpeed-MII. While reading the code, I found that GPT2/Bert models are supported via the legacy API, which uses the DeepSpeed InferenceEngine rather than InferenceEngineV2.

In that case, how do GPT2/Bert models utilize the new features provided by MII, such as continuous batching and the blocked KV cache?
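
For reference, this is the legacy-API path I am referring to. A minimal sketch based on the legacy MII examples; the exact signatures (`mii.deploy`, `mii.mii_query_handle`, `mii.terminate`) and the query format are my assumptions from those examples, and in recent releases the legacy API may live under `mii.legacy`:

```python
import mii

# Legacy MII deployment path: models like GPT2/Bert are served through
# the original DeepSpeed InferenceEngine, not InferenceEngineV2.
mii.deploy(
    task="text-generation",
    model="gpt2",
    deployment_name="gpt2_deployment",  # hypothetical deployment name
)

# Query the deployment through the legacy handle.
generator = mii.mii_query_handle("gpt2_deployment")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=30)
print(result)

# Tear down the deployment when finished.
mii.terminate("gpt2_deployment")
```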

mrwyattii commented 7 months ago

Hi @Jye-525, we do not support continuous batching and the blocked KV cache with GPT2/Bert models at this time. The FastGen features described in our docs are only supported for the text-generation models listed here.
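
For comparison, a minimal sketch of the FastGen (InferenceEngineV2) path that does get continuous batching and the blocked KV cache, assuming a model from the supported text-generation list (the Llama-2 checkpoint here is just an example, and the `generated_text` attribute follows the MII README examples):

```python
import mii

# FastGen path: supported text-generation models are scheduled with
# continuous batching and stored in the blocked KV cache automatically.
pipe = mii.pipeline("meta-llama/Llama-2-7b-hf")
responses = pipe(["DeepSpeed is", "Seattle is"], max_new_tokens=30)
for r in responses:
    print(r.generated_text)
```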