I am new to DeepSpeed-MII. Reading the code, I found that GPT2/Bert are supported via the legacy API, which uses the DeepSpeed InferenceEngine rather than InferenceEngineV2.
In that case, how do the GPT2/Bert models make use of the new features provided by MII, such as continuous batching and the blocked KV cache?
Hi @Jye-525, we do not support continuous batching or the blocked KV cache with GPT2/Bert models at this time. The FastGen features described in our docs are only supported for the text-generation models listed here.
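For anyone landing here later, a minimal sketch of the two code paths may help. This assumes the API shapes shown in the DeepSpeed-MII README (`mii.pipeline` for FastGen, `mii.deploy`/`mii.mii_query_handle` for the legacy API); exact arguments may differ by version:

```python
import mii

# FastGen path (InferenceEngineV2): continuous batching + blocked KV cache.
# Only available for the supported text-generation architectures
# (e.g. Llama, Mistral), not GPT2/Bert.
pipe = mii.pipeline("mistralai/Mistral-7B-v0.1")
print(pipe(["DeepSpeed is"], max_new_tokens=64))

# Legacy path (InferenceEngine): how GPT2 is served today.
# No continuous batching or blocked KV cache on this path.
mii.deploy(task="text-generation",
           model="gpt2",
           deployment_name="gpt2_deployment")
generator = mii.mii_query_handle("gpt2_deployment")
print(generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64))
```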