I am instantiating the `LLM` class for local inference. I noticed that when an OOM error occurs inside `vllm.LLM.llm_engine.step()` and I catch it, the previous requests are not aborted and interfere with my next call to `LLM.generate`. What is the proper way to recover from OOM errors during inference?
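For reference, here is roughly what I am attempting as a recovery path. This is only a sketch: using `LLMEngine.abort_request` plus `torch.cuda.empty_cache()` to clean up after the OOM is my own guess, not something I have confirmed is the supported approach, and the model name is just a placeholder.

```python
import torch
from vllm import EngineArgs, LLMEngine, SamplingParams

# Build the low-level engine directly so the request ids are under my control.
engine = LLMEngine.from_engine_args(EngineArgs(model="facebook/opt-125m"))

prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(max_tokens=64)

# Track the ids of every request I add so they can be aborted on failure.
request_ids = []
for i, prompt in enumerate(prompts):
    request_id = str(i)
    engine.add_request(request_id, prompt, params)
    request_ids.append(request_id)

outputs = []
try:
    while engine.has_unfinished_requests():
        for output in engine.step():
            if output.finished:
                outputs.append(output)
except torch.cuda.OutOfMemoryError:
    # Attempted recovery: abort everything still queued/running so the next
    # generate call starts from a clean scheduler state, then release cache.
    engine.abort_request(request_ids)  # assumption: accepts an iterable of ids
    torch.cuda.empty_cache()
```

Is explicitly aborting and retrying like this the intended recovery path, or is there a better way to reset the engine after an OOM?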