Closed SevenMpp closed 10 months ago
Since sse-starlette is not actively involved in memory management, I do not think your issue is related to sse-starlette. However, if you have reason to disagree, please re-open the issue with your reasoning.
I thought sse-starlette keeps a long-lived connection, so when streaming a request, the memory used to process it is still not reclaimed after the response completes. That is why I think it is relevant; what do you think?
The streaming service was built with FastAPI, and Postman, a test program, and curl were used for testing. Even after the request has been fully responded to, it still occupies GPU memory that is never reclaimed. How do I solve this?
The following code:

```python
@app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
async def create_chat_completion(request: ChatCompletionRequest):
    global model, tokenizer


async def predict(query: str, history: List[List[str]], model_id: str):
    global model, tokenizer
```
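One common cause in setups like this is that the streaming generator keeps references to intermediate results alive after the stream ends; sse-starlette itself does not free them. A minimal sketch of a `try`/`finally` cleanup inside the generator (the `event_stream` name and the list of tokens are stand-ins for the real model call, which is not shown in the snippet above):

```python
import asyncio
import gc


async def event_stream(query: str):
    # Hypothetical stand-in for the model's token generator; in the
    # real app this would come from `model` / `tokenizer`.
    tokens = [f"{query}-{i}" for i in range(3)]
    try:
        for tok in tokens:
            yield tok  # each chunk would be sent as one SSE event
            await asyncio.sleep(0)
    finally:
        # Runs when the stream completes or the client disconnects.
        # Drop local references so Python can reclaim them; with
        # PyTorch you would additionally call torch.cuda.empty_cache()
        # here to return cached GPU blocks to the driver.
        del tokens
        gc.collect()


async def main():
    # Consume the stream the way EventSourceResponse would.
    return [chunk async for chunk in event_stream("q")]


print(asyncio.run(main()))  # → ['q-0', 'q-1', 'q-2']
```

Note that `torch.cuda.empty_cache()` only releases PyTorch's cached allocator blocks; memory still referenced by live tensors (e.g. a `history` list that keeps past activations) cannot be reclaimed until those references are dropped.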