GPU memory usage keeps increasing - Githubissues

li-plus / chatglm.cpp

C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4

MIT License

2.84k stars, 327 forks

Open Htring opened 6 months ago

Htring commented 6 months ago

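I quantized a LoRA fine-tuned model, then wrapped and deployed it using the Python binding. As the number of requests grows (the data volume is in the tens of thousands), GPU memory usage keeps increasing. Is there a good way to release GPU memory?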