wangzhaode / mnn-llm

LLM deployment project based on MNN.
Apache License 2.0
1.47k stars · 163 forks

16G + 3060 6G Laptop, out of memory. #25

Closed tangpingzhou closed 9 months ago

tangpingzhou commented 1 year ago

GPU memory is exhausted after opening the web page and submitting the first question.

The model I am using is int4. With the default settings, both GPU memory and system RAM are exhausted when loading reaches glm_block_26 and the process gets killed. After setting swap to 16GB the model can be loaded; once loading finishes, GPU memory usage is 5.6/6.0GB and RAM usage is 9/16GB. After sending the first question, GPU memory runs out and the following is printed:

Out of memory when gamma is acquired in CudaLayerNorm. Out of memory when beta is acquired in CudaLayerNorm. Out of memory when gamma is acquired in CudaLayerNorm. Out of memory when beta is acquired in CudaLayerNorm. Can't run session because not resized
Segmentation fault

System configuration: GPU: RTX 3060 6G Laptop; RAM: 16G; OS: Win11 + WSL2 + Ubuntu 20.04; Python + torch + CUDA: 3.10.9 + 1.13.1 + cu117

wangzhaode commented 1 year ago

Try setting the gpu memory value to something smaller.

Currently int4 is weight-only quantization; a dequant step runs at load time, so memory usage is the same as the floating-point model. Runtime dequant in MNN is under development.
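For context, here is a minimal sketch of why load-time dequantization makes an int4 model cost as much memory as an fp32 one. The function name and the single per-tensor scale/zero-point are illustrative assumptions, not the actual mnn-llm loader code.

```cpp
// Illustrative only: not the actual mnn-llm loader. int4 weights are packed
// two per byte on disk, but expanding them to float at load time makes the
// in-memory size the same as an fp32 model.
#include <cstdint>
#include <vector>

std::vector<float> dequantize_int4(const std::vector<uint8_t>& packed,
                                   float scale, float zero_point) {
    std::vector<float> weights;
    weights.reserve(packed.size() * 2);          // 2 int4 values per byte
    for (uint8_t byte : packed) {
        int lo = byte & 0x0F;                    // low nibble
        int hi = (byte >> 4) & 0x0F;             // high nibble
        weights.push_back((lo - zero_point) * scale);
        weights.push_back((hi - zero_point) * scale);
    }
    // Each packed byte (0.5 byte per weight) becomes two 4-byte floats,
    // i.e. roughly an 8x expansion relative to the on-disk int4 file.
    return weights;
}
```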

tangpingzhou commented 1 year ago

Try setting the gpu memory value to something smaller.

Currently int4 is weight-only quantization; a dequant step runs at load time, so memory usage is the same as the floating-point model. Runtime dequant in MNN is under development.

I changed gpusize in web_demo.cpp from 8.0 to 6.0. While loading the model, RAM usage is 11/16 and GPU memory usage is 4.8/6.0, and I can now submit questions from the web page normally.

However, when asking a question, RAM and GPU memory usage drop slightly, the SSD is at full load, and the GPU is almost idle apart from brief bursts of high load. My guess is that inference is running against the SSD swap being used as memory.
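For illustration, here is a minimal sketch of how a gpusize-style GPU memory budget might decide which layers stay on the GPU. The names and the greedy placement policy are hypothetical and not the actual web_demo.cpp logic; a smaller budget keeps more blocks on the CPU, which avoids GPU OOM but leaves inference bound by RAM and swap, matching the behavior described above.

```cpp
// Illustrative only: a hypothetical layer-placement policy, not mnn-llm code.
#include <cstddef>
#include <vector>

enum class Device { GPU, CPU };

std::vector<Device> place_layers(size_t num_layers,
                                 double layer_size_gb,
                                 double gpu_budget_gb) {
    std::vector<Device> placement;
    double used_gb = 0.0;
    for (size_t i = 0; i < num_layers; ++i) {
        if (used_gb + layer_size_gb <= gpu_budget_gb) {
            placement.push_back(Device::GPU);   // fits within the GPU budget
            used_gb += layer_size_gb;
        } else {
            placement.push_back(Device::CPU);   // spills to system RAM (or swap)
        }
    }
    return placement;
}
```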

wangzhaode commented 1 year ago

Yes, it should be caused by a memory bottleneck.

wangzhaode commented 1 year ago

It will improve once we optimize the memory loading approach; that work is currently in development.
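For reference, a generic sketch of what runtime (on-the-fly) dequantization usually looks like: weights stay packed as int4 in memory and are expanded only row by row inside the matmul, so peak memory stays near the int4 file size instead of the fp32 size. The names and single per-tensor scale/zero-point are illustrative assumptions, not the MNN implementation.

```cpp
// Illustrative only: generic on-the-fly dequantization, not MNN code.
#include <cstdint>
#include <vector>

void matvec_int4(const std::vector<uint8_t>& packed_w,  // rows * cols / 2 bytes
                 const std::vector<float>& x,            // length cols
                 std::vector<float>& y,                  // output, length rows
                 size_t rows, size_t cols,
                 float scale, float zero_point) {
    y.assign(rows, 0.0f);
    std::vector<float> row(cols);                        // small scratch buffer only
    for (size_t r = 0; r < rows; ++r) {
        for (size_t c = 0; c < cols; ++c) {
            uint8_t byte = packed_w[(r * cols + c) / 2];
            int q = (c % 2 == 0) ? (byte & 0x0F) : ((byte >> 4) & 0x0F);
            row[c] = (q - zero_point) * scale;           // dequantize on the fly
        }
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) acc += row[c] * x[c];
        y[r] = acc;
    }
}
```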

tangpingzhou commented 1 year ago

It will improve once we optimize the memory loading approach; that work is currently in development.

OK, thanks for your hard work, really looking forward to it!

github-actions[bot] commented 9 months ago

Marking as stale. No activity in 30 days.