Closed ericskiff closed 10 months ago
cc @DachengLi1
@ericskiff thanks for sharing this! How long is the input when you get OOM? And how large is your memory (you said 192 GB)? Are you using the CPU?
Hi @DachengLi1
I'm running on an M2 Mac with --device mps, and my GPU is utilized while it's running inference. The memory on these M1/M2 Macs is shared RAM/VRAM.
Things seem a bit better today after rebasing from master: memory usage levels out between 212 GB and 250 GB, which can mostly be handled with swap. Once my swap file reaches 60 GB, it starts to throw out-of-memory warnings. I tried the new memory-optimizations branch and the pattern seemed roughly the same.
My input prompt is small:
USER: What are the various skills needed to be a successful technical PM in a startup
It generates around 500 tokens at 4-6 tps, then physical memory runs out and it starts swapping to disk. Memory climbed to 186 GB used plus 20-50 GB swapped, and held there while continuing to generate at a much slower rate (around 0.5-1 tps). Once swapping starts, my GPU drops to about 60% utilization, which makes sense if it's spending time paging memory. At around 2,150 tokens the swap reached 60 GB and it threw OOM warnings constantly, though I could ignore them and it proceeded slowly. I assume that's my current swapfile size limit.
If I do much else on my computer, like loading a heavy webpage in Firefox, I get memory warnings, but it didn't actually crash with an OOM in this round of testing.
Thanks! -Eric
FYI, this also happens with non-longchat models. I just tested with psmathur_orca_mini_3b and saw the same behavior: memory usage balloons with context length at about the same rate.
One last update after running some further tests. I tried the psmathur_orca_mini_3b model with --device cpu. It runs much slower, but memory hovers consistently around 14-15 GB during inference.
python3 -m fastchat.serve.cli --model-path ~/src/longchat-13b-16k --device cpu
also runs slowly, as expected, but memory hovers around 49-50 GB during inference.
So this seems to be an issue with --device mps / Metal GPU usage. Perhaps when copying the context to the GPU, memory isn't being deallocated?
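One way to narrow down where the memory goes is to query PyTorch's MPS allocator directly between generation steps. This is a hypothetical helper, not part of FastChat; it assumes PyTorch >= 2.0 (where `torch.mps.current_allocated_memory` and `torch.mps.driver_allocated_memory` exist) and degrades gracefully when torch or MPS is unavailable:

```python
# Hypothetical debugging helper: report MPS allocator state between
# generation steps. Assumes PyTorch >= 2.0; falls back when unavailable.
try:
    import torch
    HAVE_MPS = torch.backends.mps.is_available()
except ImportError:
    HAVE_MPS = False

def report_mps_memory(tag):
    """Print tensor-level and driver-level MPS memory usage, if available."""
    if not HAVE_MPS:
        print(f"{tag}: MPS not available")
        return None
    allocated = torch.mps.current_allocated_memory()  # bytes held by live tensors
    driver = torch.mps.driver_allocated_memory()      # bytes held by the Metal driver
    print(f"{tag}: allocated={allocated / 2**20:.1f} MiB, "
          f"driver={driver / 2**20:.1f} MiB")
    return allocated

report_mps_memory("checkpoint")
```

If `allocated` stays flat while `driver` keeps climbing, the leak is below the tensor level (in the Metal allocator); if `allocated` itself climbs per token, tensors are being retained. `torch.mps.empty_cache()` can also be called between steps to see whether the cached-but-free memory is returned.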
I can corroborate this. I have an M2 Ultra (60 core). On the default vicuna model:
User: Tell me a one-paragraph story about a boy who saved the world.
Just quickly reading through the PR. From the comments, it looks like the ballooning memory is due to avoiding in-place operations.
With non-in-place operations, the mps backend is not usable beyond short one-shot generations, since you'll always run out of memory. Perhaps there should be a warning in the README about this?
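The in-place vs. non-in-place distinction matters for memory because each non-in-place step allocates a fresh buffer holding the entire (growing) history; if old buffers aren't freed promptly, usage balloons with sequence length. A torch-free sketch of the two growth patterns:

```python
# Non-in-place: each step builds a brand-new buffer containing the whole
# history, so step N allocates O(N) bytes even though only O(1) is new data.
def grow_copying(steps, token=b"x" * 1024):
    buf = b""
    for _ in range(steps):
        buf = buf + token  # fresh allocation every iteration
    return buf

# In-place: a mutable buffer is extended where it lives (amortized growth),
# so no per-step full copy of the history is made.
def grow_inplace(steps, token=b"x" * 1024):
    buf = bytearray()
    for _ in range(steps):
        buf += token  # extends the same underlying buffer
    return buf

assert grow_copying(50) == bytes(grow_inplace(50))
```

Both produce identical results; the difference is purely in transient allocations, which is exactly the kind of pressure that shows up if an allocator (like the MPS one here) is slow to reclaim the discarded copies.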
As for the bugs that needed working around: are they some of these open issues on PyTorch? It looks like there's a lot to do before mps is stabilized.
I also did the obvious thing and removed the monkey-patch. Memory still balloons, so I definitely don't understand what is going on here.
I also observe the same: memory usage blows up as more tokens are generated on M2. I don't see the VRAM blow-up on Linux + an NVIDIA GPU. Does anyone have a clue how to resolve this?
@ericskiff does this still happen with the latest PyTorch? Any progress here?
I've just done a test with Mistral-7B, and this seems improved enough to be marked resolved. Tokens per second still drops with context length and memory still climbs, but in a much more manageable way.
python3 -m fastchat.serve.cli --model-path Mistral-7B-OpenOrca --device mps
Memory usage:
- Before running the CLI: 81.3 GB
- After loading the model: 100.7 GB
- During/after generating a ~1000-token response: 135.16 GB
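For scale, a quick back-of-the-envelope on those figures (taking the ~1000-token response length as accurate):

```python
# Derived from the OS-reported figures above (all in GB).
before_cli, after_load, after_gen = 81.3, 100.7, 135.16
tokens = 1000  # approximate response length

model_footprint_gb = after_load - before_cli  # memory to load the model
gen_growth_gb = after_gen - after_load        # growth during generation
mb_per_token = gen_growth_gb * 1024 / tokens  # average growth per token

print(f"model load: {model_footprint_gb:.1f} GB")
print(f"growth per token: {mb_per_token:.1f} MB")
```

That works out to roughly 35 MB per generated token. Still far more than a 7B model's KV cache alone should need, but a large improvement over the ~200 MB/token originally reported for longchat-13b.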
When using FastChat to run the longchat model on a Mac M2, I was able to successfully generate output, but Python's memory usage ballooned by about 1 GB per 5 tokens generated until I ran out of RAM at 192 GB.
python3 -m fastchat.serve.cli --model-path src/longchat-13b-16k --device mps
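The reported ~1 GB per 5 tokens is far beyond what the KV cache alone should account for. As a rough check, assuming longchat-13b uses the LLaMA-13B shape (40 layers, hidden size 5120, fp16), the expected per-token cache growth can be compared with the observed growth:

```python
# Expected KV-cache growth for a LLaMA-13B-class model (assumed shape:
# 40 layers, hidden size 5120, fp16) vs. the growth reported above.
layers, hidden, bytes_fp16 = 40, 5120, 2
kv_bytes_per_token = 2 * layers * hidden * bytes_fp16  # K and V per layer
expected_mib = kv_bytes_per_token / 2**20              # MiB per token

observed_mb = 1024 / 5  # ~1 GB per 5 tokens, as reported

print(f"expected: {expected_mib:.2f} MiB/token")
print(f"observed: ~{observed_mb:.0f} MB/token")
```

Expected growth is under 1 MiB per token, while the observed rate is around 200 MB per token, a gap of two orders of magnitude. That points at buffers being retained (or the allocator failing to reclaim them) rather than the KV cache itself, consistent with the mps-specific behavior discussed above.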