lm-sys / FastChat

An open platform for training, serving, and evaluating large language models. Release repo for Vicuna and Chatbot Arena.
Apache License 2.0

Mac M2: Memory usage growing by 1g per 4-5 tokens generated #1834

Closed ericskiff closed 10 months ago

ericskiff commented 1 year ago

When using FastChat to run the longchat model on a Mac M2, I was able to successfully generate output, but Python's memory usage ballooned by about 1 GB every 4-5 tokens generated until I ran out of RAM at 192 GB.

python3 -m fastchat.serve.cli --model-path src/longchat-13b-16k --device mps

merrymercy commented 1 year ago

cc @DachengLi1

DachengLi1 commented 1 year ago

@ericskiff thanks for sharing this! How long is the input when you get OOM, and how large is your memory? You said 192 GB; are you using the CPU?

ericskiff commented 1 year ago

Hi @DachengLi1 I'm running on an M2 mac with --device mps, my GPU is utilized when it's running inference.

The memory on these M1/M2 macs is shared ram/vram

Things seem a bit better today after rebasing from master: memory usage now seems to level out between 212 GB and 250 GB, which can mostly be handled with swap. Once my swap file gets to 60 GB, it starts to cause out-of-memory warnings. I tried the new memory-optimizations branch and the pattern seemed roughly the same.

My input prompt is small: "USER: What are the various skills needed to be a successful technical PM in a startup"

It generates around 500 tokens at 4-6 tps, and then physical memory runs out and it starts to swap to disk. It got up to 186 GB of memory used and 20-50 GB swapped, and held there while continuing to generate at a much slower rate (around 0.5-1 tps). When it starts to swap, my GPU goes to about 60% utilization, which makes sense if it's spending time swapping memory. When I got to about 2150 tokens, the swap had reached 60 GB and it threw OOM warnings constantly, but I can ignore them and it proceeds slowly. I assume that's my current swapfile size limit.

If I do much else on my computer, like loading a heavy webpage in Firefox, I get memory warnings, but it didn't actually crash with an OOM in my testing this time.

Thanks! -Eric

ericskiff commented 1 year ago

FYI, this also happens with non-longchat models. I just tested with psmathur_orca_mini_3b and the same behavior occurs: memory usage balloons with context length at about the same rate.

ericskiff commented 1 year ago

One last update as I ran some further tests: I just tried the psmathur_orca_mini_3b model with --device cpu. It runs much slower, but memory hovers around 14-15 GB consistently as it runs inference.

python3 -m fastchat.serve.cli --model-path ~/src/longchat-13b-16k --device cpu also runs slowly as expected, but memory hovers around 49-50 GB during inference.

So this seems to be an issue with the --device mps Metal GPU path; perhaps memory isn't being deallocated when the context is copied to the GPU?
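For anyone who wants to poke at this hypothesis outside of FastChat, here is a minimal sketch of a plain-Transformers decode loop that prints the MPS allocator's view of memory every few tokens. It assumes PyTorch 2.x (where torch.mps.current_allocated_memory() and torch.mps.empty_cache() are available) and uses psmathur/orca_mini_3b purely as a placeholder model; this is not FastChat's code path.

```python
# Minimal sketch, not FastChat code: watch MPS allocations during a manual
# decode loop. Assumes PyTorch >= 2.0 and transformers are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps"
model_id = "psmathur/orca_mini_3b"  # placeholder model for illustration

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(device)

ids = tok("USER: What are the various skills needed to be a successful technical PM?",
          return_tensors="pt").input_ids.to(device)

past = None
for step in range(100):
    with torch.no_grad():
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    ids = torch.cat([ids, next_id], dim=-1)

    if step % 10 == 0:
        # Ask the MPS caching allocator to release unused blocks, then report
        # how much it still holds; if this keeps climbing by ~1 GB every few
        # tokens, allocations are not being reused or freed.
        torch.mps.empty_cache()
        gb = torch.mps.current_allocated_memory() / 1e9
        print(f"step {step}: {gb:.2f} GB allocated on MPS")
```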

crasm commented 1 year ago

I can corroborate this. I have an M2 Ultra (60 core) and see the same memory growth with the default vicuna model.

crasm commented 1 year ago

Just quickly reading through the PR. From the comments, it looks like the ballooning memory is due to avoiding in-place operations.
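To make that concrete, here is a toy, self-contained illustration (my own sketch, unrelated to the actual monkey-patch) of the difference: an in-place update reuses the same storage on every step, while an out-of-place one materializes a fresh tensor each iteration, and on MPS the caching allocator may hold on to the old blocks, so peak memory climbs.

```python
# Toy sketch (assumption: plain PyTorch, not FastChat internals) contrasting
# in-place and out-of-place updates inside a per-step loop.
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

# In-place: the same 64 MB (float32, 4096x4096) buffer is reused every step.
x = torch.zeros(4096, 4096, device=device)
for _ in range(100):
    x.add_(1.0)

# Out-of-place: each step allocates a brand-new 64 MB tensor; whether the
# previous one is promptly released is up to the backend's allocator, so
# peak memory can be far higher than the in-place version.
y = torch.zeros(4096, 4096, device=device)
for _ in range(100):
    y = y + 1.0
```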

crasm commented 1 year ago

With non-in-place operations, the mps backend is not usable beyond short one-shot generations, since you'll always run out of memory. Perhaps there should be a warning in the README about this?

As far as the bugs that needed working around go, is it... some of these open issues on PyTorch? Looks like there's a lot to do before it's stabilized.


Also, I did the obvious thing and removed the monkey-patch. Memory is still ballooning, so I definitely do not understand what is going on here.

wendaoliuxy commented 1 year ago

I also observe the same: memory usage blows up as more tokens are generated on M2. I don't see the VRAM blowing up on Linux + NVIDIA GPU. Does anyone else have a clue how to resolve this?

surak commented 10 months ago

@ericskiff does this still happen with the latest PyTorch? Any progress here?

ericskiff commented 10 months ago

I've just done a test with Mistral-7B and this seems improved enough to be marked resolved. Tokens per second still slows with context length and memory still climbs, but in a much more managed way.

python3 -m fastchat.serve.cli --model-path Mistral-7B-OpenOrca --device mps

Memory usage:

- Before running the CLI: 81.3 GB
- After loading the model: 100.7 GB
- During / after generating a ~1000 token response: 135.16 GB
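If anyone wants to log the same checkpoints from a script rather than eyeballing system memory, a small helper like the hypothetical sketch below (using psutil, which reports system-wide usage; not part of FastChat) would do it.

```python
# Hypothetical helper, not part of FastChat: log system-wide memory usage at
# the same three checkpoints quoted above. Assumes psutil is installed.
import psutil

def log_mem(label: str) -> None:
    used_gb = psutil.virtual_memory().used / 1e9
    print(f"{label}: {used_gb:.1f} GB used system-wide")

log_mem("Before running the CLI")
# ... load the model here ...
log_mem("After loading the model")
# ... generate a ~1000 token response here ...
log_mem("After generating")
```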