turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

convert.py measurement "Killed" #504

Open · GHBigD opened this issue 2 weeks ago

GHBigD commented 2 weeks ago

I have my own homebrew Model Stock merged Wizard 8x22B base model that, when I try to do a measurement pass, ultimately bombs with a "Killed." I didn't see a way to get more verbose debugging information. The model in its full bfloat16 glory loads and runs; I just cannot make a successful measurement. As I am typing this, I am wondering if it is a hardware requirement issue, since I spun up a much smaller instance for the measurements.

GHBigD commented 2 weeks ago

...nope...

Pevernow commented 2 weeks ago

There is not enough RAM for it.

turboderp commented 2 weeks ago

"Killed" without any explanation usually means you're out of system memory. Are you on Windows by any chance?

GHBigD commented 2 weeks ago

"Killed" without any explanation usually means you're out of system memory. Are you on Windows by any chance?

Ubuntu 22.04 LTS, 50 GB RAM, 48 GB VRAM (single GPU). Reading the hardware requirements now, I realize I was too focused on the VRAM and not the RAM. That's because I am a regular-speed derp instead of turbo.

turboderp commented 2 weeks ago

For this particular model I don't think 50 GB of RAM is quite enough.

*(screenshot: system memory usage during the measurement pass)*

The biggest issue seems to be measuring the block-sparse MLP layers, where I'm using about 64 GB of system RAM. The layers are very large, and you need 13 versions of each tensor to produce the 17 layer variants to measure. They have to reside somewhere, and saving them temporarily to disk probably isn't going to perform better than just using swap memory.
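
For a sense of scale, a back-of-envelope sketch lands close to that figure, assuming the standard Mixtral-8x22B dimensions that WizardLM-2 8x22B shares (hidden size 6144, intermediate size 16384, 8 experts; these numbers are assumptions, not read from the model in the issue):

```python
# Rough RAM estimate for holding 13 fp16 copies of one block-sparse MLP layer.
hidden, intermediate, experts = 6144, 16384, 8   # assumed Mixtral-8x22B dims
bytes_per_element = 2                            # fp16/bf16 weights

# Each expert MLP has three projections: gate and up (hidden -> intermediate)
# and down (intermediate -> hidden), all with the same element count.
params_per_layer = experts * 3 * hidden * intermediate

versions = 13                                    # tensor versions held at once
ram_gb = params_per_layer * bytes_per_element * versions / 1e9
print(f"~{ram_gb:.0f} GB")                       # ~63 GB, matching the ~64 GB observed
```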

They could be packed, of course, and I think I'll look into that at some point. It's not a priority for now, though; too many other things happening. You could try a swap partition, or perhaps use a measurement.json from another Wizard 8x22B model.
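
If you try the swap route, an on-the-fly swap file is enough on Ubuntu (the size here is a placeholder):

```bash
# Create and enable a 64 GB swap file; disable it later with `sudo swapoff /swapfile`.
sudo fallocate -l 64G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
```

And if you reuse an existing measurement, convert.py accepts a pre-computed measurement file via -m, which skips the measurement pass entirely (all paths here are placeholders):

```bash
python convert.py \
    -i /path/to/merged-wizard-8x22b \
    -o /path/to/workdir \
    -m /path/to/wizard-8x22b-measurement.json \
    -cf /path/to/output-exl2 \
    -b 4.0
```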

GHBigD commented 1 week ago

> For this particular model I don't think 50 GB of RAM is quite enough.
>
> You could try a swap partition, or perhaps use a measurement.json from another Wizard 8x22B model.

I spun up a 150 GB instance and I can confirm it is humming along fine at the moment. At the peak, it was using 18 GB of VRAM and 60 GB of RAM.

Thanks for the help!