johnsmith0031 / alpaca_lora_4bit


What made inference slower? #76

Open · Ph0rk0z opened this issue 1 year ago

Ph0rk0z commented 1 year ago

Is it because atomicAdd isn't used?

autograd.py, 13b llama model, GPTQ v2:

Output generated in 10.50 seconds (0.86 tokens/s, 9 tokens, context 1848, seed 1159591743)
Output generated in 10.26 seconds (0.88 tokens/s, 9 tokens, context 1848, seed 1402821590)
Output generated in 10.26 seconds (0.88 tokens/s, 9 tokens, context 1848, seed 1171490529)

Regular sterlind GPTQ:

Output generated in 8.94 seconds (1.01 tokens/s, 9 tokens, context 1848, seed 924271439)
Output generated in 8.72 seconds (1.03 tokens/s, 9 tokens, context 1848, seed 1003244791)
Output generated in 8.72 seconds (1.03 tokens/s, 9 tokens, context 1848, seed 1551127645)
johnsmith0031 commented 1 year ago

I think the new model format is a bit slower simply because the group index is not sequential. For example, a v2 model quantized with only a group size has a g_idx like tensor([0, 0, 0, ..., 31, 31, 31], device='cuda:0', dtype=torch.int32), but in the other model format the g_idx looks like tensor([21, 22, 20, ..., 5, 23, 27], device='cuda:0', dtype=torch.int32), leading to random access into the VRAM holding qzeros, which decreases performance.
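For illustration, here is a minimal sketch (plain torch, not code from this repo) of the two g_idx layouts described above, assuming a hypothetical 4096-column weight with groupsize 128:

import torch

# Hypothetical illustration of the two g_idx layouts: 4096 inputs, groupsize 128 -> 32 groups.
in_features, groupsize = 4096, 128

# Sequential g_idx (v2 model quantized with a group size only): consecutive rows
# share a group, so lookups into qzeros/scales are contiguous and cache-friendly.
g_idx_sequential = torch.arange(in_features, dtype=torch.int32) // groupsize
# tensor([ 0,  0,  0, ..., 31, 31, 31], dtype=torch.int32)

# Act-order-style g_idx: each row's group is effectively shuffled, so consecutive
# rows hit different groups and the qzeros reads become scattered (random) VRAM accesses.
g_idx_act_order = g_idx_sequential[torch.randperm(in_features)]
# e.g. tensor([21, 22, 20, ...,  5, 23, 27], dtype=torch.int32)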

Ph0rk0z commented 1 year ago

Same model, though. Both are V2 format. One is using autograd.py for inference and the other one is using plain load_quant.

In the previous version the situation was reversed: quant.py struggled and autograd.py went fast.

Funnily enough, at peak usage my GPU (P6000) makes coil whine, and now autograd doesn't make it sing anymore.

johnsmith0031 commented 1 year ago

Have you tried AMPWrapper? It can speed up inference by 5%-10%. Simply add it at the end of the load_model function in the monkey patch:

from amp_wrapper import AMPWrapper

wrapper = AMPWrapper(model)
wrapper.apply_generate()  # patches model.generate to run under AMP
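For context, AMPWrapper's job here is essentially to run generation under torch's automatic mixed precision. A minimal sketch of that idea (an illustration only, not the repo's actual amp_wrapper.py):

import torch

class SimpleAMPWrapper:
    # Illustration only: wrap model.generate so it runs under CUDA autocast,
    # letting fp32 ops execute in fp16 where it is safe to do so.
    def __init__(self, model):
        self.model = model
        self._orig_generate = model.generate

    def apply_generate(self):
        def generate(*args, **kwargs):
            with torch.cuda.amp.autocast(dtype=torch.float16):
                return self._orig_generate(*args, **kwargs)
        self.model.generate = generate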
Ph0rk0z commented 1 year ago

I have not, but I will now. I don't really use the monkey patch; I forked textgen to load any model with this, not just llama.

I added it to GPTQ_loader and it slowed things down.

No xformers, no wrapper:
Output generated in 16.26 seconds (2.40 tokens/s, 39 tokens, context 572, seed 1855748116)
Output generated in 15.94 seconds (2.45 tokens/s, 39 tokens, context 572, seed 1644521480)
Output generated in 15.94 seconds (2.45 tokens/s, 39 tokens, context 572, seed 443539762)

No xformers, with wrapper:

Output generated in 16.39 seconds (2.38 tokens/s, 39 tokens, context 572, seed 1987122874)
Output generated in 16.15 seconds (2.41 tokens/s, 39 tokens, context 572, seed 520002564)
Output generated in 16.17 seconds (2.41 tokens/s, 39 tokens, context 572, seed 1374736987)

What I did:

    if not shared.args.lora or shared.lora_name == "None":
        print('Apply auto switch and half. Lora:', shared.lora_name)
        # Cast the quantized layers' parameters to fp16
        for n, m in model.named_modules():
            if isinstance(m, Autograd4bitQuantLinear):
                if shared.args.v1:
                    m.zeros = m.zeros.half()
                m.scales = m.scales.half()
                m.bias = m.bias.half()
        autograd_4bit.use_new = True
        autograd_4bit.auto_switch = True
        from amp_wrapper import AMPWrapper
        wrapper = AMPWrapper(model)
        wrapper.apply_generate()
        return model  # let textgen handle the tokenizer
johnsmith0031 commented 1 year ago

Weird. Will look into it.

Ph0rk0z commented 1 year ago

Hey, I found the reason why AMP didn't work.

It is because I did not know to add model.half() like you did in the monkey patch.

Now with this I got https://huggingface.co/notstoic/OPT-13B-Erebus-4bit-128g to generate without error.

Output generated in 71.41 seconds (1.61 tokens/s, 115 tokens, context 33, seed 674562605)
Output generated in 907.23 seconds (0.19 tokens/s, 176 tokens, context 1306, seed 1)

However, it is still slow.

As for alpaca-30b-4bit:

Regular GPTQ:
Output generated in 24.21 seconds (2.27 tokens/s, 55 tokens, context 1575, seed 1)
Output generated in 24.05 seconds (2.29 tokens/s, 55 tokens, context 1575, seed 1)

Autograd:
Output generated in 41.81 seconds (0.89 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 41.92 seconds (0.88 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 42.01 seconds (0.88 tokens/s, 37 tokens, context 1448, seed 1)

But as a side bonus, the 30b no longer goes OOM when using AMP + autograd.

Of course GPTQ can't use model.half(), and the wrapper still makes no difference there.

GPTQ with wrapper

Output generated in 20.70 seconds (1.79 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 20.80 seconds (1.78 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 20.66 seconds (1.79 tokens/s, 37 tokens, context 1448, seed 1)

GPTQ without the wrapper

Output generated in 20.42 seconds (1.81 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 20.62 seconds (1.79 tokens/s, 37 tokens, context 1448, seed 1)
Output generated in 20.50 seconds (1.81 tokens/s, 37 tokens, context 1448, seed 1)

Edit: So now I am doing evaluations, since that's been added to ooba.

For gpt-x-alpaca-13b-native-4bit-128g, the wikitext eval is:

30:47, 2.83s/it using load_quant
20:00, 1.87s/it using autograd.py

So it's about 1/3 faster for computing perplexity, but for plain inference it's slower. I'm completely confused now.

johnsmith0031 commented 1 year ago

I tried adding some optimizations, but the vanilla model is still about 8% slower than the GPTQ repo's triton backend. Currently the main time loss is converting float32 to float16: the 4-bit matmul costs about 0.00017s per matmul, but with the casting time included it is about 0.00025s. If that can be resolved, I think inference speed would increase by about 20%-30%. I made some attempts at supporting atomicAdd on half, but they didn't work. Hope someone has ideas on this.
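To put those numbers in perspective, a quick back-of-the-envelope check (illustrative arithmetic only, using the figures quoted above):

# Rough sanity check of the per-matmul timings quoted above.
matmul_only = 0.00017       # seconds: 4-bit matmul kernel alone
matmul_with_cast = 0.00025  # seconds: same matmul including the float32 -> float16 cast

cast_overhead = matmul_with_cast - matmul_only   # 0.00008 s
print(cast_overhead / matmul_with_cast)          # ~0.32, i.e. casting is roughly a third of each matmul
# If matmul time dominates decoding, removing the cast could plausibly yield the 20%-30% mentioned above.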

Ph0rk0z commented 1 year ago

I can't even try triton, so I can't speak to that. But on every single wikitext 512-context evaluation, this repo is beating CUDA GPTQ by 1/3. It's just in chat and notebook where it appears slower, so I think it depends on the work you give it.

And that AMP wrapper blocks evaluation when a lora is loaded:

/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found Half

I did around 24 evaluations.

johnsmith0031 commented 1 year ago

Not sure what the cause is... Maybe try casting to the correct data type before F.linear, or something along those lines. I'll also keep improving the CUDA kernel.
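A hedged sketch of that suggestion, assuming the error comes from an fp32 lora Linear receiving fp16 activations from the half-cast quant layers (hypothetical code, not from this repo or from PEFT):

import torch.nn.functional as F

# Hypothetical forward for the offending Linear: cast the input to the weight's
# dtype before F.linear so Half activations don't collide with Float weights.
def forward(self, input):
    if input.dtype != self.weight.dtype:
        input = input.to(self.weight.dtype)
    return F.linear(input, self.weight, self.bias)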

Ph0rk0z commented 1 year ago

I tried to add the new MLP/attention patch:

Before:
Output generated in 6.59 seconds (2.12 tokens/s, 14 tokens, context 1848, seed 1140471273)
After:
Output generated in 6.61 seconds (2.12 tokens/s, 14 tokens, context 1848, seed 185848109)

Not sure if I did it right:

    # from amp_wrapper import AMPWrapper
    # model.half()  # can't benchmark with a lora
    for n, m in model.named_modules():
        if isinstance(m, Autograd4bitQuantLinear):
            if shared.args.v1:
                m.zeros = m.zeros.half()
            m.scales = m.scales.half()
            m.bias = m.bias.half()
    # wrapper = AMPWrapper(model)
    # wrapper.apply_generate()
    from model_attn_mlp_patch import make_quant_attn
    make_quant_attn(model)
    print(Style.BRIGHT + Fore.RED + 'Finalizing Autograd Lora:', shared.lora_names)
johnsmith0031 commented 1 year ago

You can try this:

from model_attn_mlp_patch import make_quant_attn, make_fused_mlp
make_quant_attn(model)
make_fused_mlp(model)

I tested on an RTX 3090 with an old 7b model, groupsize=128, and got 12.5 tokens/second without a lora and 11 tokens/second with a lora. Without quant_attn the speed is around 10 tokens/second. Maybe the performance also depends on the hardware.

Ph0rk0z commented 1 year ago

There is a slight speedup on Pascal. I should see if wikitext evaluation times go down and by how much. But then I will have to inject the loras instead of using PEFT, which I guess is doable.

With the patch:
7b
Output generated in 4.38 seconds (4.57 tokens/s, 20 tokens, context 44, seed 398347313)
Output generated in 7.91 seconds (2.78 tokens/s, 22 tokens, context 1848, seed 967713335)

13b
Output generated in 7.50 seconds (2.67 tokens/s, 20 tokens, context 44, seed 1108646784)
Output generated in 12.25 seconds (1.31 tokens/s, 16 tokens, context 1848, seed 437978839)

Without the patch:
7b
Output generated in 4.41 seconds (4.53 tokens/s, 20 tokens, context 44, seed 1501977265)
Output generated in 7.97 seconds (2.76 tokens/s, 22 tokens, context 1848, seed 579788101)

13b
Output generated in 7.54 seconds (2.65 tokens/s, 20 tokens, context 44, seed 2044570763)
Output generated in 12.35 seconds (1.30 tokens/s, 16 tokens, context 1848, seed 69632889)
Ph0rk0z commented 1 year ago

I am late to the game now, but I finally solved this: it is FP16 acceleration.

Just like here with the "faster" parameter, except I think you call functions like that directly. That is what slowed generation speed: https://github.com/PanQiWei/AutoGPTQ/blob/faster-cuda-no-actorder/auto_gptq/nn_modules/qlinear_old.py#L21

Ironically though, when doing a benchmark like PTB/wikitext, this method wins out by 20%; for anything normal it runs at half speed.

johnsmith0031 commented 1 year ago

I see, it seems that my optimization of the kernel only helps the text generation task. The reason I use the faster kernel that does not support act-order is that it is fastest for small matrix multiplications. When generating text, tokens are produced one by one, which means the input size is small each time, so the optimization works. On other tasks, like finetuning, reconstructing the int4 matrix to fp16 and using torch.matmul outperforms any other method once the context length is > 256. Maybe we just lack an optimized kernel for mid-size matrix multiplication.
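To make the small-vs-large matmul point concrete, here is a standalone timing sketch (plain fp16 torch.matmul, not this repo's 4-bit kernels) comparing a 1-row input, which is what per-token decoding looks like, with a 512-row input, which is what evaluation or long-prompt prefill looks like:

import time
import torch

# Illustration: time a 1-row matmul (single-token decode) against a 512-row matmul
# (perplexity eval / prefill). The two workloads favour differently tuned kernels.
hidden = 5120  # 13b-sized hidden dim, chosen for illustration
w = torch.randn(hidden, hidden, device='cuda', dtype=torch.float16)

for rows in (1, 512):
    x = torch.randn(rows, hidden, device='cuda', dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        torch.matmul(x, w)
    torch.cuda.synchronize()
    print(rows, 'rows:', (time.time() - start) / 100, 's per matmul')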

johnsmith0031 commented 1 year ago

Also, using the model together with gradio decreases its performance by about 1/3. I don't know why.