intel / intel-extension-for-transformers


[Question] Inference latency and accuracy about WeightOnlyQuantConfig #813

Closed · park12sj closed this issue 7 months ago

park12sj commented 9 months ago

I'm experiencing the two issues below. If there is any other way to improve inference speed while minimizing the loss of accuracy, please let me know.

  1. If use_quant is set to False and the model is executed in the LLM runtime without quantization, it is slower than the baseline (392 ms -> 1332 ms).
    woq_config = WeightOnlyQuantConfig(
        use_quant=False
    )

    This seems to be because our model's weights are natively fp16, but ITREX converts the model to fp32.

However, I hardcoded the ftype to 1 so that the model is converted to fp16, but it is still slower than the baseline (392 ms -> 586 ms).

I wonder whether my usage is wrong or whether I simply can't expect a performance improvement without quantization.

  2. Accuracy is somewhat degraded when quantization is applied with the setting (compute_dtype=int8, weight_dtype=int4).

I'm sorry in advance that I can't share the prompt and inference results because they contain sensitive information. What I'm curious about is whether quantization-induced accuracy loss is common for a small model of about 5.8B parameters.

I tried the setting below as a test, but I think it's a combination that is not yet supported (compute_dtype=bf16, weight_dtype=int8):

    /opt/conda/envs/py310/lib/python3.10/site-packages/intel_extension_for_transformers/llm/runtime/graph/__init__.py:87 in init

       84         # check cache and quantization
       85         if use_quant:
       86             if quant_kwargs['weight_dtype'] == "int8" and quant_kwargs['compute_dtype'] ...
     > 87                 raise ValueError("Error: This combination (weight_dtype=int8, compute_dt ...
       88                                  " is not currently supported. Please use other combinat ...
       89         output_path = "runtime_outs"
       90         os.makedirs(output_path, exist_ok=True)

    ValueError: Error: This combination (weight_dtype=int8, compute_dtype=bf16) is not currently supported. Please use other combinations.
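For context, a minimal sketch of the loading pattern described above (the model name is omitted because it is sensitive; the ITREX AutoModelForCausalLM wrapper and quantization_config argument are assumed from the project's weight-only quantization examples):

    from intel_extension_for_transformers.transformers import (
        AutoModelForCausalLM,
        WeightOnlyQuantConfig,
    )

    # Issue 1: skip quantization and run the model in the LLM runtime as-is.
    woq_config = WeightOnlyQuantConfig(use_quant=False)

    # Issue 2: quantize; accuracy drops with this combination.
    # woq_config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")

    model = AutoModelForCausalLM.from_pretrained(
        "<model-name>",                  # placeholder
        quantization_config=woq_config,
        trust_remote_code=True,
    )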
zhenwei-intel commented 9 months ago

Hi @park12sj,

> Accuracy is somewhat degraded when quantization is applied with the setting (compute_dtype=int8, weight_dtype=int4)

You can try the combination of int4 + bf16 with asymmetric quantization. Support for int8 + bf16 quantization will be integrated soon.
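For reference, a minimal sketch of that suggested configuration (assuming the scheme parameter accepts "asym" for asymmetric quantization, as in the project's weight-only quantization examples):

    woq_config = WeightOnlyQuantConfig(
        compute_dtype="bf16",  # computation in bf16
        weight_dtype="int4",   # weights stored as int4
        scheme="asym",         # asymmetric quantization
    )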

zhenwei-intel commented 9 months ago

> I wonder whether my usage is wrong or whether I simply can't expect a performance improvement without quantization.

Your usage is right!

We currently don't do much optimization for fp32 and fp16.

fp16 only saves memory bandwidth; the weights are converted to fp32 for computation, so performance does not reach what true fp16 execution would give.

In the next stage there will be kernel optimizations for floating-point types, which will bring performance benefits.

park12sj commented 9 months ago

Hi, @zhenwei-intel

For bf16, I understand that computation is also supported by the AVX-512 or AMX instruction sets. Do you have any plans to support bf16 inference in the non-quantized LLM runtime?

For example, it would be nice to be able to call data.tofile with a bf16 type in the code below (I haven't tried it because numpy has no attribute bfloat16): https://github.com/intel/intel-extension-for-transformers/blob/caa715fb5a6df223b3ae9a49694e48c5e984c585/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_gptneox.py#L172-L180
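As an illustration only (this helper is not part of the convert script), a common workaround for numpy's missing bfloat16 is to round the fp32 bit pattern and keep its upper 16 bits as uint16:

    import numpy as np

    def fp32_to_bf16_bits(data):
        """Return bf16 bit patterns of an fp32 array, stored as uint16."""
        bits = np.ascontiguousarray(data, dtype=np.float32).view(np.uint32)
        # round to nearest even before dropping the low 16 mantissa bits
        bits = bits + 0x7FFF + ((bits >> 16) & 1)
        return (bits >> 16).astype(np.uint16)

    # e.g. fp32_to_bf16_bits(data).tofile(fout) instead of data.tofile(fout)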

zhenwei-intel commented 9 months ago

Hi @park12sj ,

We will support int8 (storage) + bf16 (computation) soon; it may take 1-2 weeks. The bottleneck of LLM inference on CPU is memory bandwidth, so it is better to use the quantized LLM runtime. If you can share your model, we can help check the accuracy issue.

Thanks

kevinintel commented 7 months ago

int8 + bf16 is done.
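For anyone finding this thread later, a minimal sketch of the now-supported combination (same WeightOnlyQuantConfig parameters as discussed above):

    woq_config = WeightOnlyQuantConfig(
        compute_dtype="bf16",  # bf16 computation
        weight_dtype="int8",   # int8 weight storage
    )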