neuralmagic / AutoFP8
Apache License 2.0 · 150 stars · 17 forks
Issues
#45 · Add support for dbrx moe · charlifu · opened 1 week ago · 0 comments
#44 · How to Quantize KV Cache for the deepseek-v2 Model · fengyang95 · opened 3 weeks ago · 0 comments
#43 · Floating point exception (core dumped) · fengyang95 · closed 3 weeks ago · 0 comments
#42 · Quantize DeepSeek-Coder-V2-Instruct to W8A8(INT8)? · halexan · opened 1 month ago · 0 comments
#41 · User tokens masking? · bilunsun · closed 1 month ago · 2 comments
#40 · Differences in Dynamic Quantization Speedup for Varying SFT Tasks on Qwen2-72b-Instruct Models · IPostYellow · opened 1 month ago · 0 comments
#39 · Qwen2-72B-Instruct-FP8 generate bad output using cutlass but be fine with torch._scaled_mm · Juelianqvq · closed 1 month ago · 0 comments
#38 · fp8 vs bf16 performance problem · AllenDou · closed 1 month ago · 5 comments
#37 · LLaMA3 report · Eric-mingjie · closed 1 month ago · 0 comments
#36 · CUDA out of memory when quantizing llama3.1-405b on 80GiBx8 H100 instance · sfc-gh-zhwang · opened 1 month ago · 2 comments
#35 · [Qusetion] Calibration datasets · cyc00518 · closed 1 month ago · 2 comments
#34 · Support for Vision models · Syst3m1cAn0maly · opened 2 months ago · 0 comments
#33 · Switch backend to use llm-compressor · mgoin · opened 2 months ago · 0 comments
#32 · [Feature] ADD Support for DeepSeek-V2-Chat · Xu-Chen · closed 2 months ago · 1 comment
#31 · Runtime Error: The weights trying to be saved contained shared tensors. · IEI-mjx · opened 2 months ago · 3 comments
#30 · Integration with Hugging Face transformers library · SunMarc · opened 2 months ago · 3 comments
#29 · DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 · Syst3m1cAn0maly · closed 2 months ago · 10 comments
#28 · error: RuntimeError: The weights trying to be saved contained shared tensors · AlphaINF · opened 2 months ago · 0 comments
#27 · I get empty response after Quantizing LLama2 70B using AutoFP8 with calibration · e3oroush · closed 2 months ago · 9 comments
#26 · E5M2 or mix format trial · zitgit · closed 2 months ago · 2 comments
#25 · Separate `kv_scale` into `k_scale` and `v_scale` · mgoin · closed 2 months ago · 0 comments
#24 · When I use autofp8 to quantize the qwen32b model and test it, the accuracy drops significantly. · zhangfzR · closed 3 months ago · 3 comments
#23 · Can AutoFP8 quantized MOE model inferenced with vlllm? (kv_cache fp8 or kv_cache+weights fp8) · IEI-mjx · closed 2 months ago · 3 comments
#22 · Add automatic batching · mgoin · opened 3 months ago · 0 comments
#21 · Update README.md · mgoin · closed 3 months ago · 0 comments
#20 · Use `torch.inference_mode()` for lower memory usage during calibration · mgoin · closed 3 months ago · 0 comments
#19 · Memory requirements for long sequences · DreamGenX · closed 3 months ago · 2 comments
#18 · Bugfix for bias cloning · mgoin · closed 3 months ago · 0 comments
#17 · Support calibrating kv cache scales · mgoin · closed 3 months ago · 0 comments
#16 · Improve memory usage by properly cleaning up weights as quantized · mgoin · closed 3 months ago · 0 comments
#15 · CUDA out of memory. Tried to allocate 462.00 MiB. GPU · liuzhenghua · closed 3 months ago · 6 comments
#14 · Perform more aggressive cleanup during weight quantization and add tqdm · mgoin · closed 3 months ago · 0 comments
#13 · Quantization of Mixtral 8x22B · nickandbro · closed 3 months ago · 5 comments
#12 · Fix passing torch_dtype and device_map via model_init_kwargs · tdoublep · closed 3 months ago · 0 comments
#11 · Change act_scale -> input_scale · mgoin · closed 3 months ago · 0 comments
#10 · FP8 KV cache support · HaiShaw · closed 3 months ago · 9 comments
#9 · Fix numel()=0 · comaniac · closed 4 months ago · 1 comment
#8 · Fix fp8_gemm on H100 · comaniac · closed 4 months ago · 0 comments
#7 · Add `ignore_patterns` arg for ignoring layers · mgoin · closed 4 months ago · 0 comments
#6 · Tests whether FP8 computation is enabled correctly · blacker521 · closed 4 months ago · 1 comment
#5 · Tests whether FP8 computation is enabled correctly · blacker521 · closed 4 months ago · 0 comments
#4 · Refactor into package · mgoin · closed 4 months ago · 0 comments
#3 · fix vram leak in calibration · AnyISalIn · closed 4 months ago · 0 comments
#2 · Bitblas supports FP8 Inference as well · LeiWang1999 · opened 5 months ago · 3 comments
#1 · How to inference · WuNein · closed 4 months ago · 3 comments
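Many of the issues above (calibration memory, kv-cache scales, ignored layers, empty or degraded generations after quantization) refer to the same basic AutoFP8 quantization flow. For orientation, the following is a minimal sketch of that flow, assuming the `auto_fp8` API as documented in the repository README (`AutoFP8ForCausalLM`, `BaseQuantizeConfig`); the model name and exact argument set are illustrative and may differ between releases.

```python
# Sketch of the AutoFP8 quantization workflow (based on the repo README; not authoritative).
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

# A small calibration set is tokenized and used to compute static activation scales.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",        # "dynamic" skips calibration entirely
    ignore_patterns=["re:.*lm_head"],  # leave selected layers unquantized (see issue #7)
)

# Load the model, quantize weights (and activation scales) to FP8, and save the checkpoint.
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

The saved checkpoint is intended to be loaded for inference with a serving engine that understands FP8 checkpoints, such as vLLM, which is the deployment path referenced in issues like #23 and #1.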