neuralmagic / AutoFP8
Apache License 2.0 · 150 stars · 17 forks
Issues
#45 · Add support for dbrx moe · charlifu · opened 1 week ago · 0 comments
#44 · How to Quantize KV Cache for the deepseek-v2 Model · fengyang95 · opened 3 weeks ago · 0 comments
#43 · Floating point exception (core dumped) · fengyang95 · closed 3 weeks ago · 0 comments
#42 · Quantize DeepSeek-Coder-V2-Instruct to W8A8(INT8)? · halexan · opened 1 month ago · 0 comments
#41 · User tokens masking? · bilunsun · closed 1 month ago · 2 comments
#40 · Differences in Dynamic Quantization Speedup for Varying SFT Tasks on Qwen2-72b-Instruct Models · IPostYellow · opened 1 month ago · 0 comments
#39 · Qwen2-72B-Instruct-FP8 generate bad output using cutlass but be fine with torch._scaled_mm · Juelianqvq · closed 1 month ago · 0 comments
#38 · fp8 vs bf16 performance problem · AllenDou · closed 1 month ago · 5 comments
#37 · LLaMA3 report · Eric-mingjie · closed 1 month ago · 0 comments
#36 · CUDA out of memory when quantizing llama3.1-405b on 80GiBx8 H100 instance · sfc-gh-zhwang · opened 1 month ago · 2 comments
#35 · [Qusetion] Calibration datasets · cyc00518 · closed 1 month ago · 2 comments
#34 · Support for Vision models · Syst3m1cAn0maly · opened 2 months ago · 0 comments
#33 · Switch backend to use llm-compressor · mgoin · opened 2 months ago · 0 comments
#32 · [Feature] ADD Support for DeepSeek-V2-Chat · Xu-Chen · closed 2 months ago · 1 comment
#31 · Runtime Error: The weights trying to be saved contained shared tensors. · IEI-mjx · opened 2 months ago · 3 comments
#30 · Integration with Hugging Face transformers library · SunMarc · opened 2 months ago · 3 comments
#29 · DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 · Syst3m1cAn0maly · closed 2 months ago · 10 comments
#28 · error: RuntimeError: The weights trying to be saved contained shared tensors · AlphaINF · opened 2 months ago · 0 comments
#27 · I get empty response after Quantizing LLama2 70B using AutoFP8 with calibration · e3oroush · closed 2 months ago · 9 comments
#26 · E5M2 or mix format trial · zitgit · closed 2 months ago · 2 comments
#25 · Separate `kv_scale` into `k_scale` and `v_scale` · mgoin · closed 2 months ago · 0 comments
#24 · When I use autofp8 to quantize the qwen32b model and test it, the accuracy drops significantly. · zhangfzR · closed 3 months ago · 3 comments
#23 · Can AutoFP8 quantized MOE model inferenced with vlllm? (kv_cache fp8 or kv_cache+weights fp8) · IEI-mjx · closed 2 months ago · 3 comments
#22 · Add automatic batching · mgoin · opened 3 months ago · 0 comments
#21 · Update README.md · mgoin · closed 3 months ago · 0 comments
#20 · Use `torch.inference_mode()` for lower memory usage during calibration · mgoin · closed 3 months ago · 0 comments
#19 · Memory requirements for long sequences · DreamGenX · closed 3 months ago · 2 comments
#18 · Bugfix for bias cloning · mgoin · closed 3 months ago · 0 comments
#17 · Support calibrating kv cache scales · mgoin · closed 3 months ago · 0 comments
#16 · Improve memory usage by properly cleaning up weights as quantized · mgoin · closed 3 months ago · 0 comments
#15 · CUDA out of memory. Tried to allocate 462.00 MiB. GPU · liuzhenghua · closed 3 months ago · 6 comments
#14 · Perform more aggressive cleanup during weight quantization and add tqdm · mgoin · closed 3 months ago · 0 comments
#13 · Quantization of Mixtral 8x22B · nickandbro · closed 3 months ago · 5 comments
#12 · Fix passing torch_dtype and device_map via model_init_kwargs · tdoublep · closed 3 months ago · 0 comments
#11 · Change act_scale -> input_scale · mgoin · closed 3 months ago · 0 comments
#10 · FP8 KV cache support · HaiShaw · closed 3 months ago · 9 comments
#9 · Fix numel()=0 · comaniac · closed 4 months ago · 1 comment
#8 · Fix fp8_gemm on H100 · comaniac · closed 4 months ago · 0 comments
#7 · Add `ignore_patterns` arg for ignoring layers · mgoin · closed 4 months ago · 0 comments
#6 · Tests whether FP8 computation is enabled correctly · blacker521 · closed 4 months ago · 1 comment
#5 · Tests whether FP8 computation is enabled correctly · blacker521 · closed 4 months ago · 0 comments
#4 · Refactor into package · mgoin · closed 4 months ago · 0 comments
#3 · fix vram leak in calibration · AnyISalIn · closed 4 months ago · 0 comments
#2 · Bitblas supports FP8 Inference as well · LeiWang1999 · opened 5 months ago · 3 comments
#1 · How to inference · WuNein · closed 4 months ago · 3 comments
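Many of the issues above (calibration memory, kv-cache scales, ignored layers, empty or degraded generations after quantization) refer to the same basic AutoFP8 quantization flow. For orientation, the following is a minimal sketch of that flow, assuming the `auto_fp8` API as documented in the repository README (`AutoFP8ForCausalLM`, `BaseQuantizeConfig`); the model name and exact argument set are illustrative and may differ between releases.

```python
# Sketch of the AutoFP8 quantization workflow (based on the repo README; not authoritative).
from transformers import AutoTokenizer
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "meta-llama/Meta-Llama-3-8B-Instruct"  # example model
quantized_model_dir = "Meta-Llama-3-8B-Instruct-FP8"

# A small calibration set is tokenized and used to compute static activation scales.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = ["auto_fp8 is an easy-to-use model quantization library"]
examples = tokenizer(examples, return_tensors="pt").to("cuda")

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="static",        # "dynamic" skips calibration entirely
    ignore_patterns=["re:.*lm_head"],  # leave selected layers unquantized (see issue #7)
)

# Load the model, quantize weights (and activation scales) to FP8, and save the checkpoint.
model = AutoFP8ForCausalLM.from_pretrained(pretrained_model_dir, quantize_config=quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```

The saved checkpoint is intended to be loaded for inference with a serving engine that understands FP8 checkpoints, such as vLLM, which is the deployment path referenced in issues like #23 and #1.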