tloen / llama-int8
Quantized inference code for LLaMA models
GNU General Public License v3.0 · 1.05k stars · 105 forks
Issues
#21 · Does this support llama2 as well? · YaoJiayi · opened 9 months ago · 0 comments
#20 · Producing NaN tensors · Bryan-Lavender · opened 1 year ago · 0 comments
#19 · CUDA out of memory · fengyh3 · closed 1 year ago · 0 comments
#18 · 65B on multiple GPUs: CUDA out of memory with 4 × RTX A5000 (24GB), 96GB in total · scampion · opened 1 year ago · 3 comments
#17 · LLaMA 13B works on a single RTX 4080 16GB · kcchu · opened 1 year ago · 1 comment
#16 · Further detail needed: installing bitsandbytes from source · chrisbward · opened 1 year ago · 1 comment
#15 · bitsandbytes: NameError: name 'cuda_setup' is not defined. Did you mean: 'CUDASetup'? · kskim-phd · closed 1 year ago · 1 comment
#14 · feat: webui for llama-int8 · soulteary · opened 1 year ago · 0 comments
#13 · Assign the parameters of each layer to multiple CUDA devices automatically · lipan6461188 · opened 1 year ago · 0 comments
#12 · Getting an error on generation on Windows · elephantpanda · opened 1 year ago · 4 comments
#11 · Can 65B run on 4 × 32GB GPUs? · zhongtao93 · opened 1 year ago · 0 comments
#10 · Is it possible to save the smaller weights so they don't have to be converted each time? · spullara · opened 1 year ago · 0 comments
#9 · Systematic comparison of original models to int8 inferencing · innokean · opened 1 year ago · 1 comment
#8 · Error loading 65B on a single A100 80GB with about 96GB of memory · dpyneo · opened 1 year ago · 3 comments
#7 · RTX 4090: CUDA out of memory · WuNein · closed 1 year ago · 3 comments
#6 · Any chance to share quantized int8 7B and 13B models? · progressionnetwork · opened 1 year ago · 0 comments
#5 · Can 8GB run the smallest LLaMA model? · lucasjinreal · opened 1 year ago · 4 comments
#4 · Tracking issue for Mac support · pannous · opened 1 year ago · 3 comments
#3 · Reduce RAM consumption on loading · pamparamm · closed 1 year ago · 0 comments
#2 · 13B: load is successful on T4, but forward pass fails · deep-diver · opened 1 year ago · 0 comments
#1 · On branch add_save_load · calhounpaul · closed 1 year ago · 3 comments