Open sriyachakravarthy opened 2 months ago
Hi @sriyachakravarthy, would you mind providing a script to reproduce?
You may also want to check out https://github.com/LeiWang1999/vllm-bitblas/tree/bitblas-intg with:
# VllmRunner comes from the test utilities on the vllm-bitblas branch
import torch
from vllm import SamplingParams

with VllmRunner(
    "BitBLASModel/open_llama_3b_1.58bits_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=False,
) as bitnet_model:
    # batch of identical prompts: in_seq_len repetitions of "a "
    prompt = "a " * in_seq_len
    prompts = [prompt] * batch_size
    sampling_params = SamplingParams(max_tokens=out_seq_len)

    torch.cuda.profiler.start()
    bitnet_outputs = bitnet_model.generate(
        prompts, sampling_params=sampling_params
    )
    torch.cuda.profiler.stop()
This is much faster than the naive integration implementation.
Sure, will do
Hi @sriyachakravarthy, would you mind providing a script to reproduce?
Sure, here is the script.
%pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor
!lm_eval --model hf --model_args pretrained=BitBLASModel/open_llama_3b_1.58bits_bitblas --tasks hellaswag --device cuda:0 --batch_size 8
And when I try to use the instructions from the model card (https://huggingface.co/1bitLLM/bitnet_b1_58-3B), I get the following:

$ python3 eval_ppl.py --hf_path 1bitLLM/bitnet_b1_58-3B --seqlen 2048
Traceback (most recent call last):
  File "eval_ppl.py", line 7, in <module>
    from modeling_bitnet import BitnetForCausalLM
  File "/home/sriyar/bitnet/modeling_bitnet.py", line 51, in <module>
    from .configuration_bitnet import BitnetConfig
ImportError: attempted relative import with no known parent package
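For what it's worth, that ImportError comes from running a file that uses a relative import (`from .configuration_bitnet import ...`) directly as a script, so Python has no parent package to resolve the `.` against. A minimal, self-contained reproduction of the failure mode (file names here are stand-ins, not the actual repo files):

```python
import os
import subprocess
import sys
import tempfile

# Reproduce the failure: a relative import in a module executed directly
# as a script raises ImportError, while an absolute import of the same
# sibling module works because the script's directory is on sys.path.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "configuration_demo.py"), "w") as f:
        f.write("DemoConfig = object\n")
    with open(os.path.join(d, "modeling_rel.py"), "w") as f:
        f.write("from .configuration_demo import DemoConfig\n")
    with open(os.path.join(d, "modeling_abs.py"), "w") as f:
        f.write("from configuration_demo import DemoConfig\nprint('ok')\n")

    rel = subprocess.run([sys.executable, "modeling_rel.py"],
                         cwd=d, capture_output=True, text=True)
    absolute = subprocess.run([sys.executable, "modeling_abs.py"],
                              cwd=d, capture_output=True, text=True)

    print("relative import exit code:", rel.returncode)        # non-zero
    print("absolute import output:", absolute.stdout.strip())  # ok
```

So one workaround (an assumption on my part, not something the model card states) is to edit `modeling_bitnet.py` to use an absolute import, or to run the evaluation in a context where those files are importable as a package.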
That code was not provided by BitBLAS; check out the integration under https://github.com/microsoft/BitBLAS/tree/main/integration/BitNet instead.
BTW, here are some benchmark numbers for the 1.58-bit vLLM integration:
Tokens per second (tok/s):

| model | framework | BS16IN32OUT128 | BS1IN512OUT1024 | BS32IN32OUT128 |
| -- | -- | -- | -- | -- |
| openllama-3b-1.58bits | pytorch | 106.83 | 49.34 | 209.03 |
| openllama-3b-1.58bits | pytorch-bitblas | 240.33 | 103.09 | 493.31 |
| openllama-3b-1.58bits | vllm-bitblas | 379.25 | 117.43 | 752.55 |
| openllama-3b-1.58bits | vllm-bitblas-cuda-graph | 2543.58 | 1621.08 | 2731.79 |
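For context on how to read the columns (BS = batch size, IN/OUT = input/output sequence length), tok/s is total generated tokens divided by wall-clock time. A minimal sketch of that computation; the timing value below is made up for illustration:

```python
def tokens_per_second(batch_size: int, out_seq_len: int, elapsed_s: float) -> float:
    # total tokens generated across the whole batch / wall-clock seconds
    return batch_size * out_seq_len / elapsed_s

# e.g. a BS16, OUT128 run generates 16 * 128 = 2048 tokens; if it took
# exactly one second, throughput would be 2048 tok/s (hypothetical timing)
print(tokens_per_second(16, 128, 1.0))  # → 2048.0
```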
Hi! I tried evaluating 1bitLLM/bitnet_b1_58-3B from Hugging Face and I am getting the error: ValueError: Tokenizer class BitnetTokenizer does not exist or is not currently imported. Kindly help!
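That ValueError usually means the tokenizer class is defined in custom code shipped with the model repo rather than in transformers itself. If the repo registers its classes for auto-loading, passing `trust_remote_code=True` typically resolves it; this is a guess at a fix, untested against this exact checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "1bitLLM/bitnet_b1_58-3B"

# trust_remote_code lets transformers import custom classes (such as
# BitnetTokenizer) from the model repository instead of its own registry.
# If the repo does not register them for auto-loading, the modeling/tokenizer
# files from the model card's instructions must be importable locally instead.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```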