microsoft / BitBLAS

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
MIT License

Issue with integrating with lm-eval harness #97

Open sriyachakravarthy opened 2 months ago

sriyachakravarthy commented 2 months ago

Hi! I tried evaluating 1bitLLM/bitnet_b1_58-3B from Hugging Face. I am getting the error `ValueError: Tokenizer class BitnetTokenizer does not exist or is not currently imported`. Kindly help!

### Tasks
- [ ] Pull request to the 1bitLLM Repository
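A common cause of `Tokenizer class ... does not exist` is that the tokenizer class is defined by custom code inside the model repository, which transformers only loads when `trust_remote_code=True` is set. As a sketch (not confirmed for this model), lm-eval accepts that flag inside `--model_args`:

```shell
# Hypothetical workaround: let transformers load the BitnetTokenizer class
# shipped inside the model repo by passing trust_remote_code=True.
lm_eval --model hf \
    --model_args pretrained=1bitLLM/bitnet_b1_58-3B,trust_remote_code=True \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
```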
LeiWang1999 commented 2 months ago

Hi @sriyachakravarthy, would you mind providing the scripts to reproduce?

LeiWang1999 commented 2 months ago

You may also want to check out https://github.com/LeiWang1999/vllm-bitblas/tree/bitblas-intg with:

# VllmRunner is a helper from vLLM's test utilities (tests/conftest.py);
# in_seq_len, out_seq_len and batch_size are assumed to be defined by the caller.
import torch

with VllmRunner(
    "BitBLASModel/open_llama_3b_1.58bits_bitblas",
    dtype="half",
    quantization="bitblas",
    enforce_eager=False,
) as bitnet_model:
    # Build a synthetic prompt of in_seq_len tokens.
    prompt = "a " * in_seq_len
    prompts = [prompt] * batch_size

    from vllm import SamplingParams

    sampling_params = SamplingParams(max_tokens=out_seq_len)
    torch.cuda.profiler.start()
    bitnet_outputs = bitnet_model.generate(
        prompts, sampling_params=sampling_params
    )
    torch.cuda.profiler.stop()

This is much faster than the naive integration implementation.

sriyachakravarthy commented 2 months ago

> Would you mind providing the scripts to reproduce?

Sure, here is the script.

%pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@big-refactor

!lm_eval --model hf \
    --model_args pretrained=BitBLASModel/open_llama_3b_1.58bits_bitblas \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8

and when I try to use the instructions from the model card (https://huggingface.co/1bitLLM/bitnet_b1_58-3B), I get the following:

    $ python3 eval_ppl.py --hf_path 1bitLLM/bitnet_b1_58-3B --seqlen 2048
    Traceback (most recent call last):
      File "eval_ppl.py", line 7, in <module>
        from modeling_bitnet import BitnetForCausalLM
      File "/home/sriyar/bitnet/modeling_bitnet.py", line 51, in <module>
        from .configuration_bitnet import BitnetConfig
    ImportError: attempted relative import with no known parent package
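That `ImportError` is a generic Python packaging issue rather than anything model-specific: a file that uses a relative import (`from .configuration_bitnet import ...`) can only be loaded as part of a package, not as a top-level script or module. A minimal reproduction, using hypothetical module names:

```python
import os
import subprocess
import sys
import tempfile

# Build a tiny package mirroring the modeling_bitnet.py situation:
# model.py uses a relative import, so it only works inside a package.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "pkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "config.py"), "w") as f:
    f.write("VALUE = 42\n")
with open(os.path.join(pkg, "model.py"), "w") as f:
    f.write("from .config import VALUE\n")

# Running the module as a plain script fails, just like eval_ppl.py above.
direct = subprocess.run(
    [sys.executable, os.path.join(pkg, "model.py")],
    capture_output=True, text=True,
)
print("relative import" in direct.stderr)  # True

# Importing it through its package works fine.
via_pkg = subprocess.run(
    [sys.executable, "-c", "import pkg.model"],
    cwd=tmp, capture_output=True, text=True,
)
print(via_pkg.returncode)  # 0
```

So the usual fixes are either to run the script from a directory where the bitnet files are importable without the leading dot, or to keep them inside a package and import them with the package prefix.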

sriyachakravarthy commented 2 months ago

> You may also want to check out https://github.com/LeiWang1999/vllm-bitblas/tree/bitblas-intg, which is much faster than the naive integration implementation.

Sure, will do

LeiWang1999 commented 2 months ago

> Sure, here is the script. [...] when I try to use the instructions from the model card (https://huggingface.co/1bitLLM/bitnet_b1_58-3B), I get `ImportError: attempted relative import with no known parent package`

The eval_ppl.py code there was not provided by BitBLAS; check out the integration under https://github.com/microsoft/BitBLAS/tree/main/integration/BitNet instead.

BTW, here are some benchmark numbers for the 1.58-bit vLLM integration:


| model | framework | BS16IN32OUT128 (tok/s) | BS1IN512OUT1024 (tok/s) | B32IN32OUT128 (tok/s) |
| --- | --- | --- | --- | --- |
| openllama-3b-1.58bits | pytorch | 106.83 | 49.34 | 209.03 |
| openllama-3b-1.58bits | pytorch-bitblas | 240.33 | 103.09 | 493.31 |
| openllama-3b-1.58bits | vllm-bitblas | 379.25 | 117.43 | 752.55 |
| openllama-3b-1.58bits | vllm-bitblas-cuda-graph | 2543.58 | 1621.08 | 2731.79 |
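For scale, the throughput ratios over the plain PyTorch baseline can be read straight off the BS16IN32OUT128 column of the table:

```python
# Speedup over the plain PyTorch baseline (BS16IN32OUT128 column,
# all numbers in tokens/second, taken from the table above).
baseline = 106.83
throughput = {
    "pytorch-bitblas": 240.33,
    "vllm-bitblas": 379.25,
    "vllm-bitblas-cuda-graph": 2543.58,
}
speedups = {name: round(toks / baseline, 2) for name, toks in throughput.items()}
print(speedups)
# {'pytorch-bitblas': 2.25, 'vllm-bitblas': 3.55, 'vllm-bitblas-cuda-graph': 23.81}
```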

sriyachakravarthy commented 2 months ago

Thanks! Also, is the transformers Trainer API compatible for fine-tuning the model?

LeiWang1999 commented 2 months ago

@sriyachakravarthy Sorry, I have no experience with that.