lyogavin / airllm

AirLLM 70B inference with single 4GB GPU

how to increase speed of inference #166

Open Tdrinker opened 2 months ago

Tdrinker commented 2 months ago

Hi, awesome project!

I am experimenting with using "unsloth/Meta-Llama-3.1-405B-Instruct-bnb-4bit" for inference. I am using one A100 GPU with a 16-core CPU. However, inference for a single sentence takes 20+ minutes.

Is there any way to speed it up? Also, is there any way to process multiple text inputs together as a list to speed things up? Something like:

def get_output(input_text):
    # Tokenize the prompt (a single string, so no padding is needed)
    input_tokens = model.tokenizer(input_text,
          return_tensors="pt",
          return_attention_mask=False,
          truncation=True,
          max_length=128,
          padding=False)

    # Generate a few new tokens on the GPU
    generation_output = model.generate(
          input_tokens['input_ids'].cuda(),
          max_new_tokens=5,
          return_dict_in_generate=True)

    # Decode the first (and only) returned sequence
    output = model.tokenizer.decode(generation_output.sequences[0])
    print(output)

get_output([
    '1+1 =',
    # '20/20+19+4 =?',
    # '50%100=',
    # 'derivative of x^2',
])
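
Roughly, the batched variant I am imagining would look like the sketch below. This is untested and assumes the tokenizer can pad a list of prompts and that AirLLM's model.generate accepts a batched input_ids tensor with an attention_mask; get_outputs is just a name I made up for illustration.

def get_outputs(input_texts):
    # Llama tokenizers often have no pad token; one common workaround:
    # model.tokenizer.pad_token = model.tokenizer.eos_token

    # Tokenize a list of prompts; padding so they fit in one tensor
    input_tokens = model.tokenizer(input_texts,
          return_tensors="pt",
          return_attention_mask=True,
          truncation=True,
          max_length=128,
          padding=True)

    # Generate for the whole batch in one call
    generation_output = model.generate(
          input_tokens['input_ids'].cuda(),
          attention_mask=input_tokens['attention_mask'].cuda(),
          max_new_tokens=5,
          return_dict_in_generate=True)

    # Decode every sequence in the batch
    for seq in generation_output.sequences:
        print(model.tokenizer.decode(seq, skip_special_tokens=True))

get_outputs([
    '1+1 =',
    '20/20+19+4 =?',
    '50%100=',
    'derivative of x^2',
])

I am not sure whether AirLLM's layer-by-layer loading allows batched generation at all, which is part of my question.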
OKHand-Zy commented 2 months ago

I have the same question. Do you have any other ideas?