vitoplantamura / OnnxStream

Lightweight inference library for ONNX files, written in C++. It can run Stable Diffusion XL 1.0 on a RPI Zero 2 (or in 298MB of RAM) but also Mistral 7B on desktops and servers. ARM, x86, WASM, RISC-V supported. Accelerated by XNNPACK.
https://yolo.vitoplantamura.com/

llama2 7b model #89

Open ha-seungwon opened 6 days ago

ha-seungwon commented 6 days ago

Hello,

Thank you for your interesting project.

Can I use OnnxStream with the Llama 2 7B fp16 model?

vitoplantamura commented 5 days ago

hi,

currently the LLM sample application only supports "TinyLlama-1.1B-Chat-v0.3-fp16" and "Mistral-7B-Instruct-v0.2-fp16".

Vito

ha-seungwon commented 4 days ago

Hello,

So is it not possible to add support for another LLM model?

Thanks

vitoplantamura commented 2 days ago

Since TinyLlama adopts the same architecture and tokenizer as Llama 2, adding Llama 2 support to src/llm.cpp should be fairly simple. It involves exporting the onnx file, running "onnxsim_large_model" on it, and finally running "onnx2txt".
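As a quick sanity check before writing the export script, something like the following (a sketch only, assuming access to the gated meta-llama/Llama-2-7b-hf checkpoint) confirms that Llama 2 loads as the same LlamaForCausalLM class as TinyLlama and shows the config values that determine the KV-cache shapes used in the export:

import torch
from transformers import AutoModelForCausalLM

# Sketch only: meta-llama/Llama-2-7b-hf is a gated repo and needs approved access.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
cfg = model.config
print(type(model).__name__)        # LlamaForCausalLM, same class as TinyLlama
print(cfg.num_hidden_layers)       # 32 decoder layers -> 64 past key/value tensors
print(cfg.num_attention_heads)     # 32 attention heads
print(cfg.hidden_size // cfg.num_attention_heads)  # head_dim = 128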

Vito

ha-seungwon commented 2 days ago

Hello,

I already tried, but I get some errors. Please help me.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch.nn as nn
import onnx

# Load the Llama 2 model
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Define a wrapper model for the Llama 2 ONNX export
class LlamaModel(nn.Module):
    def __init__(self, model):
        super(LlamaModel, self).__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, position_ids, *past_key_values):
        past_key_values = tuple(
            (past_key_values[i], past_key_values[i + 1]) for i in range(0, len(past_key_values), 2)
        )
        outputs = self.model(
            use_cache=True,
            return_dict=True,
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
        )
        pkv = outputs.past_key_values
        # Return logits followed by each past key/value tensor
        return [outputs.logits] + [item for pair in pkv for item in pair]

# Create dummy inputs
with torch.no_grad():
    dummy_input = (
        torch.tensor([[1, 2, 3]], dtype=torch.int64),  # input_ids
        torch.tensor([[1, 1, 1]], dtype=torch.int64),  # attention_mask
        torch.tensor([[0, 1, 2]], dtype=torch.int64)   # position_ids
    )

    # Add past_key_values for all 32 layers (batch_size=1, num_heads=32, past_seq_len=4, head_dim=128)
    for _ in range(32):
        dummy_input += (torch.randn(1, 32, 4, 128, dtype=torch.float16),)  # key
        dummy_input += (torch.randn(1, 32, 4, 128, dtype=torch.float16),)  # value

    # Define input and output names
    input_names = ["input_ids", "attention_mask", "position_ids"] + [f"pkv{i}" for i in range(64)]  # 32 layers * 2 (key, value)
    output_names = ["logits"] + [f"opkv{i}" for i in range(64)]  # 32 layers * 2 (key, value)

    # Export to ONNX
    torch.onnx.export(
        LlamaModel(model),
        dummy_input,
        "./onnx_export_model/model.onnx",
        verbose=False,
        input_names=input_names,
        output_names=output_names,
        opset_version=14,
        do_constant_folding=True,
        export_params=True,
        dynamic_axes={
            "input_ids": {1: "sequence"},
            "attention_mask": {1: "sequence"},
            "position_ids": {1: "sequence"},
            **{f"pkv{i}": {2: "past_seq_len"} for i in range(64)},
        },
    )

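As a quick, independent check of the exported graph (a sketch using the onnx Python package, not one of OnnxStream's tools), the op types can be counted directly from the file:

from collections import Counter
import onnx

# Sketch only: skip loading the external weight data, since a 7B fp16 export
# typically stores its weights outside the 2 GB protobuf file.
m = onnx.load("./onnx_export_model/model.onnx", load_external_data=False)
print(Counter(node.op_type for node in m.graph.node))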

After exporting my model and running "onnxsim_large_model" and then "onnx2txt", I get:

Gather -> 68
Shape -> 37
Add -> 227
Range -> 1
Unsqueeze -> 41
Slice -> 162
Cast -> 136
Equal -> 3
And -> 1
Where -> 2
Expand -> 5
Concat -> 130
Reshape -> 129
ScatterND -> 1
Pow -> 65
ReduceMean -> 65
Sqrt -> 65
Div -> 65
Mul -> 386
MatMul -> 290
Transpose -> 161
Cos -> 1
Sin -> 1
Neg -> 64
Softmax -> 32
Sigmoid -> 32
TOTAL -> 2170

This is the output of onnx2txt for my exported model.

My error is:

[screenshots of the error]

How can I fix it?

vitoplantamura commented 1 day ago

I will try to reproduce the problem and let you know in the next few days.

This problem is typically caused by the implementation in HF Transformers having changed compared to the version I used to generate the TinyLlama onnx file, so the newly exported onnx file comes out different. A quick fix could be to generate the new onnx file with that same version of HF Transformers...
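For example (a sketch only; the exact Transformers version used for the TinyLlama export is not listed in this thread), you could compare the environments before re-exporting:

import torch
import transformers

# Sketch: check the versions used for the new export against the ones used for
# the original TinyLlama onnx file (that exact version is not stated here).
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
# If they differ, re-export in an environment pinned to the older release, e.g.:
#   pip install "transformers==<version used for the TinyLlama export>"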

I'll let you know ASAP,

Vito