microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

Try to build model gemma2, but failed. #692

Open iwaitu opened 1 month ago

iwaitu commented 1 month ago
(pytorch) root@DESKTOP-RDS3VMA:~/work/gemma2# python3 builder.py -m google/gemma-2-27b-it -o ~/work/gemma2/gemma2onnx -p fp16 -e cuda -c ~/work/gemma2/temp
Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {}
/root/miniconda3/envs/pytorch/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:950: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
GroupQueryAttention (GQA) is used in this model.
/root/miniconda3/envs/pytorch/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 4/4 [01:34<00:00, 23.71s/it]
Reading embedding layer
Killed


When it runs to the line "tensor = numpy_helper.from_array(np_data)", it just crashes with no error. I added a try-except block to try to catch an error, but I got nothing.

@kunal-vaishnavi

iwaitu commented 1 month ago

After I edited builder.py line 1607 like this:

model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path, use_auth_token=True, trust_remote_code=True, torch_dtype=torch.float16, **extra_kwargs)

it could continue further, but it still crashed with no error message at line 1643:

self.make_lm_head(module)

This is the log:

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.83s/it]
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading decoder layer 28
Reading decoder layer 29
Reading decoder layer 30
Reading decoder layer 31
Reading decoder layer 32
Reading decoder layer 33
Reading decoder layer 34
Reading decoder layer 35
Reading decoder layer 36
Reading decoder layer 37
Reading decoder layer 38
Reading decoder layer 39
Reading decoder layer 40
Reading decoder layer 41
Reading final norm
Reading LM head
kunal-vaishnavi commented 1 month ago

The Killed message indicates you're running out of memory. Loading the PyTorch model with torch_dtype=torch.float16 can definitely help reduce memory usage, but the Gemma 2 embeddings are quite large and running out of memory is still a common issue. Do you have a larger machine you can use?

I am working on an improved method to load these large models using mmap instead to avoid out-of-memory errors such as this one and the above ones you've faced. With mmap, the model builder can then adapt to the machine's memory constraints.
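For anyone curious, here is a minimal sketch of what that mmap-style loading could look like, assuming the checkpoint is stored as safetensors shards (the shard name below is a placeholder): safetensors memory-maps the file, so each tensor is read on demand instead of materializing the whole model in RAM at once.

from safetensors import safe_open

# Memory-map one safetensors shard; tensors are only loaded when requested.
with safe_open("model-00001-of-00004.safetensors", framework="np") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)  # numpy array read from the memory-mapped file
        # ... convert this tensor to an ONNX initializer here ...
        del tensor  # release it before moving on to the next weight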

iwaitu commented 1 month ago

Thanks for your answer. My computer has 64GB of RAM; I'm trying this now on another workstation with 256GB RAM and an Nvidia A100. I hope it works.

iwaitu commented 1 month ago

It worked for me. It took more than 200GB of RAM to convert Gemma 2 27B to FP16 ONNX for GenAI.

iwaitu commented 1 month ago

I tried to run this model, but it failed.

Another problem: model.onnx.data is larger than 52GB. Is there any way I can split it into smaller files, say 10GB each? @kunal-vaishnavi

kunal-vaishnavi commented 1 month ago

It worked for me. It took more than 200GB of RAM to convert Gemma 2 27B to FP16 ONNX for GenAI.

Great to hear that it worked! The mmap work will avoid needing to spend 200GB of RAM in the future.

I tried to run this model, but it failed.

I added the gemma2 model type in ONNX Runtime GenAI as part of this PR. The change will be part of the next package release. In the meantime, you can either change the model type from gemma2 to gemma in genai_config.json or you can build ONNX Runtime GenAI from source.
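For example, changing the model type can be a small edit to the generated config. This sketch assumes genai_config.json stores the type under model.type, which is where the model builder writes it, and uses a placeholder output path:

import json

config_path = "output27b/genai_config.json"  # placeholder path to the generated config

with open(config_path) as f:
    config = json.load(f)

# Workaround until the gemma2 model type ships in a release:
config["model"]["type"] = "gemma"

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)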

Another problem: model.onnx.data is larger than 52GB. Is there any way I can split it into smaller files, say 10GB each?

You can save the weights as one file or save each weight that is larger than a certain size threshold in its own file using the ONNX load_model and save_model methods. Here is an example for the latter scenario.

import onnx

# Load the model with its external data, then re-save so that every
# initializer larger than size_threshold (in bytes) gets its own external file.
model = onnx.load_model("/path/to/model.onnx", load_external_data=True)
onnx.save_model(
    model,
    "/path/to/model_new.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=False,  # one external data file per large tensor
    size_threshold=1024,
    convert_attribute=False,
)

Note that if you change the filename when saving the ONNX model, you will need to update the filename in genai_config.json as well.
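Along the same lines, a small sketch of that filename update, assuming the generated config keeps the decoder filename under model.decoder.filename and reusing the placeholder path from above:

import json

config_path = "output27b/genai_config.json"  # placeholder path

with open(config_path) as f:
    config = json.load(f)

# Match the name passed to onnx.save_model above.
config["model"]["decoder"]["filename"] = "model_new.onnx"

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)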

iwaitu commented 1 month ago

I found a very strange phenomenon. I converted gemma2-9b and gemma2-27b into ONNX using the same commands, and there were no errors during the process. However, when testing with the code below, gemma2-9b-cuda-onnx works fine, but the result generated by gemma2-27b-cuda-onnx is empty.

python3 builder.py -m shenzhi-wang/Gemma-2-27B-Chinese-Chat -o output27b -p fp16 -e cuda -c temp --extra_options filename=gemma-2-27b-cuda-fp16.onnx

python3 builder.py -m shenzhi-wang/Gemma-2-9B-Chinese-Chat -o output9b -p fp16 -e cuda -c temp --extra_options filename=gemma-2-9b-cuda-fp16.onnx

This is the test code:

import onnxruntime_genai as og

model = og.Model('cuda/gemma-2-27b')
#model = og.Model('cuda/gemma-2-9b')
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 2048

chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

text = input("Input: ")
if not text:
   print("Error, input cannot be empty")
   exit

prompt = f'{chat_template.format(input=text)}'

input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
params.input_ids = input_tokens
generator = og.Generator(model, params)

print("Output: ", end='', flush=True)

try:
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end='', flush=True)
except KeyboardInterrupt:
    print("  --control+c pressed, aborting generation--")

print()
del generator

I debugged this code and found that new_token was always 0 when I use cuda/gemma-2-27b.

@kunal-vaishnavi

iwaitu commented 1 month ago

I also tried using C# to run inference with the converted model. gemma-2-9b works fine, but gemma-2-27b, as in Python, returns empty results. So I re-ran the command:

python3 builder.py -m google/gemma-2-27b-it -o output -p fp16 -e cuda -c temp --extra_options filename=gemma-2-27b-cuda-fp16.onnx

for conversion. I had suspected the issue might be with the fine-tuned model. However, after re-converting the original Google model, the problem still persists. It seems there is a bug in builder.py when converting large models.

kunal-vaishnavi commented 1 month ago

I'm able to reproduce this behavior with Gemma-2 27B. I took a quick glance and the ONNX models produced by the model builder look fine to me. I'll investigate more closely and get back to you.

xenova commented 1 month ago

Another thing to consider is that gemma/gemma2 uses tied weights, meaning the model builder unnecessarily duplicates these weights, with the embedding layer staying fp16 and the lm_head converted to q4. Ideally, both should have the same dtype (e.g., fp16), with lm_head being the transpose of the embedding layer.
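(For reference, the tying is recorded in the Hugging Face config, so it can be checked without loading the weights; a small sketch:)

from transformers import AutoConfig

# Gemma-2 ties the LM head to the input embedding matrix.
config = AutoConfig.from_pretrained("google/gemma-2-27b-it")
print(config.tie_word_embeddings)  # True -> lm_head shares the embedding weights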

kunal-vaishnavi commented 3 weeks ago

The behavior appears to be happening because logit soft-capping is not used in the GroupQueryAttention ops. The model is producing NaNs as output, and this causes ONNX Runtime GenAI to return zeros as the next tokens. Because the GroupQueryAttention ops use flash attention and the original Gemma-2 team "observed very minor differences when soft-capping is removed during inference" when using flash attention, logit soft-capping was not added to GroupQueryAttention.

Since NaNs are appearing in the model's output, we will add logit soft-capping to GroupQueryAttention in ONNX Runtime. Once added, I will let you know.
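For context, logit soft-capping is a tanh squashing of the raw attention scores before the softmax, which keeps them bounded; a minimal sketch of the formula, with 50.0 as the attn_logit_softcapping value from the Gemma-2 config:

import torch

def soft_cap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Gemma-2 style soft-capping: squash the attention scores into (-cap, cap).
    return cap * torch.tanh(scores / cap)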

iwaitu commented 1 week ago

I’m really looking forward to it.