iwaitu opened 4 months ago
After I edited builder.py line 1607 like this:
model = AutoModelForCausalLM.from_pretrained(self.model_name_or_path, use_auth_token=True, trust_remote_code=True, torch_dtype=torch.float16, **extra_kwargs)
it can then continue, but it still crashes with no error message at line 1643:
self.make_lm_head(module)
This is the log:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:35<00:00, 23.83s/it]
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
Reading decoder layer 3
Reading decoder layer 4
Reading decoder layer 5
Reading decoder layer 6
Reading decoder layer 7
Reading decoder layer 8
Reading decoder layer 9
Reading decoder layer 10
Reading decoder layer 11
Reading decoder layer 12
Reading decoder layer 13
Reading decoder layer 14
Reading decoder layer 15
Reading decoder layer 16
Reading decoder layer 17
Reading decoder layer 18
Reading decoder layer 19
Reading decoder layer 20
Reading decoder layer 21
Reading decoder layer 22
Reading decoder layer 23
Reading decoder layer 24
Reading decoder layer 25
Reading decoder layer 26
Reading decoder layer 27
Reading decoder layer 28
Reading decoder layer 29
Reading decoder layer 30
Reading decoder layer 31
Reading decoder layer 32
Reading decoder layer 33
Reading decoder layer 34
Reading decoder layer 35
Reading decoder layer 36
Reading decoder layer 37
Reading decoder layer 38
Reading decoder layer 39
Reading decoder layer 40
Reading decoder layer 41
Reading final norm
Reading LM head
The Killed message indicates you're running out of memory. Loading the PyTorch model with torch_dtype=torch.float16 can definitely help reduce memory usage, but the Gemma 2 embeddings are quite large and running out of memory is still a common issue. Do you have a larger machine you can use?
I am working on an improved method to load these large models using mmap instead, to avoid out-of-memory errors such as this one and the ones you've faced above. With mmap, the model builder can then adapt to the machine's memory constraints.
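As a rough illustration of the idea (not the actual model builder change), safetensors checkpoints can be memory-mapped and read one tensor at a time, so only the tensor currently being processed has to sit in RAM; the file path and per-tensor handling below are placeholders:
# Hedged sketch: iterate over a memory-mapped safetensors file instead of
# materializing the whole checkpoint in RAM. "model.safetensors" is a placeholder path.
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)  # read lazily from the mmap'd file
        # ... convert/write this tensor, then let it go out of scope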
Thanks for your answer. My computer has 64 GB of RAM, so I am trying this on another workstation with 256 GB of RAM and an Nvidia A100 now. I hope it works.
It worked for me. It took more than 200 GB of RAM to convert Gemma 2 27B to fp16 ONNX for GenAI.
I tried to run the model, but it failed.
Another problem: model.onnx.data is larger than 52 GB. Is there any way I can split this file into smaller pieces, around 10 GB per file? @kunal-vaishnavi
It worked for me. It took more than 200 GB of RAM to convert Gemma 2 27B to fp16 ONNX for GenAI.
Great to hear that it worked! The mmap work will avoid needing to spend 200 GB of RAM in the future.
I tried to run the model, but it failed.
I added the gemma2 model type in ONNX Runtime GenAI as part of this PR. The change will be part of the next package release. In the meantime, you can either change the model type from gemma2 to gemma in genai_config.json or you can build ONNX Runtime GenAI from source.
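For example, a small script along these lines can patch the config (assuming the type is stored under the model -> type key in genai_config.json, and using a placeholder path):
import json

# Hedged sketch: switch the model type from "gemma2" to "gemma" in genai_config.json.
# The config path and the "model" -> "type" key location are assumptions; verify them
# against the config generated by the model builder.
config_path = "output27b/genai_config.json"
with open(config_path) as f:
    config = json.load(f)

config["model"]["type"] = "gemma"

with open(config_path, "w") as f:
    json.dump(config, f, indent=4)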
Another problem: model.onnx.data is larger than 52 GB. Is there any way I can split this file into smaller pieces, around 10 GB per file?
You can save the weights as one file, or save each weight that is larger than a certain size threshold in its own file, using the ONNX load_model and save_model methods. Here is an example for the latter scenario.
import onnx

# Load the model together with its external data so all tensors are in memory
model = onnx.load_model("/path/to/model.onnx", load_external_data=True)

# Save every tensor larger than size_threshold (in bytes) to its own external data file
onnx.save_model(
    model,
    "/path/to/model_new.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=False,
    size_threshold=1024,
    convert_attribute=False,
)
Note that if you change the filename when saving the ONNX model, you will need to update the filename in genai_config.json as well.
I found a very strange phenomenon. I converted gemma2-9b and gemma2-27b into ONNX using the same instructions, and there were no errors during the process. However, when testing with the following code, gemma2-9b-cuda-onnx works fine, but the result generated by gemma2-27b-cuda-onnx is empty.
python3 builder.py -m shenzhi-wang/Gemma-2-27B-Chinese-Chat -o output27b -p fp16 -e cuda -c temp --extra_options filename=gemma-2-27b-cuda-fp16.onnx
python3 builder.py -m shenzhi-wang/Gemma-2-9B-Chinese-Chat -o output9b -p fp16 -e cuda -c temp --extra_options filename=gemma-2-9b-cuda-fp16.onnx
This is the test code:
import onnxruntime_genai as og

model = og.Model('cuda/gemma-2-27b')
#model = og.Model('cuda/gemma-2-9b')
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 2048

chat_template = '<|user|>\n{input} <|end|>\n<|assistant|>'

text = input("Input: ")
if not text:
    print("Error, input cannot be empty")
    exit()

prompt = f'{chat_template.format(input=text)}'
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
params.input_ids = input_tokens
generator = og.Generator(model, params)

print("Output: ", end='', flush=True)

try:
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()

        new_token = generator.get_next_tokens()[0]
        print(tokenizer_stream.decode(new_token), end='', flush=True)
except KeyboardInterrupt:
    print(" --control+c pressed, aborting generation--")

print()
del generator
I debugged this code and found that new_token is always 0 when I use cuda/gemma-2-27b.
@kunal-vaishnavi
I also tried using C# to run inference with the converted model. gemma-2-9b works fine, but gemma-2-27b, just like in Python, returns empty results. So I re-ran the command:
python3 builder.py -m google/gemma-2-27b-it -o output -p fp16 -e cuda -c temp --extra_options filename=gemma-2-27b-cuda-fp16.onnx
for the conversion. I had suspected it might have been an issue with the fine-tuned model. However, after re-converting the original Google model, I found the problem still persists. It seems there is a bug in builder.py when converting large models.
I'm able to reproduce this behavior with Gemma-2 27B. I took a quick glance and the ONNX models produced by the model builder look fine to me. I'll investigate more closely and get back to you.
Another thing to consider is that gemma/gemma2 uses tied weights, meaning the model builder unnecessarily duplicates these weights, with the embedding layer staying fp16 and lm_head converted to q4. Ideally, both should have the same dtype (e.g., fp16), with lm_head being the transpose of the embedding layer.
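For reference, here is a minimal sketch of what weight tying looks like on the PyTorch side (illustrative sizes, not the builder's code): the lm_head shares the embedding table's storage, so exporting both as separate initializers duplicates that memory.
import torch.nn as nn

# Minimal sketch of tied weights with illustrative sizes (not Gemma 2's exact config).
vocab_size, hidden_size = 256000, 4608
embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights: both modules now reference the same (vocab_size, hidden_size) tensor.
# In an exported ONNX graph the lm_head MatMul consumes this matrix transposed, which is
# why storing it again as a separate initializer wastes memory.
lm_head.weight = embedding.weight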
The behavior appears to be happening because logit soft-capping is not used in the GroupQueryAttention ops. The model is producing NaNs as output, and this causes ONNX Runtime GenAI to return zeros as the next tokens. Because the GroupQueryAttention ops use flash attention, and the original Gemma-2 team "observed very minor differences when soft-capping is removed during inference" when using flash attention, logit soft-capping was not added to GroupQueryAttention.
Since NaNs are appearing in the model's output, we will add logit soft-capping to GroupQueryAttention in ONNX Runtime. Once added, I will let you know.
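For context, logit soft-capping just squashes the attention scores through a scaled tanh before the softmax, which keeps them bounded (and, per the comment above, avoids the NaNs seen here). A minimal sketch; the cap value of 50.0 is assumed to match Gemma 2's attention soft-cap, so verify it against the model config:
import torch

def soft_cap(attn_scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    # Squash the raw attention scores into (-cap, cap) before the softmax.
    # cap=50.0 is assumed to be Gemma 2's attn_logit_softcapping; check config.json.
    return cap * torch.tanh(attn_scores / cap)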
I’m really looking forward to it.
Logit soft-capping has now been added to GroupQueryAttention in ONNX Runtime in this PR. Here's the PR for adding the attribute to GroupQueryAttention in the model builder. Once merged, you will need to create the ONNX models again so that they have the new attribute. You will also need to install a nightly version of ONNX Runtime, as the change is not in ONNX Runtime v1.19.2.
When it runs to the line tensor = numpy_helper.from_array(np_data), it just crashes with no error. I added a try-except block to try to catch an error, but I got nothing.
@kunal-vaishnavi