microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

`models/builder.py` creates empty output directory #675

Closed mlinke-ai closed 1 day ago

mlinke-ai commented 2 weeks ago

I have downloaded the microsoft/phi-3-mini-128k-instruct model from Hugging Face using the huggingface-cli tool.
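Roughly like this (the exact flags may have differed, but it was something along these lines):

huggingface-cli download microsoft/phi-3-mini-128k-instruct --local-dir phi-3-mini-128k-instruct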

When I try to convert the model to ONNX format, the directory specified with the -o flag stays empty.

I use the following command:

python -m onnxruntime_genai.models.builder -m phi-3-mini-128k-instruct-onnx-cpu -i phi-3-mini-128k-instruct -o phi-3-mini-128k-instruct-onnx-cpu -p int4 -e cpu -c cache --extra_options int4_accuracy_level=1 filename=phi3-mini-128k-instruct-onnx-cpu.onnx

kunal-vaishnavi commented 2 weeks ago

https://github.com/microsoft/onnxruntime-genai/blob/4b0dd43064287dfbd6f13f16593d36b5525c0cdc/src/python/py/models/builder.py#L2412-L2430
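Roughly, those lines boil down to the following (a simplified paraphrase, not the exact code; pick_source is just an illustrative name, see the permalink above for the real logic):

import os
from transformers import AutoConfig

def pick_source(model_name, input_path):
    # If -i points to an existing directory, the model is loaded from there
    # and the -m name is only needed as a fallback for downloading from the Hub.
    hf_name = input_path if os.path.isdir(input_path) else model_name
    return AutoConfig.from_pretrained(hf_name, trust_remote_code=True)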

Since you have already downloaded the model to disk and are providing an input path to that directory, you don't need the -m phi-3-mini-128k-instruct-onnx-cpu part in your command. Can you omit that part and try again?
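That is, something like this (same paths as in your original command):

python -m onnxruntime_genai.models.builder -i phi-3-mini-128k-instruct -o phi-3-mini-128k-instruct-onnx-cpu -p int4 -e cpu -c cache --extra_options int4_accuracy_level=1 filename=phi3-mini-128k-instruct-onnx-cpu.onnx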

mlinke-ai commented 2 weeks ago

Omitting the -m phi-3-mini-128k-instruct-onnx-cpu option did not work. The output directory is still empty.

I get some warnings about the missing flash_attn package, but I don't think that is the problem.

natke commented 2 weeks ago

Hi @mlinke-ai, can you share the full output of the command here? And what are the specs of the machine you are running on?

mlinke-ai commented 1 week ago

The complete output of the command is as follows (slightly shortened to remove some clutter):

Valid precision + execution provider combinations are: FP32 CPU, FP32 CUDA, FP16 CUDA, FP16 DML, INT4 CPU, INT4 CUDA, INT4 DML
Extra options: {'int4_accuracy_level': '1', 'filename': 'phi3-mini-128k-instruct-onnx-cpu.onnx'}
C:\Users\mlinke\AppData\Roaming\Python\Python310\site-packages\transformers\models\auto\configuration_auto.py:950: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
GroupQueryAttention (GQA) is used in this model.
C:\Users\mlinke\AppData\Roaming\Python\Python310\site-packages\transformers\models\auto\auto_factory.py:469: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
  warnings.warn(
2024-07-08 13:36:31,595 transformers_modules.phi-3-mini-128k-instruct.modeling_phi3 [WARNING] - `flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
2024-07-08 13:36:31,595 transformers_modules.phi-3-mini-128k-instruct.modeling_phi3 [WARNING] - Current `flash-attenton` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|##########| 2/2 [00:22<00:00, 11.11s/it]
Reading embedding layer
Reading decoder layer 0
Reading decoder layer 1
Reading decoder layer 2
...
Reading decoder layer 29
Reading decoder layer 30
Reading decoder layer 31
Reading final norm
Reading LM head
Saving ONNX model in \\?\C:\Users\mlinke\Documents\ML\repos\phi-3-mini-128k-instruct-onnx-cpu

My machine is a Lenovo ThinkBook 15 G3 ACL with the following specs:

kunal-vaishnavi commented 1 week ago

It looks like you are running out of memory. Do you have a larger machine you can use?

I am working on an improved way to load these large models using mmap, to avoid out-of-memory errors such as this one and the ones you have hit above. With mmap, the model builder can adapt to the machine's memory constraints.
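As a rough illustration of the idea (not necessarily how it will be implemented in the model builder; the shard name below is just an example), safetensors checkpoints can be memory-mapped and read one tensor at a time:

from safetensors import safe_open

# Memory-mapped access: each tensor is paged in from disk on demand instead of
# the whole checkpoint being resident in RAM at once.
shard = "phi-3-mini-128k-instruct/model-00001-of-00002.safetensors"  # example shard name
with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)  # loaded lazily, one tensor at a time
        print(name, tuple(tensor.shape))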

mlinke-ai commented 1 day ago

Sorry for the long delay; company processes are slow sometimes.

Sadly, I don't have access to a machine with more RAM. Looking forward to your mmap implementation.

natke commented 1 day ago

I'll close this issue for now. We will post an announcement when the memory improvements are available.