neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Error converting Mistral to ONNX #2018

Closed meomeomeome closed 4 months ago

meomeomeome commented 7 months ago

Describe the bug
Error converting Mistral to ONNX

Expected behavior

!pip install virtualenv
!virtualenv myenv
!source /content/myenv/bin/activate

!git clone https://github.com/neuralmagic/sparseml
#!pip install sparseml
!pip install -e "sparseml[transformers]"

#!pip uninstall transformers
#!pip install nm-transformers

!python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py OpenBuddy/openbuddy-mistral-7b-v13.1 open-platypus --recipe recipe.yaml --device cuda:0 --precision float16 --save True

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 18.04]: 22
  2. Python version [e.g. 3.7]: 3.11
  3. SparseML version or commit hash [e.g. 0.1.0, f7245c8]:
  4. ML framework version(s) [e.g. torch 1.7.1]:
  5. Other Python package versions [e.g. SparseZoo, DeepSparse, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce
Exact steps to reproduce the behavior:

Errors

!python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
!cp deployment/model.onnx deployment/model-orig.onnx

Traceback (most recent call last):
  File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 542, in <module>
    main()
  File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 529, in main
    export(
  File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 507, in export
    export_transformer_to_onnx(
  File "/content/sparseml/src/sparseml/transformers/sparsification/obcq/export.py", line 345, in export_transformer_to_onnx
    export_onnx(
  File "/content/sparseml/src/sparseml/pytorch/utils/exporter.py", line 488, in export_onnx
    out = tensors_module_forward(sample_batch, module, check_feat_lab_inp=False)
  File "/content/sparseml/src/sparseml/pytorch/utils/helpers.py", line 414, in tensors_module_forward
    return module(**tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 1083, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 970, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 659, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/mistral/modeling_mistral.py", line 299, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/ao/quantization/stubs.py", line 63, in forward
    X = self.module(X)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/ao/nn/qat/modules/linear.py", line 41, in forward
    return F.linear(input, self.weight_fake_quant(self.weight), self.bias)
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
cp: cannot stat 'deployment/model.onnx': No such file or directory
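
For context, the failure comes from running a float16 (Half) Linear module on the CPU during the ONNX trace. Below is a minimal plain-PyTorch sketch of the same class of error (not the SparseML exporter itself; whether it actually raises depends on the torch build), together with the usual workaround of casting back to float32:

import torch

# fp16 Linear on CPU, analogous to the q_proj call in the traceback above
lin = torch.nn.Linear(8, 8).half()
x = torch.randn(1, 8, dtype=torch.float16)

try:
    lin(x)  # may raise: "addmm_impl_cpu_" not implemented for 'Half'
except RuntimeError as err:
    print(err)

# casting the module and inputs back to float32 avoids the CPU limitation
lin.float()
print(lin(x.float()).dtype)  # torch.float32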
meomeomeome commented 7 months ago

I also want to add that after quantization and optimization, the model remains the same size, although the recipe specifies 8-bit quantization.

dbogunowicz commented 7 months ago

Hey @meomeomeome, regarding your export issue, please use the following entrypoint for export: sparseml.export --task text-generation --model_path obcq_deployment

Regarding the model size issue, could you paste here an artifact that illustrates the comparison? Perhaps some stdout from du -sh * or tree?
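
For the size comparison, a rough Python equivalent of du -sh (the directory names below are assumptions) would be:

import os

# sum up file sizes under a directory, roughly what `du -sh` reports
def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 ** 3)

# assumed locations of the OBCQ output and the exported deployment folder
for d in ("obcq_deployment", "deployment"):
    print(f"{d}: {dir_size_gb(d):.2f} GB")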

meomeomeome commented 7 months ago

sparseml.export --task text-generation --model_path obcq_deployment

In your instructions at https://github.com/neuralmagic/sparseml/tree/main/src/sparseml/transformers/sparsification/obcq

Model preparation is done with this command: python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py HuggingFaceH4/zephyr-7b-beta open_platypus --recipe recipe.yaml --precision float16 --save True

That is, we load the model in float16 format.

Next is the conversion script: python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment, which is unable to perform half-precision operations on the CPU. I studied sparseml/src/sparseml/transformers/sparsification/obcq/export.py and src/sparseml/pytorch/utils/exporter.py from the library; there is an explicit loading of the model onto the CPU.

Does your suggestion

sparseml.export --task text-generation --model_path obcq_deployment

solve the problem of exporting a float16 model to ONNX? Which library files are used for this, and how is the incompatibility of CPU operations with float16 handled?

Regarding the size of the model: from my experience with TinyLlama, I realized that the final reduction of the model happens only after the complete conversion to ONNX in the deployment folder.

meomeomeome commented 7 months ago

sparseml.export --task text-generation --model_path obcq_deployment does not have a --model_path option. Also, I can't run the ONNX conversion to the end, because the conversion process gets killed after consuming 83 GB of memory, while the model's bin files are only 15 GB.

dbogunowicz commented 7 months ago

Let me take a look; I will come back to you shortly.

dbogunowicz commented 7 months ago

Hey @meomeomeome

Short update from my side: I tried to recreate your problem locally.

  1. I generated your obcq_deployment directory
  2. Exported the model using sparseml.export obcq_deployment --trust_remote_code --sequence_length 64 --task text-generation. I confirm that the export takes a prohibitively large amount of CPU memory. However, by specifying the --sequence_length {int} argument, you can potentially reduce your peak memory consumption. Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model. This is a big issue and something that we are currently working on.
  3. I was also able to reproduce the export error in obcq/export.py (RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'). I'm not sure why you are seeing this error on that pathway. While we are looking into the issues, please note that this pathway will over time be deprecated in favor of sparseml.export.
meomeomeome commented 7 months ago

> Hey @meomeomeome
>
> Short update from my side: I tried to recreate your problem locally.
>
>   1. I generated your obcq_deployment directory
>   2. Exported the model using sparseml.export obcq_deployment --trust_remote_code --sequence_length 64 --task text-generation. I confirm that the export takes a prohibitively large amount of CPU memory. However, by specifying the --sequence_length {int} argument, you can potentially reduce your peak memory consumption. Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model. This is a big issue and something that we are currently working on.
>   3. I was also able to reproduce the export error in obcq/export.py (RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'). I'm not sure why you are seeing this error on that pathway. While we are looking into the issues, please note that this pathway will over time be deprecated in favor of sparseml.export.

> Setting it to something smaller like 32 or 64 should work, but will naturally limit the capabilities of your model

What do you mean? Will this slow down the export process, or will the model lose quality after exporting to ONNX? P.S. The base model's context window is 4096.

dbogunowicz commented 7 months ago

When running in our DeepSparse pipeline, you will not be able to generate more than, e.g., 64 - num_tokens(prompt) tokens in a single inference. This will, however, reduce peak memory consumption as well as accelerate the export process.
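
To make that budget concrete, here is a small sketch; the tokenizer checkpoint and the sequence length of 64 are assumptions for illustration only:

from transformers import AutoTokenizer

# exported sequence length assumed for this example
SEQUENCE_LENGTH = 64
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompt = "How to make banana bread?"
prompt_tokens = len(tokenizer(prompt)["input_ids"])
# tokens left for generation within a single inference
print(f"prompt: {prompt_tokens} tokens, roughly {SEQUENCE_LENGTH - prompt_tokens} new tokens available")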

meomeomeome commented 7 months ago

> When running in our DeepSparse pipeline, you will not be able to generate more than, e.g., 64 - num_tokens(prompt) tokens in a single inference. This will, however, reduce peak memory consumption as well as accelerate the export process.

Does this apply only to the pipeline, or to other inference methods as well?

import psutil
import time
# Get memory and CPU information
memory_usage = psutil.virtual_memory()
cpu_frequency = psutil.cpu_freq()

print(f"Total Memory: {memory_usage.total / (1024 ** 3)} GB")
print(f"Memory Used: {memory_usage.used / (1024 ** 3)} GB")

# Get detailed CPU information
cpu_frequency = psutil.cpu_freq(percpu=True)
cpu_count = psutil.cpu_count(logical=False)
cpu_logical_count = psutil.cpu_count(logical=True)
cpu_model = None
with open("/proc/cpuinfo", "r") as f:
    for line in f:
        if "model name" in line:
            cpu_model = line.strip().split(":")[1].strip()
            break

# Print the collected information
print(f"CPU Model: {cpu_model}")
print(f"Physical Cores: {cpu_count}")
print(f"Logical Cores (including hyperthreading): {cpu_logical_count}")

for i, freq in enumerate(cpu_frequency):
    print(f"Core {i}: {freq.current / 1000:.2f} GHz")

print(f"Total CPU Frequency: {psutil.cpu_freq().current / 1000:.2f} GHz")

prompt = "How to make banana bread?"
formatted_prompt =  f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
# Measure memory before inference
memory_before = psutil.virtual_memory().used

start_time = time.time()
output = model(formatted_prompt, max_new_tokens=500).generations[0].text
end_time = time.time()

# Measure memory after inference
memory_after = psutil.virtual_memory().used

print(f"Inference Time: {end_time - start_time} seconds")
print(f"Memory Used During Inference: {(memory_after - memory_before) / (1024 ** 2)} MB")

Result and speed: the model was loaded via from deepsparse import TextGeneration; TinyLlama takes 1.19 GB in memory (converted with sequence_length 128) -- 19 seconds!!

Total Memory: 50.993690490722656 GB
Memory Used: 3.8117218017578125 GB
CPU Model: Intel(R) Xeon(R) CPU @ 2.20GHz
Physical Cores: 4
Logical Cores (including hyperthreading): 8
Core 0: 2.20 GHz
Core 1: 2.20 GHz
Core 2: 2.20 GHz
Core 3: 2.20 GHz
Core 4: 2.20 GHz
Core 5: 2.20 GHz
Core 6: 2.20 GHz
Core 7: 2.20 GHz
Total CPU Frequency: 2.20 GHz
Inference Time: 19.88378143310547 seconds
Memory Used During Inference: 2.390625 MB
Banana bread is a delicious and nutty bread that is easy to make. Here is a recipe for banana bread:

Ingredients:

    1 1/2 cups flour
    1/2 cup sugar
    1/2 cup baking powder
    1/2 cup whole milk
    1/4 cup oil
    1/4 cup eggs
    1/4 cup raisins
    1/4 cup raisin bread crumbs
    1/4 cup pecans
    Salt
    Sugar
    Bread
    Flour
    Water
    Butter
    Oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt
    baking powder
    oil
    eggs
    milk
    flour
    sugar
    bread
    oil
    eggs
    milk
    raisins
    raisin bread crumbs
    pecans
    salt

And I have 2 questions:

Does this apply only to the pipeline, or to other output methods as well (seq_l)? And which method is fastest for inference? (I am interested in loading the model from my disk and from memory.)

dbogunowicz commented 7 months ago

I do not understand the two questions, could you rephrase them, please?

I imagine that if you run the exported post-OBCQ ONNX model in the DeepSparse pipeline (as you do above), setting a small sequence_length at export time may mess up some models. This is because the sequence_length set during export influences the size of the positional embeddings available to the exported model. As a result, you may get unexpected errors. I see that you are getting satisfactory results for your model, so maybe that is not the case for this particular network. @mgoin could you take a look? Is my hypothesis more or less correct?
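
One way to sanity-check which sequence length an export was traced with is to inspect the graph inputs of the resulting ONNX file; a minimal sketch (the path is an assumption):

import onnx

# skip external weight data so the load stays cheap
model = onnx.load("deployment/model.onnx", load_external_data=False)
for inp in model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)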

meomeomeome commented 6 months ago

As I understand it, no one knows how to solve the export problem without limiting the context window with --sequence_length 64. If you leave it the same as in the base model, exporting the model (15 GB in its original size) consumes the entire 83 GB of memory. Does anyone know a methodology for solving this, e.g. via batch size or by distributing the processing into parts?

jeanniefinks commented 5 months ago

@meomeomeome This is a known issue: exporting requires a lot of memory, depending on the sequence_length. We'll be noting this as a known issue in the pending 1.7 product release.

jeanniefinks commented 5 months ago

Hello @meomeomeome, a heads-up that 1.7 recently went out. We hope this addresses the issue you faced. Thank you! Jeannie / Neural Magic