Open aleph65 opened 1 year ago
An 80GB GPU might be able to export a 13B model.
For a 30B model, exporting to ONNX could be a little tricky since it will need multiple GPUs, and ONNX Runtime does not have good support for multiple GPU inference right now. Could you try exporting with the cpu device instead of the cuda device?
@tianleiwu Thanks so much for the reply.
I will try the CPU export instead.
1) I heard that O4 optimization requires CUDA / a GPU? 2) You said "ONNX Runtime does not have good support for multiple GPU inference"; does that mean ONNX models cannot run on multiple GPUs at once? 3) What is the best way to optimize a large model like 30B or 40B to get about 100-200x performance gains? Or at least the 17x inference speedup that appears possible with ONNX; what needs to happen to get that?
Once again thank you, and I'll update when I'm done trying the CPU conversion. I will run this:
optimum-cli export onnx --task text-generation-with-past --model WizardLM-30B-Uncensored WizardLM-30B-ONNX/
- I heard that O4 optimization requires CUDA / a GPU?
- You said "ONNX Runtime does not have good support for multiple GPU inference"; does that mean ONNX models cannot run on multiple GPUs at once?
- What is the best way to optimize a large model like 30B or 40B to get about 100-200x performance gains? Or at least the 17x inference speedup that appears possible with ONNX; what needs to happen to get that?
ONNX export can be done on CPU. For inference, a GPU is recommended for latency reasons.
For FP16, you will need to partition the 30B model into multiple sub-models to fit GPU memory. Right now, you can create one inference session per sub-model and write code to call the sub-models one by one. It is not a simple task, considering you might also need to integrate beam search optimization etc.
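The "one inference session per sub-model" approach can be sketched roughly as follows. This is only an illustration, not a tested recipe for a 30B model: it assumes the model has already been split into sub-model files (how to split it is not covered here), that each part's output names match the next part's input names, and that `onnxruntime-gpu` is installed. The helper names `make_session` and `run_chain` are hypothetical. It also ignores the KV cache and beam search, and it passes tensors between sessions through host memory.

```python
from typing import Any, Dict, Sequence


def make_session(model_path: str, device_id: int):
    """Create a session for one sub-model, pinned to one GPU (hypothetical helper)."""
    import onnxruntime as ort  # imported lazily; only needed when real sessions are built

    providers = [
        ("CUDAExecutionProvider", {"device_id": device_id}),
        "CPUExecutionProvider",  # fallback if CUDA is unavailable
    ]
    return ort.InferenceSession(model_path, providers=providers)


def run_chain(sessions: Sequence[Any], feeds: Dict[str, Any]) -> Dict[str, Any]:
    """Call sub-models one by one, routing outputs to later inputs by tensor name.

    Assumption: every input of a sub-model is either in the initial feeds
    or was produced by an earlier sub-model under the same name.
    """
    tensors = dict(feeds)
    for sess in sessions:
        input_names = [arg.name for arg in sess.get_inputs()]
        output_names = [arg.name for arg in sess.get_outputs()]
        outputs = sess.run(output_names, {name: tensors[name] for name in input_names})
        tensors.update(zip(output_names, outputs))
    return tensors
```

In use, one would build something like `[make_session(path, i) for i, path in enumerate(part_paths)]`, where `part_paths` is a placeholder for the sub-model files, and call `run_chain` once per decoding step.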
For best performance, you might need to quantize the model to 4 bits to fit on one GPU. However, ONNX Runtime does not support 4-bit natively, so you might need to implement some custom operators for this.
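As a rough back-of-the-envelope check of why precision matters here, the weights-only footprint can be estimated from the parameter count and bits per parameter. This sketch assumes a nominal 30e9 parameters (the real count of a "30B" checkpoint differs somewhat) and ignores activations, the KV cache, and any export-time overhead, so actual memory needs will be higher. The function name `weight_gigabytes` is made up for illustration.

```python
def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes activations and KV cache."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal 30B model at three precisions: prints 120, 60 and 15 GB.
for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"30B @ {label}: {weight_gigabytes(30e9, bits):.0f} GB")
```

By this estimate only the 4-bit variant leaves comfortable headroom on a single 80GB card once runtime buffers are added.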
Hey @tianleiwu, some good news! I was able to run the export on my CPU, and run inference!
However, the inference is very slow, and the model folder is massive: 360 GB.
I believe I exported with O1, which I think is the default; I just ran a basic command (it took about 80 minutes).
For a Llama-based 30B model, what is the suggested quantization (O3?) and configuration? Is --arm64 best? I will re-export and quantize if so.
How do I get rid of these warnings that "13/selfattn/Constant _43 _output_0'. It is not used by any node and should be removed from the model." etc.? Will additional quantization take care of it?
Kind regards and thanks
For #1, you may ask in optimum. For #2, you can ignore the warning since ONNX Runtime will remove those internally. If you want to remove unused constants from the model, you can try a Python script like the following (it only works when the model does not have a sub-graph):
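import onnx
from onnxruntime.transformers.onnx_model import OnnxModel

# Load the exported model and wrap it in the transformers helper class.
m = OnnxModel(onnx.load("input.onnx"))
# Prune constants and initializers that no node uses.
m.update_graph()
# Large models exceed the 2GB protobuf limit, so store weights as external data.
onnx.save(m.model, "output.onnx", save_as_external_data=True)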
Have a try with https://github.com/microsoft/onnxruntime/pull/17990
4X A100 could let you export 70B or larger
Describe the issue
Unable to convert a 30B model to ONNX. I am using 4x A100s, 500GB RAM, 2.5TB Memory, and still running out of memory.
To reproduce
Here's the repro:
I believe this is reproducible in any container, but here's the container setup step:
1) Create a container on Runpod from winglian/axolotl-runpod:main-py3.9-cu118-2.0.0
Then deploy 4x A100 in Secure cloud, search for the Template just created:
2) Once it loads, start the terminal and:
3) Paste the following inference file using vim:
Paste this:
To exit vim, Esc -> Shift + Z -> Shift + Z
4) Now, run the conversion:
python fp16_to_onnx.py WizardLM-30B-Uncensored
This will take about 45 minutes, which already sounds a bit wrong, as it should take about 5 minutes. gpt2 takes 30 seconds to convert.
Then, it will fail with this:
Can you please help unblock? I have been trying to convert this to ONNX for days already
Many thanks
Urgency
I am trying to scale a project and need to convert large models to ONNX, so this is time sensitive, but not critically blocking. A timely response would be much appreciated :)
Platform
Linux
OS Version
Ubuntu 22.04 LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
print(ort.__version__) --> 1.15.0
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0