[OOM] Unable to convert 30B Model

aleph65 commented 1 year ago

Describe the issue

Unable to convert a 30B model to ONNX. I am using 4x A100s, 500GB RAM, 2.5TB Memory, and I am still running out of memory.

To reproduce

Here's the repro:

I believe this is reproducible in any container, but here's the container setup step:

1) Create a container on Runpod from winglian/axolotl-runpod:main-py3.9-cu118-2.0.0

Runpod.io -> My Templates -> New Template -> winglian/axolotl-runpod:main-py3.9-cu118-2.0.0

Then deploy 4x A100 in Secure cloud, search for the Template just created:

2) Once it loads, start the terminal and:

mkdir tmp && ln -s /workspace/tmp /tmp
pip install optimum && pip install onnx && pip install onnxruntime-gpu
git lfs install
git clone https://huggingface.co/ehartford/WizardLM-30B-Uncensored

3) Paste the following inference file using vim:

touch fp16_to_onnx.py
vim fp16_to_onnx.py

Paste this:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from optimum.onnxruntime import ORTModelForCausalLM
import argparse
import os

parser = argparse.ArgumentParser(description="Convert fp16 model to onnx")
parser.add_argument("model_dir", type=str, help="fp16 model folder")
parser.add_argument("--device", type=str, default="cuda:0", help="device")

args = parser.parse_args()

model_dir = args.model_dir

device = torch.device("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_dir)
# model = AutoModelForCausalLM.from_pretrained(
#     model_dir,
#     torch_dtype=torch.float16,
#     low_cpu_mem_usage=True,
#     trust_remote_code=True,
# ).to(device)

save_directory = "onnx_wiz/"
print("Loading")
ort_model = ORTModelForCausalLM.from_pretrained(
    model_dir, export=True).to(device)

print("Saving")
ort_model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

To exit vim, Esc -> Shift + Z -> Shift + Z

4) Now, run the conversion:

python fp16_to_onnx.py WizardLM-30B-Uncensored

This will take about 45 minutes, which already sounds a bit wrong, as it should take about 5 minutes. gpt2 takes 30 seconds to convert.

Then it will fail with this:

Can you please help unblock? I have been trying to convert this to ONNX for days already

Many thanks

Urgency

I am trying to scale a project and need to convert large models to ONNX, so this is time sensitive, but not critically blocking. A timely response would be much appreciated :)

Platform

Linux

OS Version

Ubuntu 22.04 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

print(ort.__version__) --> 1.15.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

Cuda compilation tools, release 11.8, V11.8.89 Build cuda_11.8.r11.8/compiler.31833905_0

tianleiwu commented 1 year ago

An 80GB GPU might be able to export a 13B model.

For a 30B model, it could be a little tricky to export to ONNX since it will need multiple GPUs, and ONNX Runtime does not have good support for multiple GPU inference right now. Could you try exporting using the CPU device instead of the CUDA device?

aleph65 commented 1 year ago

@tianleiwu Thanks so much for the reply.

I will try cpu export instead

1) I heard that O4 optimization requires CUDA / a GPU. Is that right?
2) You said "ONNX Runtime does not have good support for multiple GPU inference". Does that mean ONNX models cannot run on multiple GPUs at once?
3) What is the best way to optimize a large model like 30B or 40B to get about 100-200x performance gains? Or at least the 17x inference speedup that appears possible with ONNX: what needs to happen to get that?

Once again thank you, and I'll update when I'm done trying the CPU conversion. I will run this:

optimum-cli export onnx --task text-generation-with-past --model WizardLM-30B-Uncensored WizardLM-30B-ONNX/

tianleiwu commented 1 year ago

I heard that for O4 optimization, cuda / GPU is required?

You said "onnx runtime does not have good support for multiple gpu inference" does that mean that onnx models would not be able to run on multiple GPUs at once?

What is the best way to optimize a large model like 30b or 40b to get about 100-200x performance gains? Or at least the 17x inference which appears possible from ONNX, what needs to happen to get that?

ONNX export can be done on CPU. For inference, a GPU is recommended for latency reasons.

For FP16, you will need to partition the 30B model into multiple sub-models to fit GPU memory. Right now, you can create one inference session per sub-model and write code to call the sub-models one by one. It is not a simple task, considering you might also need to integrate beam search optimization, etc.
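A minimal sketch of the "one inference session per sub-model" idea, in Python. It assumes the 30B model has already been split into sub-model files (the paths are hypothetical) and that each sub-model's output names match the next sub-model's input names. Neither assumption comes from this thread, and the sketch leaves out beam search and KV cache handling entirely.

```python
from typing import Any, Dict, List, Sequence


def load_stages(paths: Sequence[str]) -> List[Any]:
    """Create one InferenceSession per sub-model, each pinned to its own GPU."""
    # Imported lazily so the chaining logic below can be exercised without onnxruntime.
    import onnxruntime as ort

    return [
        ort.InferenceSession(
            path,
            providers=[("CUDAExecutionProvider", {"device_id": gpu_index})],
        )
        for gpu_index, path in enumerate(paths)
    ]


def run_chain(stages: Sequence[Any], feed: Dict[str, Any]) -> Dict[str, Any]:
    """Call the sub-models one by one, feeding each stage's outputs to the next."""
    for stage in stages:
        output_names = [out.name for out in stage.get_outputs()]
        outputs = stage.run(output_names, feed)
        # Assumption: these output names are exactly the next stage's input names.
        feed = dict(zip(output_names, outputs))
    return feed
```

Usage would look like `run_chain(load_stages(["part0.onnx", "part1.onnx"]), {"input_ids": ids})`, where the file names are placeholders. With plain `run` calls the intermediate tensors come back as NumPy arrays on the host, so every hop between GPUs pays for a copy through CPU memory.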

For best performance, you might need to quantize the model to 4 bits to fit on one GPU. However, ONNX Runtime does not support 4-bit natively, so you might need to implement some custom operators for this.
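The memory pressure behind this advice can be estimated with back-of-the-envelope arithmetic. The sketch below counts the weights only and assumes a nominal 30 billion parameters and an 80 GB card. The real checkpoint's parameter count may differ, and activations, KV cache, and export-time copies all come on top.

```python
def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 30e9  # assumption: nominal "30B"; the actual checkpoint may be larger
GPU_GB = 80        # assumption: one 80 GB GPU

for label, bits in [("fp32", 32), ("fp16", 16), ("int4", 4)]:
    size = weight_gigabytes(NUM_PARAMS, bits)
    share = size / GPU_GB
    print(f"{label}: {size:.0f} GB of weights, {share:.0%} of one {GPU_GB} GB GPU")
```

Under these assumptions the FP16 weights alone take up most of one card before anything else is allocated, while 4-bit weights leave plenty of headroom.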

aleph65 commented 1 year ago

Hey @tianleiwu, some good news! I was able to run the export on my CPU, and run inference!

However, the inference is very slow... and the model folder is massive, 360 GB

I exported in O1 I believe which is the default, just ran a basic command (took about 80 minutes)
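To see where the disk space goes, a small standard-library sketch like the following lists the largest files in the export folder. The folder name is the one from the export command above; nothing here is specific to ONNX.

```python
from pathlib import Path
from typing import List, Tuple


def largest_files(folder: str, top: int = 10) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for the biggest files under folder."""
    sized = [(p.stat().st_size, p) for p in Path(folder).rglob("*") if p.is_file()]
    sized.sort(key=lambda item: item[0], reverse=True)
    return [(p.relative_to(folder).as_posix(), size) for size, p in sized[:top]]


def total_gigabytes(folder: str) -> float:
    """Total size of all files under folder, in decimal gigabytes."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file()) / 1e9


if __name__ == "__main__":
    export_dir = "WizardLM-30B-ONNX"
    if Path(export_dir).is_dir():
        print(f"total: {total_gigabytes(export_dir):.1f} GB")
        for name, size in largest_files(export_dir):
            print(f"{size / 1e9:8.1f} GB  {name}")
```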

1) For a Llama-based 30B model, what is the suggested quantization (O3?) and configuration? Is --arm64 best? I will re-export and quantize if so.
2) How do I get rid of these warnings that "13/selfattn/Constant _43 _output_0'. It is not used by any node and should be removed from the model." etc.? Will additional quantization take care of it?

Kind regards and thanks

tianleiwu commented 1 year ago

For #1, you may ask in optimum. For #2, you can ignore the warning since ONNX Runtime will remove those internally. If you want to remove unused constants from the model, you can try a Python script like the following (it only works when the model does not have a sub-graph):

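import onnx
from onnxruntime.transformers.onnx_model import OnnxModel

# Load the exported model and wrap it in ONNX Runtime's graph helper.
m = OnnxModel(onnx.load("input.onnx"))
# Rebuild the graph, dropping nodes and initializers that nothing consumes.
m.update_graph()
# Models above 2GB must store their weights as external data.
onnx.save(m.model, "output.onnx", save_as_external_data=True)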

wejoncy commented 11 months ago

Have a try with https://github.com/microsoft/onnxruntime/pull/17990

4X A100 could let you export 70B or larger
