huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

High CUDA Memory Usage in ONNX Runtime with Inconsistent Memory Release #2069

Open niyathimariya opened 6 days ago

niyathimariya commented 6 days ago

System Info

Optimum version: 1.22.0
Platform: Linux (Ubuntu 22.04.4 LTS)
Python version: 3.12.2
ONNX Runtime Version: 1.19.2
CUDA Version: 12.1
CUDA Execution Provider: Yes (CUDA 12.1)

Who can help?

@JingyaHuang @echarlaix

Information

Tasks

Reproduction (minimal, reproducible, runnable)

import onnxruntime as ort
import torch
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

def load_model(self, model_name):
    # Session-level options used for the ONNX Runtime inference session
    session_options = ort.SessionOptions()
    session_options.add_session_config_entry('cudnn_conv_use_max_workspace', '0')
    session_options.enable_mem_pattern = False
    session_options.arena_extend_strategy = "kSameAsRequested"
    session_options.gpu_mem_limit = 10 * 1024 * 1024 * 1024  # 10 GiB

    # Load the exported seq2seq model on the CUDA Execution Provider
    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, provider="CUDAExecutionProvider", session_options=session_options)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer, model

def inference(self, batch, doc_id='-1'):
    # self.tokenizer / self.model come from load_model above; logger is a configured logging.Logger
    responses, status = '', False
    try:
        encodings = self.tokenizer(batch, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                encodings.input_ids,
                max_new_tokens=1024
            )
            responses = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            status = True
    except Exception as e:
        logger.error(f"Failed to do inference on LLM, error: {e}")

    # Releases cached PyTorch allocations only; ONNX Runtime's arena is not affected
    torch.cuda.empty_cache()
    return status, responses
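
As far as I understand, gpu_mem_limit, arena_extend_strategy, and cudnn_conv_use_max_workspace are CUDA Execution Provider options in ONNX Runtime rather than SessionOptions attributes, so they may not take effect when set as above. A minimal, untested sketch of passing them via provider_options instead (the model id is a placeholder):

# Untested sketch: route the CUDA EP memory knobs through provider_options.
import onnxruntime as ort
from optimum.onnxruntime import ORTModelForSeq2SeqLM

session_options = ort.SessionOptions()
session_options.enable_mem_pattern = False

model = ORTModelForSeq2SeqLM.from_pretrained(
    "my/onnx-model",  # placeholder model id/path
    provider="CUDAExecutionProvider",
    session_options=session_options,
    provider_options={
        "gpu_mem_limit": str(10 * 1024 * 1024 * 1024),  # 10 GiB arena cap
        "arena_extend_strategy": "kSameAsRequested",
        "cudnn_conv_use_max_workspace": "0",
    },
)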

Expected behavior

I expect the CUDA memory to decrease and be released after processing smaller inputs, freeing memory for subsequent inputs.

[Picture1: GPU memory used by the process after each inference]

IlyasMoutawwakil commented 3 days ago

Hi, the code you provided doesn't explain how you got the chart in your issue. Also, what is "sample number" in this case?

niyathimariya commented 3 days ago

Hi @IlyasMoutawwakil, the code I’ve provided shows how I’m loading the model and performing inference. I’ve also included a graph showing the GPU memory consumed as inference progresses; I recorded the GPU usage after each inference with the following code:

import subprocess

def get_used_memory_mib(pid):
    # Query per-process GPU memory usage (in MiB) via nvidia-smi
    result = subprocess.run(['nvidia-smi', '--query-compute-apps=pid,gpu_name,used_memory', '--format=csv,noheader,nounits'],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        print("Failed to run nvidia-smi:", result.stderr)
        return None

    gpu_processes = result.stdout.strip().split('\n')
    for process in gpu_processes:
        process_info = process.split(', ')
        process_pid = process_info[0]
        # Keep the entry that belongs to this process
        if process_pid == str(pid):
            return int(process_info[2])
    return None
The graph demonstrates that the model, which I’ve exported to ONNX and saved with save_pretrained(), does not release memory when it receives a shorter input sequence after processing a longer one, whereas the PyTorch model releases memory in such cases.

I've plotted another graph showing the input shape (batch size, sequence length) for each sample:

[Picture1]

IlyasMoutawwakil commented 3 days ago

And I assume "sample number" is supposed to mean sequence length? Edit: okay, thanks, I see the updated graph. Either way, this doesn't seem like an Optimum issue, but rather one on the ONNX Runtime side (the inference session), since that is the part that handles memory allocation and release.
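
One ONNX Runtime knob that might be worth trying (untested here): the run-level config entry memory.enable_memory_arena_shrinkage asks the arena allocator to shrink back after a run instead of holding on to its high-water mark. A minimal sketch at the raw onnxruntime level, with a placeholder model path and input feed; wiring it into the sessions Optimum creates would take extra work:

# Untested sketch: request arena shrinkage after each run.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])  # placeholder path

run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

inputs = {"input_ids": np.ones((1, 16), dtype=np.int64)}  # placeholder input name and shape
outputs = session.run(None, inputs, run_options=run_options)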

niyathimariya commented 2 hours ago

Thanks, @IlyasMoutawwakil. Do you think this is normal behavior for ONNX Runtime?