niyathimariya opened this issue 6 days ago
Hi, the code you provided doesn't explain how you got the chart in your issue, and what is "sample number" in this case?
Hi @IlyasMoutawwakil, the code I've provided shows how I'm loading the model and performing inference. I've also included a graph showing the GPU memory consumed as inference progresses (I recorded the GPU usage after each inference using the following code):
import subprocess

def get_used_gpu_memory(pid):
    # Query per-process GPU memory usage via nvidia-smi.
    result = subprocess.run(
        ['nvidia-smi', '--query-compute-apps=pid,gpu_name,used_memory',
         '--format=csv,noheader,nounits'],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        print("Failed to run nvidia-smi:", result.stderr)
        return None
    gpu_processes = result.stdout.strip().split('\n')
    for process in gpu_processes:
        process_info = process.split(', ')
        process_pid = process_info[0]
        if process_pid == str(pid):
            # nvidia-smi reports used_memory in MiB.
            used_memory_mib = int(process_info[2])
            return used_memory_mib
    return None
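A hypothetical measurement loop around this helper would look roughly like the sketch below (the model path and input names are placeholders, not from the original issue):

import os
import numpy as np
import onnxruntime as ort

# Placeholder model path and input names; adjust to the actual exported model.
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
pid = os.getpid()
memory_trace = []
# Run a long sequence first, then shorter ones, to expose the behavior.
for seq_len in (512, 256, 128, 64):
    feed = {
        "input_ids": np.ones((1, seq_len), dtype=np.int64),
        "attention_mask": np.ones((1, seq_len), dtype=np.int64),
    }
    session.run(None, feed)
    memory_trace.append(get_used_gpu_memory(pid))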
The graph demonstrates that the model, which I converted to ONNX using the save_pretrained() method, does not release memory when it encounters a shorter input sequence after processing a longer one, whereas the PyTorch model releases memory in such cases.
I've also plotted another graph showing the input shape (batch size, sequence length).
And I assume "sample number" is supposed to mean sequence length? Edit: okay, thanks, I see the updated graph. Either way, this doesn't seem like an Optimum issue, but rather one on the onnxruntime side (the inference session), since that is the part that handles memory allocation and release.
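For anyone who wants to experiment on the onnxruntime side, the CUDA execution provider exposes arena options that control how memory is allocated and retained. A minimal sketch follows; the specific settings are things to try rather than a confirmed fix, and "model.onnx" is a placeholder:

import onnxruntime as ort

# By default the CUDA EP arena grows in powers of two ("kNextPowerOfTwo");
# "kSameAsRequested" makes it allocate only what each request needs.
provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,  # optional hard cap, in bytes
}
session = ort.InferenceSession(
    "model.onnx",
    providers=[("CUDAExecutionProvider", provider_options)],
)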
Thanks, @IlyasMoutawwakil. Do you think this is normal behavior for ONNX Runtime?
System Info
Who can help?
@JingyaHuang @echarlaix
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
Expected behavior
I expect the CUDA memory usage to decrease, with memory released after processing smaller inputs, so that it is available for subsequent inputs.
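One related knob worth noting: newer onnxruntime releases support a per-run config entry that asks the memory arena to shrink after a Run() call. A sketch, assuming a session using the CUDA EP on device 0 (shrinkage generally pairs with arena_extend_strategy set to "kSameAsRequested"):

import onnxruntime as ort

# Ask the memory arena on GPU 0 to release unused chunks at the end of
# each Run() call.
run_options = ort.RunOptions()
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
outputs = session.run(None, feed, run_options)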