FusedMHARunnerFP16v2 causes an onnxruntime core dump when multiple host threads call session.run()

zwyao commented 1 month ago

Describe the issue

In my BERT model, when I use head size == 32, the attention CUDA kernel causes ORT to core dump. The error message says "cuda illegal memory access was encountered". I found that the reason is that FusedMHARunnerFP16v2 does not support concurrent execution.
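To make the failure mode concrete, here is a minimal sketch of the general hazard, assuming the problem comes from a shared runner object that stores per-call launch state in member variables. `StatefulRunner`, `Setup`, `Run`, and `ReplayBadInterleaving` are hypothetical names for illustration, not the onnxruntime classes or their signatures. The bad interleaving is replayed on a single thread so the outcome is deterministic.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a runner that caches per-call state in members.
// This is NOT the onnxruntime class; it only mirrors the hazard pattern.
class StatefulRunner {
 public:
  // Step 1: record the shape of the upcoming launch in member variables.
  void Setup(int sequence_length, int batch_size) {
    sequence_length_ = sequence_length;
    batch_size_ = batch_size;
  }

  // Step 2: "launch" with whatever shape is currently stored.
  // Returns the number of elements the launch would touch.
  std::size_t Run() const {
    return static_cast<std::size_t>(sequence_length_) *
           static_cast<std::size_t>(batch_size_);
  }

 private:
  int sequence_length_ = 0;
  int batch_size_ = 0;
};

struct InterleavingResult {
  std::size_t sized_for;  // elements caller A sized its buffers for
  std::size_t touched;    // elements caller A's launch actually touches
};

// Replays one bad interleaving of two callers that share a runner:
// A.Setup, B.Setup, A.Run. With real threads this ordering is possible
// whenever Setup and Run are not protected as a single unit.
InterleavingResult ReplayBadInterleaving() {
  StatefulRunner shared;
  const int a_seq = 128, a_batch = 1;
  const int b_seq = 512, b_batch = 1;

  shared.Setup(a_seq, a_batch);  // caller A prepares its launch
  shared.Setup(b_seq, b_batch);  // caller B overwrites the shared state
  const std::size_t touched = shared.Run();  // caller A launches with B's shape

  return {static_cast<std::size_t>(a_seq) * a_batch, touched};
}
```

In this replay, caller A sized its buffers for 128 elements but its launch covers 512. On a GPU, that kind of mismatch is one way to end up with an illegal memory access.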

To reproduce

attention_bug_fix.txt

The attached file contains my fix.

Urgency

No response

Platform

Linux

OS Version

1.18.0

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0 master

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

zwyao commented 1 month ago

I found that the bug is still not fixed in the latest version, 1.19.1.

tianleiwu commented 1 month ago

@zwyao, the thread-safety issue in FusedMHARunnerFP16v2 for self-attention was fixed in https://github.com/microsoft/onnxruntime/pull/21420. There was another fix for cross-attention. The bug was resolved in the 1.19.0 release. Please try 1.19.2.
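For readers wondering what "thread-safe" can mean for a shared runner, here is a minimal sketch of one common approach: keep the runner free of per-call state and pass the launch parameters with each call. This is an illustration under that assumption, not the actual change made in the pull request. `LaunchParams`, `StatelessRunner`, and `RunConcurrently` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-call parameters; each caller owns its own copy.
struct LaunchParams {
  int sequence_length;
  int batch_size;
};

// Hypothetical runner with no mutable members, so concurrent calls
// to Run() have no shared state to race on.
class StatelessRunner {
 public:
  // Returns the number of elements this launch would touch.
  std::size_t Run(const LaunchParams& p) const {
    return static_cast<std::size_t>(p.sequence_length) *
           static_cast<std::size_t>(p.batch_size);
  }
};

// Runs every request on its own host thread against one shared runner.
// Each thread writes only to its own slot in `results`.
std::vector<std::size_t> RunConcurrently(
    const std::vector<LaunchParams>& requests) {
  StatelessRunner shared;
  std::vector<std::size_t> results(requests.size());
  std::vector<std::thread> workers;
  workers.reserve(requests.size());

  for (std::size_t i = 0; i < requests.size(); ++i) {
    workers.emplace_back(
        [&shared, &requests, &results, i] {
          results[i] = shared.Run(requests[i]);
        });
  }
  for (auto& w : workers) {
    w.join();
  }
  return results;
}
```

Calling `RunConcurrently({{128, 1}, {512, 2}})` gives each thread a result that matches its own parameters, regardless of how the threads are scheduled.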

zwyao commented 1 month ago

emmm, thanks
