microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Why must the CUDA provider allocator be thread_local? #8378

Open MyPandaShaoxiang opened 3 years ago

MyPandaShaoxiang commented 3 years ago

Describe the bug
I have found that the CUDA provider uses a thread_local allocator to allocate regular GPU memory, and this causes heavy memory consumption when I launch many threads running on a single device. Reading the implementation, I see the comment: "A hypothesis is that arena allocator is not aligned with CUDA output cache, and data from different kernel writes may cause cacheline to contain dirty data." Can anyone explain what this comment means? Can we avoid this problem without a thread_local allocator?
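For illustration, here is a minimal C++ sketch of the memory-multiplication effect being described; it is not the actual ONNX Runtime allocator code, and the `Arena` type, the 256 MB reservation size, and `GetPerThreadArena` are all hypothetical. The point is only that a `thread_local` arena is constructed once per thread, so N worker threads end up holding N separate device reservations on the same GPU:

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for an arena that grabs a large device
// reservation up front. A real CUDA arena would cudaMalloc here.
struct Arena {
  explicit Arena(std::size_t bytes) : reserved(bytes) {
    std::printf("thread %zu reserves %zu MB\n",
                std::hash<std::thread::id>{}(std::this_thread::get_id()),
                reserved / (1024 * 1024));
  }
  std::size_t reserved;
};

Arena& GetPerThreadArena() {
  // thread_local: the first use on each thread constructs a separate
  // arena, so N threads lead to N reservations on the same device.
  thread_local Arena arena(256 * 1024 * 1024);  // hypothetical 256 MB
  return arena;
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i)
    workers.emplace_back([] { GetPerThreadArena(); });
  for (auto& t : workers) t.join();
  // 8 threads => 8 x 256 MB reserved, even though a single shared,
  // lock-protected arena could in principle serve all of them.
}
```

A shared allocator would avoid this multiplication at the cost of synchronization (and, per the quoted comment, possible cacheline interference between concurrent kernel outputs), which appears to be the trade-off the question is asking about.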

snnn commented 3 years ago

@KeDengMS , can you answer this?

MyPandaShaoxiang commented 3 years ago

Can anyone else answer this issue? This is a really nagging problem.