microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Why must the CUDA provider allocator be thread_local? #8378

Open MyPandaShaoxiang opened 3 years ago

MyPandaShaoxiang commented 3 years ago

Describe the bug
I have found that the CUDA provider uses a thread_local allocator to allocate regular GPU memory, and this causes heavy memory consumption when I launch many threads running on a single device. Reading the implementation, I see the comment: "A hypothesis is that arena allocator is not aligned with CUDA output cache, and data from different kernel writes may cause cacheline to contain dirty data." Can anyone explain what this comment means? Can we avoid this problem without a thread_local allocator?
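For illustration, here is a minimal C++ sketch of the memory-multiplication effect being described; it is not the actual ONNX Runtime allocator code, and the `Arena` type, the 256 MB reservation size, and `GetPerThreadArena` are all hypothetical. The point is only that a `thread_local` arena is constructed once per thread, so N worker threads end up holding N separate device reservations on the same GPU:

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for an arena that grabs a large device
// reservation up front. A real CUDA arena would cudaMalloc here.
struct Arena {
  explicit Arena(std::size_t bytes) : reserved(bytes) {
    std::printf("thread %zu reserves %zu MB\n",
                std::hash<std::thread::id>{}(std::this_thread::get_id()),
                reserved / (1024 * 1024));
  }
  std::size_t reserved;
};

Arena& GetPerThreadArena() {
  // thread_local: the first use on each thread constructs a separate
  // arena, so N threads lead to N reservations on the same device.
  thread_local Arena arena(256 * 1024 * 1024);  // hypothetical 256 MB
  return arena;
}

int main() {
  std::vector<std::thread> workers;
  for (int i = 0; i < 8; ++i)
    workers.emplace_back([] { GetPerThreadArena(); });
  for (auto& t : workers) t.join();
  // 8 threads => 8 x 256 MB reserved, even though a single shared,
  // lock-protected arena could in principle serve all of them.
}
```

A shared allocator would avoid this multiplication at the cost of synchronization (and, per the quoted comment, possible cacheline interference between concurrent kernel outputs), which appears to be the trade-off the question is asking about.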

snnn commented 3 years ago

@KeDengMS , can you answer this?

MyPandaShaoxiang commented 3 years ago

Can anyone else answer this issue? This is a really nagging problem.