ongjunjie opened 5 years ago
There are a few options for supporting this:
Allow it, but only in cases where any race conditions are benign (e.g. storing the same value to the same site). This would probably be implemented by disallowing update definitions whenever we can't prove there's no cross-talk.
Allow it, but interpret it as a separate memory region per parallel task, so that there's no possibility of cross-talk. This is a bit weird, because really you have a large array of allocations instead of a single allocation. The number of allocations would be the trip count of the outer parallel loop, which is inefficient; really you'd want it to be the maximum number of concurrently running tasks, but then you'd have to manage it with an allocator...
Solve the use case more specifically by supporting compute_at(g, gpu_blocks).store_in(MemoryType::Global) or similar, then implement it however we like. The first thing to try would be just calling malloc inside the generated kernel.
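As a rough illustration of "just calling malloc inside the generated kernel", here is a hedged CUDA sketch (not runnable without a GPU). Device-side malloc/free and cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) are real CUDA features; the kernel name, buffer names, and launch shape are invented, and this is only a sketch of what generated code might look like, not Halide's actual codegen.

```cuda
#include <cstdio>

__global__ void fused_stage(float *out, int elems_per_block) {
    // One allocation per block, made by thread 0 and published to the
    // rest of the block, standing in for the intermediate stage's buffer.
    __shared__ float *scratch;
    if (threadIdx.x == 0)
        scratch = (float *)malloc(elems_per_block * sizeof(float));
    __syncthreads();
    if (scratch == NULL) return;  // device heap exhausted

    int i = threadIdx.x;
    scratch[i] = blockIdx.x + i;                           // producer stage
    __syncthreads();
    out[blockIdx.x * blockDim.x + i] = scratch[i] * 2.0f;  // consumer stage

    __syncthreads();
    if (threadIdx.x == 0)
        free(scratch);
}

int main() {
    // In-kernel malloc draws from a device heap whose size is set with
    // cudaDeviceSetLimit; the default (8 MB) may be too small.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 << 20);
    float *out;
    cudaMalloc(&out, 128 * 256 * sizeof(float));
    fused_stage<<<128, 256>>>(out, 256);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```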
Detect cases where we're going to allocate too much shared memory automatically and lower those to calling malloc inside the kernel. We could also make a pool of global memory and manage it with our own allocator, if malloc proves to be weirdly slow.
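The "pool of global memory with our own allocator" alternative could be as simple as a bump allocator over one preallocated block. A minimal sketch in C, with all names invented for illustration; a real version would need atomic allocation for concurrent callers and a story for freeing individual allocations rather than resetting the whole pool.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;   /* one up-front allocation of global memory */
    size_t   size;
    size_t   used;
} Pool;

static Pool pool_create(size_t size) {
    Pool p = { malloc(size), size, 0 };
    return p;
}

/* Bump-allocate n bytes, 16-byte aligned; NULL when the pool is full. */
static void *pool_alloc(Pool *p, size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15;
    if (p->base == NULL || p->used + aligned > p->size)
        return NULL;
    void *out = p->base + p->used;
    p->used += aligned;
    return out;
}

/* Release everything at once, e.g. at the end of a kernel launch. */
static void pool_reset(Pool *p) { p->used = 0; }
```

This avoids a malloc call per allocation, which is the point if device-side malloc proves to be weirdly slow.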
Thoughts?
The use case for this is running on a GPU where some intermediate stages need to be stored in global memory, because the available shared memory is too small. Today, this behaviour can only be achieved by scheduling these stages to run in their own kernels rather than fusing them with their consumers.
Toy example:
This gives the error