salesforce / warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
BSD 3-Clause "New" or "Revised" License

Unique env with mixed # of threads/block and chained CUDA kernels. Is Warp-Drive appropriate? #97

Closed UsaidPro closed 5 months ago

UsaidPro commented 5 months ago

Hello! I have an unusual environment that I am having difficulty implementing in warp-drive. Essentially, the environment has N agents place M units on their own boards. After all agents are done placing units, the boards are matched against each other and intensive computations are performed to determine per-agent rewards.

I was thinking I could have a CUDA Step function with N agents (threads) per environment (1 block per env) which would handle the overall state/action. Once the agents have finished performing actions, a CUDA BoardStep function with M units (threads) per board (1 block per board) would run on the board_state produced by mapping state -> board_state (the mapping itself done by a separate CUDA kernel). Essentially, I am attempting the following:

step():  # 4 agents per env
    CudaEnvStep(_state_, _action_, _done_)  # 4 agents per block
    if (_done_ && !board_done):
        CudaMapEnvToBoard(_state_, board_state)
    while (_done_ && !board_done):
        CudaBoardStep(board_state, board_done, board_reward)   # 24 units per block
    if (_done_ && board_done):
        CudaCombineRewards(board_reward, _reward_)   # 4 agents per block again
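The host-side chaining above can be sketched in plain Python. This is only a CPU stand-in to show the control flow, not a Warp-Drive implementation: the kernel names (CudaEnvStep, CudaBoardStep, etc.) and the toy placement/combat logic inside them are assumptions from this issue.

```python
N_AGENTS = 4   # agents (threads) per env block
M_UNITS = 24   # units (threads) per board block

def cuda_env_step(state, actions, done):
    # Stand-in for CudaEnvStep: each agent places one unit per call.
    for i in range(N_AGENTS):
        state["placed"][i] += 1
    if all(p >= state["units_per_agent"] for p in state["placed"]):
        done["env"] = True

def cuda_map_env_to_board(state, board_state):
    # Stand-in for CudaMapEnvToBoard: project env state onto the board.
    board_state["units"] = sum(state["placed"])

def cuda_board_step(board_state, board_done, board_reward):
    # Stand-in for CudaBoardStep: resolve one round of board combat.
    board_state["units"] -= M_UNITS // 4
    if board_state["units"] <= 0:
        board_done["board"] = True
        board_reward["value"] = 1.0

def cuda_combine_rewards(board_reward, reward):
    # Stand-in for CudaCombineRewards: fan the board reward back out.
    for i in range(N_AGENTS):
        reward[i] = board_reward["value"]

def step(state, actions, done, board_state, board_done, reward):
    # The if/while branching here lives on the host; each call below
    # would be a separate kernel launch in the real environment.
    board_reward = {"value": 0.0}
    cuda_env_step(state, actions, done)
    if done["env"] and not board_done["board"]:
        cuda_map_env_to_board(state, board_state)
    while done["env"] and not board_done["board"]:
        cuda_board_step(board_state, board_done, board_reward)
    if done["env"] and board_done["board"]:
        cuda_combine_rewards(board_reward, reward)
```

Note the design point the questions below are probing: the if/while conditions are evaluated on the host between launches, so each branch implies a device-to-host sync of the done flags.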

I have implemented CudaBoardStep(). I am not sure whether Warp-Drive's Trainer can handle multiple CUDAFunctionManagers with different threads/block, or whether this would impact Warp-Drive's performance. Looking at the example environments, I do not see one that mixes thread counts or chains CUDA kernels.

Questions:

  1. Does warp-drive support chained CUDA kernels? Can I make every operation in my step a separate CUDA kernel if necessary, and have warp-drive chain them together, similar to CUDA Graphs?
  2. Can I have CUDA functions with a different # of threads per block (i.e., a different # of "agents" per environment) mixed within a step() without expecting a significant performance loss?
  3. Would branch/loop operations like if/while run on the GPU? I am not sure whether the if/while operations run within the PyTorch GPU context or not.
Emerald01 commented 5 months ago

> Does warp-drive support chained CUDA kernels? Can I make every operation in my step a separate CUDA kernel if necessary and warp-drive will chain them together similar to CUDA Graphs?

> Can I have CUDA functions with a different # of threads per block (aka different # of "agents" per environment) mixed within a step() without expecting a significant performance loss?

> Would branch/loop operations like if/while run on GPU? I am not sure if the if/while operations are running within PyTorch GPU context or not.

UsaidPro commented 5 months ago

Thank you so much for the quick response! I just found the Slack link, so I will use that for any future questions; sorry for creating this issue. Closing.