salesforce / warp-drive

Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning Framework on a GPU (JMLR 2022)
BSD 3-Clause "New" or "Revised" License

Update README.md for limitations #15

Closed nubonics closed 1 year ago

nubonics commented 2 years ago

In the README it states that this library is useful / fast for simple RL problems, and that the environment included with the library was created to be simple for understanding purposes. This leads me to my question. [Apologies, as I don't know of another way of sending this message to you without a GitHub issue; it's just my lack of understanding, but perhaps it will help others.]

What are the limitations of this library?

Could I create an environment such as a humanoid, create multiple instances of humanoids in one environment, and have them learn ["cheat"] from each other to find the fastest way to get across the environment [a 100m dash, for example]?

Could this library be used to train agents within a Unity environment [probably not actually training in the Unity environment itself, but rather visualizing in the Unity environment after training]?

Emerald01 commented 2 years ago

Hi, thank you for your interest and your question.

WarpDrive provides a multi-agent RL development framework that allows both environment rollout and training to happen entirely on the GPU. We provide the back-end GPU sampler and environment resetter for all environments, as well as high-level Python managers to control a gym-style pipeline. The user can provide a custom CUDA C step function, and WarpDrive automatically couples it with the WarpDrive ecosystem. The high speed mostly comes from two factors: within WarpDrive, 1. there is no data transfer cost between the CPU host and the GPU device, and 2. each individual GPU thread handles one agent and each GPU block handles one environment replica for the environment step, sample and reset, so it has massively parallel throughput.

As for the limitations: as long as your environment can be expressed as an agent-to-thread (worker) mapping, you just need to write a CUDA C function that reflects the step() logic, i.e., how an action changes the states. For example, for a humanoid, I could treat each degree of freedom as an agent and each humanoid instance as an env replica. Then in this CUDA step function, I would use multiple threads to represent those motion freedoms respectively (different threads can run in parallel, or sometimes in sequence when they need to communicate), and each GPU block to represent one independent humanoid environment.
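To make the thread/block mapping concrete, here is a minimal CUDA C sketch of such a step kernel. It is not WarpDrive's actual step-function signature (the real one receives the arrays registered with WarpDrive's data manager); the array layout, `kTorqueScale` constant, and reward are made up purely for illustration.

```cuda
// Hypothetical step kernel: one thread per degree of freedom (agent),
// one block per humanoid environment replica.
// NOTE: illustrative only; not the signature WarpDrive expects.
__global__ void humanoid_step(
    float *joint_angles,   // [num_envs * num_dofs]
    const float *actions,  // [num_envs * num_dofs]
    float *rewards,        // [num_envs * num_dofs]
    int num_dofs,
    float dt
) {
    const float kTorqueScale = 0.1f;   // made-up physics constant
    int env_id = blockIdx.x;           // one block  = one env replica
    int agent_id = threadIdx.x;        // one thread = one agent (degree of freedom)
    int idx = env_id * num_dofs + agent_id;

    if (agent_id < num_dofs) {
        // Apply this agent's action to its degree of freedom.
        joint_angles[idx] += kTorqueScale * actions[idx] * dt;
    }

    // Threads within a block can synchronize, e.g. before computing anything
    // that depends on the other agents' updated states.
    __syncthreads();

    if (agent_id < num_dofs) {
        rewards[idx] = -fabsf(joint_angles[idx]);  // made-up per-agent reward
    }
}
```

Launched with one block per replica and one thread per agent (e.g. `humanoid_step<<<num_envs, num_dofs>>>(...)`), every environment replica steps in parallel with no CPU involvement, which is where the speedup described above comes from.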

Our example TAG environments in this open-source release, although they seem relatively simple, still include most of the possible complexities. For example, the agents can move 360 degrees at any incremental speed, and they see partial information from other agents and communicate. The taggers learn how to encircle and catch the runners, and vice versa, the runners learn how to spread out and speed up or brake to avoid being caught. We also have more complicated environments, like the AI-Economist COVID environment built with WarpDrive; you can refer to our paper for more details.

I hope this answers some of your questions. We are more than happy to have further discussions.

nubonics commented 2 years ago

Thank you for such an in-depth response!

Seems I still have A LOT to learn.

Apologies / Kudos!!! Yes, I do understand the complexities of the "simple" example you have provided with the taggers / runners; however, I did not give credit to those complexities before, and now I do :)

TY! / Background Yes, this is one of the reasons I am so interested in warp-drive: it seems generations ahead of other libraries that use the CPU to pass data back and forth, which kind of defeats the purpose of GPU training. I recently thought to myself, why is this taking so long to train on a GPU... and then I discovered that, and figured there had to be a faster / better way, which is how I found multi-agent RL. But even that didn't speed up the training; it still used only 0.3% of my GPU at maximum. Then I eventually found warp-drive, so it is nice to have this as a public library.

Enhancement? Not sure whether Numba can help with the CUDA C or not; it seems to only help with NumPy-type operations, and I don't know how the step works in depth.

Jargon What is a GPU block? How do I find out how many GPU blocks I have? How do I find out how many threads my GPU has?

Clarification My thought process wasn't to assign an agent to each degree of freedom, but rather to assign one agent per humanoid, and instead of a 1:1 agent:environment ratio, have many_humanoid_agents:one_environment. I think I am confusing neural nets with RL, though. In a NN, I think each neuron is capable of a degree of freedom [using a sigmoid activation], so perhaps many agents for one humanoid in one environment * however many can be run on the GPU makes sense [ignoring the removal of humanoids:environments as the centralized training progresses].

Limitations Synopsis Please forgive my lack of knowledge here, but as a summary / synopsis, what you are saying about the limitations of warp-drive is: 1. the number of agents within the environment must fit within the capabilities of the GPU; 2. a step / reset function is required [as with any RL].

Emerald01 commented 2 years ago

The original motivation of WarpDrive is to fully utilize the GPU resources while reducing communication with the CPU host as much as possible. I think WarpDrive does a pretty good job of demonstrating this design idea.

So hardware-wise, the limitation for WarpDrive is mostly the memory and the total number of threads available on the GPU, since we use the GPU at nearly 100% almost all the time. Usually, however, the latest GPUs can easily support about 1000 threads per block and a few thousand blocks in massive parallel, so in practice the limitation is really the memory. (BTW, a block is a grouping unit for threads, with a maximum of 1024 threads per block on most GPUs.)
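If you want to check these numbers for your own GPU, one option (outside of WarpDrive itself) is to query the device properties with the standard CUDA runtime API; a minimal sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int d = 0; d < device_count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s\n", d, prop.name);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  multiprocessors (SMs): %d\n", prop.multiProcessorCount);
        printf("  max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  global memory:         %.1f GB\n", prop.totalGlobalMem / 1e9);
    }
    return 0;
}
```

Note that there isn't a fixed number of "blocks you have": blocks are a logical grouping of threads that the GPU schedules onto its multiprocessors, which is why the practical limits are threads per block and memory.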

Implementing a custom step function is unavoidable for the user (we have many tools to make it as simple as possible, and we keep checking in code updates to improve it), while we make the reset and sample back-ends universally coupled to any environment. With such a framework, the user's job is really just to provide a custom step function, and everything else will be automatically wrapped up by WarpDrive to speed up your RL by a few orders of magnitude. It is usually worth it if you want to reduce your training time from weeks to a few hours, and it saves a lot of training budget and electricity. As for how to create such a step function, it is case by case. In general, for multi-agent RL we prefer a 1:1 mapping of an agent to a GPU thread and an env replica to a GPU block, which can reach almost perfect parallelism. For the humanoid, we can discuss the details. If you define an agent as a complete humanoid with all its degrees of freedom, then it seems reasonable to me to use one thread to execute each humanoid.
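Under that alternative mapping (one thread = one whole humanoid agent, one block = one shared race environment), a step kernel could look roughly like the sketch below; again, the array layout, constants, and reward are invented for illustration and this is not WarpDrive's actual API.

```cuda
// Hypothetical alternative mapping: each thread advances one complete
// humanoid (all of its joints), and one block holds one shared race env.
// NOTE: illustrative only; layouts and constants are made up.
__global__ void race_step(
    float *joint_angles,   // [num_envs * num_humanoids * num_joints]
    float *positions,      // [num_envs * num_humanoids]
    const float *actions,  // [num_envs * num_humanoids * num_joints]
    float *rewards,        // [num_envs * num_humanoids]
    int num_humanoids,
    int num_joints,
    float dt
) {
    int env_id = blockIdx.x;        // one block  = one environment replica
    int humanoid_id = threadIdx.x;  // one thread = one humanoid agent
    if (humanoid_id >= num_humanoids) return;

    int agent = env_id * num_humanoids + humanoid_id;
    int base = agent * num_joints;

    // One thread loops over all joints of its own humanoid.
    float effort = 0.0f;
    for (int j = 0; j < num_joints; ++j) {
        joint_angles[base + j] += 0.1f * actions[base + j] * dt;
        effort += fabsf(actions[base + j]);
    }

    // Made-up forward progress and reward for a "100m dash" style task.
    positions[agent] += 0.01f * effort * dt;
    rewards[agent] = positions[agent];
}
```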