heinerb opened this issue 2 years ago
@sven1977 Adding a little more detail.
The following image shows the single-champion setup:
Each worker (of which there are 60) has an RLlib PolicyMap configured to cache 10 policies in memory. Once capacity is reached, the oldest policies are written to disk and reloaded the next time they are used.
Note: with a policy cache capacity of 10, policies are swapped to and from disk, which lowers throughput but keeps memory usage at reasonable levels.
Going back to the PolicyMap: as noted above, it lives at the worker level, not the node level, leaving us with multiple duplicate copies of the above policies, one set per worker.
This is manageable for a single champion, and the worker count can be increased past 60, but limits will clearly be reached. Multi-agent training has similar constraints, though they are less likely to show up in simple agent-vs-agent setups. NvN agent training would presumably hit similar limits to champion-vs-multiple-challenger setups.
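For reference, a minimal sketch of where this per-worker cache is configured in the Ray 1.x-era API: the policy_map_capacity key is the knob discussed above, while the env name, policy names/specs, and the cache directory shown here are placeholders for illustration, not taken from the actual setup.

```python
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.policy.policy import PolicySpec

config = {
    "env": "CartPole-v0",  # placeholder env; the real setup uses a league env
    "num_workers": 60,
    "multiagent": {
        # Placeholder roster; a real run registers champion and challenger policies here.
        "policies": {f"policy_{i}": PolicySpec() for i in range(60)},
        "policy_mapping_fn": lambda agent_id, *args, **kwargs: "policy_0",
        # Per-worker LRU cache: only this many policies stay in RAM; the rest
        # are swapped to disk and reloaded on next use (the behavior above).
        "policy_map_capacity": 10,
        # Directory for the on-disk policy cache (path shown is illustrative).
        "policy_map_cache": "/tmp/policy_map_cache",
    },
}

trainer = PPOTrainer(config=config)
```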
Let's look at an example of a small league with multiple champions.
This setup extends the above and runs six champions, a mix of Main-Agents and Main-Exploiters (with various selection strategies).
So the possible combinations of policies in a given worker are:
Now, in any given match only 1 champion + 8 challenger policies are used, which can be handled by the policy map. So we can cache items, which helps at the cost of throughput, but at least allows us to run. However, in the example above we end up with less than 10% free memory on the nodes, which can cause stability concerns in Ray/RLlib. (Note: Ray prints a warning to the console at every report for this.) The mapping sketch below illustrates the per-match selection.
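As a rough illustration of that per-match selection (a sketch, not the actual script): a policy_mapping_fn can be rebuilt per match so that workers only ever touch 1 champion + 8 challenger policies. The roster names and the agent-ID convention below are hypothetical.

```python
import random

# Hypothetical league roster; names are placeholders, not from the actual setup.
CHAMPIONS = [f"main_agent_{i}" for i in range(6)]
CHALLENGERS = [f"challenger_{i}" for i in range(100)]


def make_match_mapping_fn():
    """Draw one champion and 8 challengers for the next match and return a
    policy_mapping_fn that only references those 9 policies, so a worker's
    PolicyMap never needs more than 9 entries in memory during the match."""
    champion = random.choice(CHAMPIONS)
    challengers = random.sample(CHALLENGERS, 8)

    def policy_mapping_fn(agent_id, *args, **kwargs):
        # Assumed convention: agent_0 is the champion side, agent_1..agent_8
        # are the challenger slots.
        if agent_id == "agent_0":
            return champion
        slot = int(agent_id.split("_")[1]) - 1
        return challengers[slot % len(challengers)]

    return policy_mapping_fn
```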
I'm looking to see if there is an implementation detail I am missing for how to implement large groups of varying policies (action space, observation space, model, and optionally rewards and dones) in this setup.
Thanks for this long post. Great info. The policy map is currently per rollout worker. I guess you are proposing to make it per Ray worker node? My wild guess is that while this would certainly ease pressure on memory, it may result in performance issues and code complexity. The work described in this post is awesome. I wonder if we should add your study as an example, so your knowledge and tuning of hundreds of policies is shared with the community.
Happy to have this added as an example.
Search before asking
Ray Component
RLlib
What happened + What you expected to happen
Currently exploring heterogeneous league-based training with RLlib and running into the scalability/implementation issues noted below:
League Terminology Used In this Post:
Given how env_runner in the sampler is implemented, the resetting/initializing of running environments for challenger policies between training iterations/matches (train step/callback for training results) only takes effect at the end of the first episode of the next training iteration/match. This leaves the next training iteration/match with samples from the previous tournament match. A workaround is in the provided example script (see the callback sketch below). https://github.com/ray-project/ray/blob/596c8e27726075623d51428c549304bc0f141f8d/rllib/evaluation/sampler.py#L963-L969
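To illustrate the workaround (a sketch against the Ray 1.x-era callback API, not the actual example script): an on_train_result hook can swap in the next match's mapping function after each training iteration. Episodes already in flight keep the old mapping until they finish, which is exactly the carry-over described above. make_match_mapping_fn is the hypothetical helper from the earlier sketch.

```python
from ray.rllib.agents.callbacks import DefaultCallbacks


class LeagueCallbacks(DefaultCallbacks):
    """Rotate the match roster between training iterations (sketch)."""

    def on_train_result(self, *, trainer, result, **kwargs):
        # Build the next match's mapping (hypothetical helper sketched earlier).
        new_mapping_fn = make_match_mapping_fn()

        # Push it to every rollout worker. Environments that are already
        # running only pick this up once their current episode ends, so the
        # first episode of the next iteration still reflects the previous
        # match -- the carry-over described above.
        trainer.workers.foreach_worker(
            lambda w: w.set_policy_mapping_fn(new_mapping_fn)
        )
```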
The policy_map_capacity setting, which in the example is set to 2 to keep the minimum number of policies in memory. However, when scaled across 250 workers in the CartPole-v0 and MountainCarContinuous-v0 examples, this can take 250GB of memory. Question: is there a recommended way to run multiple policies in each worker such that it supports heterogeneous league-like runs, where challengers are constantly loaded in, without blowing through memory? A back-of-envelope check is sketched below.
Hardware: 128 cores, 256 threads, V100, 256GB memory.
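As a back-of-envelope check of why the footprint scales this way: only the worker count, cache capacity, and the roughly 250GB total come from the report above; the per-policy size is an assumed illustrative figure, not a measurement.

```python
# Rough scaling check: resident memory grows with workers x cached policies,
# because every rollout worker holds its own PolicyMap copy.
num_workers = 250            # from the CartPole-v0 / MountainCarContinuous-v0 runs
policy_map_capacity = 2      # policies kept in RAM per worker
gb_per_cached_policy = 0.5   # assumed illustrative size (weights + framework overhead)

total_gb = num_workers * policy_map_capacity * gb_per_cached_policy
print(f"~{total_gb:.0f} GB resident across workers")  # ~250 GB, in line with the observation above
```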
Note: I am willing to provide more involved examples and answer questions as needed to resolve the mapping issue.
Question: is it possible to have a PolicyMap per node rather than per worker? It seems like the policy map should be a shared resource on a node instead of being duplicated in every worker on that node. A rough sketch of what that could look like is below.
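Purely to illustrate the idea (this is not an existing RLlib feature), a node-level store could be a named Ray actor that workers on a node query for weights instead of each holding a full PolicyMap copy; all names here are hypothetical.

```python
import ray


@ray.remote
class NodePolicyStore:
    """Hypothetical per-node policy weight store (not an existing RLlib API).

    One instance per node; rollout workers on that node fetch weights from it
    instead of each caching their own copies of rarely used policies."""

    def __init__(self):
        self._weights = {}

    def put(self, policy_id, weights):
        self._weights[policy_id] = weights

    def get(self, policy_id):
        return self._weights.get(policy_id)


# Creation (once per node), using Ray named actors so workers can look it up:
# store = NodePolicyStore.options(name="policy_store_node0", lifetime="detached").remote()
#
# Worker-side lookup when a policy misses the local cache:
# store = ray.get_actor("policy_store_node0")
# weights = ray.get(store.get.remote("challenger_42"))
```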
Versions / Dependencies
Ray Version:
TensorFlow
Python
OS
Reproduction script
Anything else
No response
Are you willing to submit a PR?