Open destin-v opened 3 months ago
Thank you @destin-v ! I'll look into it on the weekend and will run some performance tests
@MischaPanch, to get optimal results from Ray I recommend providing a vectorized environment to each of the Ray remote workers. Running the vectorized `step()` function on those environments should then take approximately 1 second; the idea is that you want to saturate the CPUs with at least 1 second of work per call.
Then gather all of the Ray object references (sometimes called futures) into a list and perform a single `ray.get()` on that list. This deserializes the references into their true values.
If done properly, you will see a speedup because the number of serializations/deserializations decreases and your CPU utilization increases.
Let me know if you need any help.
Ok, I looked into this in more detail. You are right, the design is suboptimal. There is, however, the broader question of whether the Ray worker is important for Tianshou at all. The only reason I see for using it is if one wants to run training on a multi-node cluster. This is currently not the main focus of Tianshou; there are many more important problems to fix before that. Several practical algorithms like SAC cannot be arbitrarily parallelized anyway. There's also rllib for a tight integration with Ray (I had extremely bad experiences with it, but still).
The solution of this issue would require quite some work. Is this feature important for you? Are you using multi-node training, and if not, is the current subproc vectorized env not sufficient for you?
Related to #1133
I am interested in Tianshou's RayVecEnv because I have access to a multi-node computing infrastructure. I would be willing to contribute a new design for RayVecEnv. Due to multiple priorities I may not get to it in the near term. But looking at the code for RayVecEnv I think it will require a complete rewrite to get a double buffering solution.
RLLib is heavily abstracted and inherits from many classes that are not needed for RL (e.g., Tune). This has led to coupling issues where bugs propagate across RLLib, making it difficult to fix or debug. Even though RLLib provides multi-node support, it has its own unique problems.
@destin-v fully agree, in fact rllib's large number of problems is the main reason I'm investing significant effort into tianshou. I feel that tianshou might strike the right balance: useful abstractions that are still not overwhelming for researchers, without being as brittle and bug-prone as rllib.
I also agree that RayVecEnv needs to be redesigned from the ground up, which is precisely why I was reluctant to do it myself now. If you want to collaborate on this, I'm happy to discuss, review, and participate to some extent. Let me know when you have time and let's come back to this issue then.
A set of benchmarks for `RayVecEnv` showing the scaling efficiency across distributed nodes. Each environment has a `step()` function that sleeps for x seconds, and the `step()` calls in `RayVecEnv` are counted. Fast `step()` functions will be sub-optimal because they make many communication calls; slow `step()` functions will be more optimal because they make fewer communication calls.

> [!NOTE]
> - SPS: Steps per Second
> - `Number of Nodes` designates individual computing machines. The number of `cores` on each `node` is 48, hence the total `Number of Cores` is always `Number of Nodes` multiplied by `48`.
> - `Ideal SPS` is the `Number of Cores` divided by `Step Duration`. This assumes that communication costs are zero when using Ray.
> - `Efficiency` is equal to measured `SPS` divided by `Ideal SPS`.
> - The `Efficiency` column is the single best metric to evaluate whether adding distributed computing will benefit your training process.
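As a quick sanity check, the derived columns of the benchmark table can be recomputed directly from these definitions. The row values below are taken from the 2-node, 0.1 sec row of the table.

```python
def ideal_sps(num_cores: int, step_duration: float) -> float:
    """Upper bound on steps per second if communication were free."""
    return num_cores / step_duration

def efficiency(measured_sps: float, num_cores: int, step_duration: float) -> float:
    """Measured SPS as a fraction of the ideal SPS."""
    return measured_sps / ideal_sps(num_cores, step_duration)

# 2 nodes x 48 cores = 96 cores, 0.1 s per step, measured 680.66 SPS
print(round(efficiency(680.66, 96, 0.1) * 100))  # prints 71
```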
Number of Nodes | Number of Cores | Step Duration (sec) | SPS | SPS/Core | Ideal SPS | Efficiency |
---|---|---|---|---|---|---|
2 | 96 | 1 | 91.97 | 0.96 | 96 | 96% |
2 | 96 | 0.1 | 680.66 | 7.09 | 960 | 71% |
2 | 96 | 0.01 | 3,130.63 | 32.61 | 9,600 | 33% |
2 | 96 | 0.001 | 3,941.47 | 41.06 | 96,000 | 4% |
4 | 192 | 1 | 181.92 | 0.95 | 192 | 95% |
4 | 192 | 0.1 | 1,285.94 | 6.70 | 1,920 | 67% |
4 | 192 | 0.01 | 3,962.63 | 20.64 | 19,200 | 21% |
4 | 192 | 0.001 | 4,089.34 | 21.30 | 192,000 | 2% |
8 | 384 | 1 | 351.10 | 0.91 | 384 | 91% |
8 | 384 | 0.1 | 2,153.68 | 5.61 | 3,840 | 56% |
8 | 384 | 0.01 | 4,348.30 | 11.32 | 38,400 | 11% |
8 | 384 | 0.001 | 4,226.37 | 11.01 | 384,000 | 1% |
16 | 768 | 1 | 660.87 | 0.86 | 768 | 86% |
16 | 768 | 0.1 | 3,118.94 | 4.06 | 7,680 | 41% |
16 | 768 | 0.01 | 4,210.96 | 5.48 | 76,800 | 5% |
16 | 768 | 0.001 | 4,113.67 | 5.36 | 768,000 | 1% |
32 | 1,536 | 1 | 1,191.83 | 0.78 | 1,536 | 78% |
32 | 1,536 | 0.1 | 4,040.86 | 2.63 | 15,360 | 26% |
32 | 1,536 | 0.01 | 4,303.75 | 2.80 | 153,600 | 3% |
32 | 1,536 | 0.001 | 4,267.77 | 2.79 | 1,536,000 | 0% |
These results show that environments with slow steps (>1 sec) greatly benefit from `RayVecEnv` in its current state. But for environments that step faster (<0.01 sec), the communication costs of Ray outweigh the benefits.
Note the key findings from the experimental results:

- At `Step Duration == 0.001 sec`, `Efficiency` was always in the low single digits.
- At `Step Duration == 1 sec`, `Efficiency` was always in the high double digits.

Since `Efficiency` is a measure of how close your process is to achieving the ideal speedup, it provides the best single metric for evaluating the performance of distributed computing.
> [!NOTE]
> Environments like Cartpole step very fast (<0.0001 sec), meaning there is likely no benefit to retrieving a single step of Cartpole from a distributed core under the current `RayVecEnv` implementation. But if 1M steps of Cartpole are aggregated on the remote core and then sent back to the head node over Ray, the communication efficiency would go way up.
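One way to see why efficiency collapses for fast environments is a back-of-the-envelope cost model. This is my own sketch, not the benchmark code: the fixed per-call `overhead` is an assumed parameter, and the model ignores the head-node bottleneck that caps the measured SPS around 4,000 in the table.

```python
def modeled_sps(num_cores: int, step_duration: float,
                overhead: float, steps_per_call: int = 1) -> float:
    """Steps per second if every remote call pays a fixed overhead.

    `overhead` is a hypothetical per-call serialization/communication
    cost in seconds; aggregating `steps_per_call` steps amortizes it.
    """
    call_time = steps_per_call * step_duration + overhead
    return num_cores * steps_per_call / call_time

# With a hypothetical 50 ms per-call overhead on 96 cores:
slow = modeled_sps(96, 1.0, 0.05)     # slow env: overhead barely matters
fast = modeled_sps(96, 0.001, 0.05)   # fast env: overhead dominates
batched = modeled_sps(96, 0.001, 0.05, steps_per_call=1000)  # amortized
```

In this toy model, batching 1,000 fast steps per call recovers most of the lost throughput, which mirrors the aggregation idea in the note above.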
@destin-v thank you for this very thorough and clear evaluation! I will include these results in the docs for the next release, if you don't mind.
It is unfortunate that the overhead is so large. Envs that take a full second per step are kind of doomed anyway: even with simple envs, agents need roughly 1e6 steps to converge to anything, and highly parallelizable algos will need even more. So having such a slow env in the first place is pretty much a no-go. Some very slow envs might be around 0.1 seconds per step, but at least for research purposes that's way too slow.
Your results just confirm to me that optimizing for multi-node scenarios should not be a focus of Tianshou for now; do you agree? I'm not saying it's generally irrelevant, just that in most situations a single node seems like the better way to go. Clouds offer VMs with 96 cores, and with AMD CPUs one can get even more.
Or do you think that with the redesign allowing aggregation we would get significant benefits from multi-node? Most off-policy algos don't aggregate all that much before performing updates in the main process, but some high-throughput algorithms do (IMPALA, Ape-X).
We could consider implementing them together with a redesigned ray worker. In this scenario, multi-node might become useful, and better supporting it would make some sense.
EDIT: after looking at the whole conversation again I am even more convinced that it's not purely a worker-redesign issue, but rather tightly coupled to algorithms and their parametrizations. Not all parametrizations of all algos would be able to make use of either of the options that you presented. In fact, option one is very close to the impala algo, if I'm not mistaken
I agree that any redesign of `RayVecEnv` would only benefit a few of the algorithms. Algorithms like IMPALA can parallelize data collection even while the update process is happening because they use V-trace. There's also asynchronous PPO, which can take advantage of parallel data collection and updating.
I think scaling to multiple nodes gives users the ability to push the boundaries of state-of-the-art performance. At the moment Tianshou is designed for vertical scaling (scaling resources on a single node), but horizontal scaling (scaling resources across multiple nodes) will likely provide the best opportunities for improving performance.
I am still thinking through the best way to achieve a good solution for `RayVecEnv`.
> [!NOTE]
> One upgrade that may provide a speed boost to all algorithms is double-buffered sampling. This simply has the worker perform a batch of steps on the environment while waiting for the next inference.
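A minimal sketch of the double-buffering idea, assuming a single worker: `DummyEnv`, `collect_batch`, and `fake_inference` are hypothetical stand-ins, not Tianshou APIs. A background thread steps the environment for the next batch while the main loop runs "inference" on the previous one.

```python
from concurrent.futures import ThreadPoolExecutor

class DummyEnv:
    """Stand-in environment; step() just returns an incrementing counter."""
    def __init__(self):
        self.t = 0
    def step(self):
        self.t += 1
        return self.t

def collect_batch(env, n):
    """Worker side: perform a batch of environment steps."""
    return [env.step() for _ in range(n)]

def fake_inference(batch):
    """Stand-in for the policy forward pass on the previous buffer."""
    return batch

env = DummyEnv()
consumed = []
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(collect_batch, env, 4)      # buffer 0 in flight
    for _ in range(3):
        batch = future.result()                      # wait for the ready buffer
        future = pool.submit(collect_batch, env, 4)  # start the next buffer...
        consumed.extend(fake_inference(batch))       # ...while "inference" runs
    consumed.extend(future.result())                 # drain the last buffer
```

When the real inference takes comparable time to a batch of steps, the two overlap and the worker is never idle.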
I generally agree. We can start looking into this and implementing some multi-node things. However, it's not going to be as simple as changing the workers. The Collector right now is not able to deal with receiving a batch of steps. It will likely be necessary to separate out an EpisodeCollector from the current logic (which can collect both steps and episodes) in order to deal with batched workers, since special care needs to be taken when an episode finishes, before attempting to build a suitable Collector. Moreover, we currently don't even have a proper Collector interface...
I propose the following:
An `experimental.multi_node` package where new things can be tested out, like a `BatchedStepWorker` and Impala and Apex. Independently of that, the tianshou core team is working on including automated benchmarking for algos, which will help evaluate the results of step 3. Once those are firmly established, we can move it out of experimental.
Wdyt @destin-v ?
Sounds great, I'll take a look at the double-buffered sampling and send a pull request when it is ready.
I have been busy with papers and a conference, but I have not forgotten about this topic. Once I get through my presentations I will have more time to devote to this. My initial experiments confirm that buffering improves Steps per Second (SPS).
I checked with my university employer and it appears that there are restrictions preventing me from directly contributing to an open-source repository. However, I am able to publish papers and open-source code as a researcher for my university. I am working on a white paper describing ways to improve throughput in multi-node settings. When the paper is ready for release, I'll share more details.
Description
The issue with the `RayEnvWorker` is that it is almost guaranteed to be slower than the other vectorized environments because of how it is designed. Tianshou's `RayEnvWorker` (shown below) creates an environment on each Ray worker and calls `send()` and `recv()` on the environments to `step` or `reset`. When a Ray worker receives a `step` or `reset` command, it generates an object reference which is sent back to the head node. At this point `RayEnvWorker` calls `ray.get` to dereference the object. This process involves serialization every time you send an object and deserialization every time you dereference an object. Hence, the communication cost is high, while the actual processing done on the worker is trivial (recall we are just issuing a `step` or `reset` command). The bottom line is that `RayEnvWorker` should be rewritten to heavily utilize the CPU and minimize communication costs.

Options
1. Give each `RayEnvWorker` its own copy of the neural network so it can take actions without communicating with the head node. Once you have enough samples, use `ray.get` to pull the data over the network.
2. Batch many `step` or `reset` calls into a single remote call. This would make more use of the CPUs and reduce the number of communication calls.
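As a rough illustration of option 2 (all class and method names here are hypothetical, not Tianshou's API): a worker that accepts a list of actions and executes them in one call returns many transitions per communication round trip, instead of one.

```python
class ToyEnv:
    """Stand-in environment: state is a step counter, reward equals the action."""
    def __init__(self):
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):
        self.state += 1
        return self.state, float(action), False  # obs, reward, done

class BatchedEnvWorker:
    """Hypothetical worker executing many env steps per remote call."""
    def __init__(self, env):
        self.env = env
    def reset(self):
        return self.env.reset()
    def step_batch(self, actions):
        # One round trip returns len(actions) transitions at once.
        return [self.env.step(a) for a in actions]

worker = BatchedEnvWorker(ToyEnv())
worker.reset()
transitions = worker.step_batch([1, 2, 3])
```

In a real Ray deployment, `step_batch` would be the remote method, so the serialization cost is paid once per batch rather than once per step.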