kengz / SLM-Lab

Modular Deep Reinforcement Learning framework in PyTorch. Companion library of the book "Foundations of Deep Reinforcement Learning".
https://slm-lab.gitbook.io/slm-lab/
MIT License

How can I train with multiple computers? #400

Closed: lidongke closed this issue 5 years ago

lidongke commented 5 years ago

Hi~ How can I train with multiple computers? I haven't seen anywhere to set an address to connect to. Does the "distributed" field in the spec JSON work for this? @kengz

kengz commented 5 years ago

Hi @lidongke, this is currently not a feature; the lab is meant to run within a single machine, although a single machine can already be quite big. Multi-machine training is a use case the lab has not encountered, so you'll likely need to write custom code that modifies or imports the lab. We do not plan to support this soon, but here's a reference to get you started: https://pytorch.org/docs/stable/distributed.html
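
For anyone landing here later, here is a minimal sketch (not SLM-Lab code) of how `torch.distributed` connects processes across machines. It assumes the standard environment variables (`MASTER_ADDR`, `MASTER_PORT`, `RANK`, `WORLD_SIZE`) are set on each machine, e.g. by a launcher:

```python
import torch.distributed as dist

def init_distributed():
    # Joins the process group using env vars set per process:
    # MASTER_ADDR/MASTER_PORT (address of rank 0's machine),
    # RANK (this process's id), WORLD_SIZE (total process count).
    dist.init_process_group(backend="gloo")  # "nccl" for GPU tensors
    print(f"worker {dist.get_rank()}/{dist.get_world_size()} joined")

def average_gradients(model):
    # All-reduce each parameter's gradient so every worker applies
    # the same averaged update (synchronous data-parallel training).
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```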

lidongke commented 5 years ago

I see that Ray can work for multi-machine setups. Do you know if it's easy to add that to the lab?

kengz commented 5 years ago

You can probably start with the Ray documentation https://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html , set up the cluster machines, and pass those cluster configs into the ray.init(...) calls in SLM Lab. Note that this means the parallelized runtimes are distinct Trials, so they will contain different instances of an algorithm. If you're trying to, say, run a massive Hogwild parallelization of 1 algorithm with many workers across multiple machines, that is not the use case.
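
A rough sketch of what that could look like, assuming a cluster started with the `ray start` commands from that page (note the connect kwarg has changed across Ray versions: older releases used `redis_address=...`, newer ones use `address=...`):

```python
import ray

# Assumed cluster setup, per the Ray cluster docs:
#   head machine:    ray start --head --port=6379
#   worker machines: ray start --address=<head_ip>:6379
ray.init(address="auto")  # connect to the running cluster

@ray.remote
def run_trial(trial_index):
    # Stand-in for a lab Trial: Ray may schedule each call on any
    # machine, so each call is an independent algorithm instance.
    return f"trial {trial_index} done"

print(ray.get([run_trial.remote(i) for i in range(4)]))
```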

lidongke commented 5 years ago

That is correct: I'm trying to run 1 algorithm with many workers across multiple machines to increase sampling efficiency.
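
For reference, the pattern described here (many samplers feeding one learner) has roughly this shape; a hypothetical sketch with Ray actors, not something SLM-Lab provides:

```python
import ray

ray.init(address="auto")  # assumes an existing Ray cluster, as above

@ray.remote
class RolloutWorker:
    """Hypothetical sampler: would hold an env and a policy copy."""
    def sample(self, policy_weights, n_steps=128):
        # A real worker would step its env with the policy; this
        # placeholder payload stands in for a trajectory.
        return {"steps": n_steps}

workers = [RolloutWorker.remote() for _ in range(8)]
weights = None  # learner's current policy weights (placeholder)
batches = ray.get([w.sample.remote(weights) for w in workers])
print(f"collected {sum(b['steps'] for b in batches)} env steps")
# A single learner would now update on the batches and broadcast
# fresh weights before the next sampling round.
```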

lidongke commented 5 years ago

You can close this issue, thanks!