lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Does Katago support cross machine search for analysis mode? #760

Open Vincentwei1021 opened 1 year ago

Vincentwei1021 commented 1 year ago

Hi, I'm curious whether KataGo supports cross-machine search (that is, multi-node, multi-GPU) for playing matches or for analysis mode. If not, any advice on which parts of the code should be modified to add this feature? Thanks in advance @lightvector

OmnipotentEntity commented 1 year ago

KataGo already supports multiple GPUs on a single machine. For multiple nodes, you'll need to set up some sort of tree search server, which sends positions to nodes; each node then processes its position and sends the updates back. The parts of the code you'll need to edit to make this happen are in /cpp/program, where you'd add new subcommands for a master and a client (you'll need to define some sort of communication protocol, probably JSON-based, since KataGo already includes JSON support). You'll also need to edit files under /cpp/search to create an additional kind of SearchThread that handles remote dispatch.

I would imagine it to be a reasonably large undertaking, but probably a straightforward one.
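To make the master/client idea concrete, here is a minimal sketch of what a JSON-based request/response pair could look like. Everything here is hypothetical: the message fields, function names, and move encoding are invented for illustration and are not part of KataGo's codebase.

```python
import json

def encode_eval_request(request_id, moves, rules, komi):
    """Serialize a hypothetical position-evaluation request that a
    master search server would send to a worker node."""
    return json.dumps({
        "type": "eval_request",
        "id": request_id,
        "moves": moves,        # e.g. ["B Q16", "W D4"]
        "rules": rules,
        "komi": komi,
    })

def decode_eval_response(raw):
    """Parse a worker's reply: the evaluation the master's search
    thread would merge back into its tree."""
    msg = json.loads(raw)
    assert msg["type"] == "eval_response"
    return msg["id"], msg["winrate"], msg["policy"]

# Round-trip demo with a fake worker reply standing in for the network:
req = encode_eval_request(42, ["B Q16", "W D4"], "japanese", 6.5)
reply = json.dumps({"type": "eval_response", "id": 42,
                    "winrate": 0.47, "policy": {"C16": 0.31, "Q3": 0.12}})
rid, winrate, policy = decode_eval_response(reply)
```

The `id` field matters in practice: with many requests in flight, the master needs it to match each reply to the pending search-tree node that asked for it.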

Vincentwei1021 commented 1 year ago

That sounds complicated to me. Do you know of any existing work that I could follow or even reuse? Thanks @OmnipotentEntity

OmnipotentEntity commented 1 year ago

No. To my knowledge, there are no open-source projects or libraries that provide something like that seamlessly, and I'm not sure anything could, in a widely applicable way. What is your use case exactly? Do you actually have a few servers that you want to combine into a single distributed Go engine?

Vincentwei1021 commented 1 year ago

Yes, I have a few servers and would like to combine their power to play matches or to analyze.

Vincentwei1021 commented 1 year ago

So, is there any plan to support this feature in the future? @lightvector

thynson commented 1 year ago

Distributed MCTS could be too complicated to implement; instead, implementing a virtual-GPU backend that distributes work to the cluster may be more viable.
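The virtual-GPU idea can be sketched as a backend that exposes the same batch-evaluation interface as a local GPU, but fans each batch out to remote workers. The mock-up below is purely illustrative: the class and method names are invented, and the "workers" are in-process callables standing in for network calls to other machines.

```python
from concurrent.futures import ThreadPoolExecutor

class VirtualGPUBackend:
    """Looks like one GPU to the search code, but splits each batch of
    positions across several workers and reassembles the results in the
    original order."""

    def __init__(self, workers):
        self.workers = workers
        self.pool = ThreadPoolExecutor(max_workers=len(workers))

    def evaluate_batch(self, positions):
        n = len(self.workers)
        # Assign positions round-robin: worker i gets positions i, i+n, ...
        shards = [positions[i::n] for i in range(n)]
        futures = [self.pool.submit(w, shard)
                   for w, shard in zip(self.workers, shards)]
        # Interleave shard results back into the original batch order.
        merged = [None] * len(positions)
        for i, future in enumerate(futures):
            merged[i::n] = future.result()
        return merged

# Two fake "remote" workers that just tag each position they evaluate.
worker_a = lambda batch: ["A:" + p for p in batch]
worker_b = lambda batch: ["B:" + p for p in batch]

backend = VirtualGPUBackend([worker_a, worker_b])
out = backend.evaluate_batch(["p0", "p1", "p2", "p3", "p4"])
```

The appeal of this design is that the search code stays untouched: only the neural-net evaluation layer changes, which is a much smaller surface than distributing the MCTS tree itself. The trade-off is that every evaluation pays network latency, so batching and pipelining would matter a lot in a real implementation.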