lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

training speed for H200 #940

Open horos22b opened 4 months ago

horos22b commented 4 months ago

all,

With modern hardware, what training speed would one expect with a single H200 GPU, i.e., how fast could you go from zero to the current Elo?

I am curious to see how the ending Elo would change given different starting conditions for the network weights and the training history, as well as how different the ending playing styles would be across the resulting network variants. Until now that would have been prohibitively expensive; I'm hoping the increase in speed changes that.

(Note: I just realized the H200 hasn't come out yet, but the same question applies to the H100. How much does that change the math on doing the above experiment?)

lightvector commented 4 months ago

How much faster is it? I would think that it would be many, many years. Even if it's a good GPU, one GPU isn't much.

But if you merely want to get a network to human pro level or a bit beyond it rather than to match current bots, it's already the case that one RTX 30xx gaming GPU is probably enough to get there in several weeks to a few months, and of course proportionally less if you have more than one GPU. So with existing hardware it should already be practical for people to train from scratch to reach a very strong level, and to do some experiments of the kind you mention.

horos22b commented 4 months ago

David,

OK, I'm doubtful, given the specs shown here:

https://www.reddit.com/r/LocalLLaMA/comments/188boew/multiple_4090s_instead_of_h100/#:~:text=4090%20is%20made%20for%20playing,for%20memory%20and%20compute%20efficiency

In short, from that post, an RTX 4090 has an FP16 performance of ~150-300 TFLOPS, whereas the H100's is around 1000 TFLOPS. Drop down to FP8 and the H100 reaches around 2000 TFLOPS. So as long as KataGo doesn't rely on 32-bit precision operations, that's on the order of 8x the performance of a high-end gaming GPU. The H100's memory bandwidth is also around 3 times that of an RTX 4090, and H100s can be linked together to work almost as fast as a single larger GPU.
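For concreteness, that arithmetic works out as follows. These are peak spec numbers quoted from the post above; measured training speedups are typically well below peak-TFLOPS ratios:

```python
# Rough throughput ratios from the spec numbers quoted above. These are peak
# marketing TFLOPS, not measured training throughput, so real gains are smaller.
rtx4090_fp16 = 150.0   # low end of the ~150-300 TFLOPS range cited
h100_fp16 = 1000.0
h100_fp8 = 2000.0

print(h100_fp16 / rtx4090_fp16)  # ~6.7x at FP16
print(h100_fp8 / rtx4090_fp16)   # ~13.3x if training tolerates FP8
```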

So again, I'm not so sure. Has anyone benchmarked it? Given how much RAM an H100 has, one thing I'd think possible would be to bulk-train multiple KataGo instances, i.e., stripe them across the GPU and run multiple training runs in parallel.

Ed


lightvector commented 4 months ago

Thanks. I guess I don't know what you're doubtful about? Both of the things I said still sound true. 8x isn't much to compete with a run that has gone on for years, with contributors who have collectively donated far more than 8 GPUs at a time, so it should still take years if you insist on catching up all the way to the current run. And even if an H100/H200 would be faster, it's still true that a gaming GPU right now is enough to do meaningful experiments. As I just said, it's already possible for existing users to train KataGo from scratch to top human strength in practical amounts of time, and top human strength is a much easier target that is still enough to research some meaningful things about training.

lightvector commented 4 months ago

Anyways, the GPU you mention sounds impressive and would certainly make doing experiments faster and easier. I tend to think of it all as a continuum: saying that it's "prohibitively expensive" to do research, and then one new GPU release suddenly makes it not "prohibitively expensive", feels a bit too discrete. Rather, right now AlphaZero-like training is already possible for individuals to replicate in many different cases, even in the full game of Go, enough to do research already. But yes, each new improvement in hardware makes everything a bit faster.

horos22b commented 4 months ago

David,

I'm going off of this paper:

https://arxiv.org/abs/1902.10565

I'm assuming you are the paper's author?

In particular, this summary:

By introducing several improvements to the AlphaZero process and architecture, we greatly accelerate self-play learning in Go, achieving a 50x reduction in computation over comparable methods. Like AlphaZero and replications such as ELF OpenGo and Leela Zero, our bot KataGo only learns from neural-net-guided Monte Carlo tree search self-play. But whereas AlphaZero required thousands of TPUs over several days and ELF required thousands of GPUs over two weeks, KataGo surpasses ELF's final model after only 19 days on fewer than 30 GPUs. Much of the speedup involves non-domain-specific improvements that might directly transfer to other problems. Further gains from domain-specific techniques reveal the remaining efficiency gap between the best methods and purely general methods such as AlphaZero. Our work is a step towards making learning in state spaces as large as Go possible without large-scale computational resources.

Those were ~30 GPUs in 2017, right? The rough performance of a V100 was 13.5 TFLOPS FP32 (27 TFLOPS FP16), so that's a rough multiplier of 100 right there in training speed for the H100. As far as RAM goes, it's 188GB vs 16GB, so a factor of 10 or so.

So getting to ELF's final model should take considerably less than the 19 days stated in the paper given the same training resources. At least that would be the hope.

In any case, I think it's worth doing a benchmark. Looking online, I see that an H100 can be leased for ~$2/hour, so about $50/day, or maybe $500 for a full training run to ELF's level. Since the 'regular runs' were around 5 months on legacy hardware, even assuming parity you could do a similar run on 1 H100 for about $1500/month, or $7500/run.
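Working that estimate out from the quoted lease price:

```python
# Back-of-envelope cost estimate from the quoted ~$2/hour H100 lease price.
usd_per_hour = 2.0
usd_per_day = usd_per_hour * 24            # ~$48/day, i.e. the ~$50/day above
usd_per_month = usd_per_day * 30           # ~$1440/month, i.e. the ~$1500/month
full_run_months = 5                        # length of the original 'regular runs'
print(usd_per_month * full_run_months)     # ~$7200, i.e. the ~$7500/run figure
```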

Anyways, I'd be willing to do such a benchmark (the original run), assuming that such a run is pretty much scripted out of the box and there isn't too much manual work needed to set it up. Questions though:

  1. How much RAM does the average KataGo run use on a per-process basis?
  2. How bottlenecked is a given run on CPU vs. GPU?
  3. How parallelizable is a training run?
  4. Can KataGo use FP16 or even FP8 for training? If so, how much does the lower precision affect the Elo?

Ed

(PS: I see in retrospect that I didn't answer your original point, i.e.: "Rather, right now AlphaZero-like training is already possible for individuals to replicate in many different cases, even in the full game of Go, enough to do research already." Normally I'd agree with you.

But I'm most interested in seeing how much variety there is across training runs, which is a different question from looking at the ultimate result of a specific training run. Hence I'd need, if possible, hundreds of samples, all starting from a given beginning state and given the same amount of training effort. I'd also need consistent metrics to compare them and quantify their variety. Furthermore, each network would need to run for a long period of time to make sure that most inefficiencies had been eliminated and that they were starting to converge.

Ultimately, my guess is that they will be wildly different in tactics and strategy but with some core similarities. That is just a guess (one I'm making based on studies of identical twins and the wide variety of Go games), but it would be a very basic question to answer. We can't run the actual world multiple times, after all, but we can simulate it multiple times from the same starting conditions, and this would be a nice artificial petri dish for studying such topics.)


lightvector commented 4 months ago

Yes, that was my paper. Are you interested only in getting to ELF's strength, or only to the relatively weak strength we reached in the paper? In your initial post you didn't ask for that; you asked for:

how fast could you go from zero to the current Elo?

Which is what I said seems like it would take many, many years. (Aside: although if an H100 really is a 100x speedup over a V100 for practical uses, and it's not just marketing claims from an overly idealized benchmark, then perhaps it would be something like many, many months rather than many, many years.)

That's why in every reply so far I've kept emphasizing that it's worth first considering more carefully your goal and what level is sufficient. ELF is vastly weaker than modern bots, far below the "current Elo", so if you would be satisfied to reach only ELF's level, then of course it's cheaper. Remember, the work you've cited is from way back in 2018/2019. It's 2024 now, so of course the top bots have already gone way further.

Whether you draw the line at "just beyond top human level", at "ELF's strength", at "current Elo", or even at "100x beyond any level we've reached yet" is ultimately a little arbitrary, in the sense that ALL of them are far enough into runs to meaningfully observe scaling trends and to do useful research. Each successive one is a little further along in scaling and marginally better for such research, but because strength is ultimately logarithmic in compute invested, each one costs exponentially more. If the kind of research you imagine could already be done at a weaker level than "current Elo", then you don't need an H100. Or you could still use an H100, but do the research massively faster.
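To illustrate that logarithmic relationship concretely, here is a toy model; the Elo-per-doubling figure below is an assumed number for illustration only, not a measured KataGo statistic:

```python
# Toy model of "strength is logarithmic in compute": assume each doubling of
# training compute buys a roughly constant Elo gain. The 100 Elo per doubling
# is purely an assumed illustration, not a measured number.
ELO_PER_DOUBLING = 100.0

def compute_multiplier(elo_gap: float) -> float:
    """Compute multiplier needed to close an Elo gap under the toy model."""
    return 2.0 ** (elo_gap / ELO_PER_DOUBLING)

print(compute_multiplier(500))   # ~32x compute for +500 Elo
print(compute_multiplier(1000))  # ~1024x compute for +1000 Elo
```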

So because the gains are only incremental while the costs are exponential, it's extremely important to be clear about what level you wish to reach that would be sufficient for your research.

It's often the case that when hardware gets 1-2 orders of magnitude faster, software needs to be refactored or rewritten a little to take full advantage of it, and KataGo's repo is no exception. If you're interested in doing the work and/or contributing improvements, you're welcome to look at https://github.com/lightvector/KataGo/blob/master/SelfplayTraining.md. The closest thing KataGo has to a single script that runs everything is the "synchronous_loop.sh" script documented there, with further documentation via inline comments and configuration: https://github.com/lightvector/KataGo/blob/master/python/selfplay/synchronous_loop.sh. By default the script is configured in a way that probably makes more sense for a casual user on a single RTX 20xx or 30xx, so one would probably need to test and adjust a lot of things, and probably switch to asynchronous training once comfortable with the greater complexity of setting that up, since asynchronous is more efficient.
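For a many-runs study on top of that, the outer loop could be as simple as the following sketch. The arguments passed to synchronous_loop.sh here are placeholders; the script's real usage and defaults are documented in its own header comments and would need retuning for H100-class hardware:

```python
import subprocess

# Hypothetical sketch: launch several independent from-scratch runs in sequence.
# The positional arguments below are placeholders -- consult the header comments
# of python/selfplay/synchronous_loop.sh for the actual usage and defaults.
for i in range(3):
    base_dir = f"/data/katago-variety-study/run{i:03d}"
    subprocess.run(
        ["bash", "python/selfplay/synchronous_loop.sh", base_dir],
        check=True,  # each run accumulates selfplay data and models under base_dir
    )
```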

lightvector commented 4 months ago

By the way, if you're serious about doing hundreds of runs to study the variety of play that results, let me know how I can support you or if you have specific questions I can answer. I think it would be a cool thing to study.

Note that we already have some data about how much different runs vary. In the past we've seen many independent reproductions of AlphaZero-like training, such as ELF, Leela Zero, MiniGo, SAI, and KataGo. All of them ended up with relatively similar opening styles, and all of them ended up pretty similar overall in common tactics.

So I think we can already rule out the idea that they will be "wildly different in tactics and strategy but with some core similarities". Rather, it's more likely they will all be extremely similar, but with subtler differences in some specific situations. But the differences could still contain a lot of interesting things to study! For example, the various bots differed nontrivially in how fallible they were in specific complex sequences, such as Mi Yuting's "flying dagger" joseki. Additionally, KataGo has a ladder solver, but if you disabled it you could study ladder learning; ladders are a strange beginner-level tactic that AlphaZero-style bots find difficult to learn without a solver. I recall ELF and Leela Zero ended up relatively different in whether they tended to err in thinking ladders would work or tended to err the other way, and I don't think anyone knows why, or whether those differences are random or reproducible.

horos22b commented 4 months ago

David,

Thanks for the thoughts. I'm pretty confident that the H100 and H200 aren't just marketing hype; after all, all of the chatbot activity and progress surrounding them is due to these GPUs (if you can even call them GPUs, which is a bit of a stretch; afaict they are more like supercomputers on a circuit board), and there is an incredible amount of pressure to further improve them. Whether or not they are readily available to people outside a small clique of large companies is another question. I did a little research, and each of the places that lease them makes you go through an intake interview before granting access, so it might be some time before they become truly available to the general public.

As for similarity vs. difference, I think that can be automated. If KataGo exposes its evals for a position, you could train to various intervals, do self-play at specific points, and pick out positions at different stages of the game where the bot shows a large number of potential moves in its initial eval. You could then compare various instances to see how much variability there is in their evals and track that variability over time as the bots mature.
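A sketch of how that comparison might look using KataGo's JSON analysis engine ("katago analysis"). The model paths, config name, and example moves here are placeholders, and the exact protocol is documented in the repo's Analysis_Engine.md:

```python
import json
import subprocess

def move_distribution(model_path, moves, max_visits=200):
    """Query one KataGo net on one position; return move -> visit fraction."""
    proc = subprocess.Popen(
        ["katago", "analysis", "-config", "analysis.cfg", "-model", model_path],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    query = {
        "id": "q0",
        "moves": moves,               # e.g. [["B", "Q16"], ["W", "D4"]]
        "rules": "tromp-taylor",
        "komi": 7.5,
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": [len(moves)],
        "maxVisits": max_visits,
    }
    out, _ = proc.communicate(json.dumps(query) + "\n")
    response = json.loads(out.splitlines()[-1])
    total = sum(info["visits"] for info in response["moveInfos"])
    return {info["move"]: info["visits"] / total for info in response["moveInfos"]}

# Variability between two run snapshots on one position, as total-variation
# distance between their move distributions (model filenames are placeholders).
a = move_distribution("run-a-final.bin.gz", [["B", "Q16"], ["W", "D4"]])
b = move_distribution("run-b-final.bin.gz", [["B", "Q16"], ["W", "D4"]])
tv = 0.5 * sum(abs(a.get(m, 0.0) - b.get(m, 0.0)) for m in set(a) | set(b))
print(f"total-variation distance between evals: {tv:.3f}")
```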

You could also track specific patterns in play (assuming you have a pattern matcher), like ko frequency, ladder frequency, and joseki frequency. So yes, lots of stuff to study. You could also check whether there is variability in Elo at the endpoint, and if there is, pick a couple of 'winner' bots to train further and see if that ends up as an overall boost in Elo.

So yes, tons of stuff to study. Now that I'm thinking about it, most of the framework to support this research could be built on a legacy GPU and ported to an H100 later. But it only becomes truly interesting, IMO, if the bots get strong enough; hence my original question.

Ed


lightvector commented 4 months ago

hence my original question.

I don't know what your original question is beyond "how fast could you go from zero to the current Elo?", and I've answered that. But your subsequent posts have indicated that you might be conflating "current Elo" with ELF's level, and so on. If all you care about is studying style differences, ELF's level, though vastly weaker than KataGo now, should be plenty. H100 or not, it would be extremely wasteful of resources to replicate the entire current run repeatedly. As I mentioned before, I would guess that even barely-stronger-than-human level might already be an interesting point for studying style evolution and distribution. That level is already reachable on home consumer GPUs, and if it were enough, rather than ELF's level, it would be another large multiplicative factor cheaper and faster on top of whatever GPU you use.

lightvector commented 4 months ago

Anyways, if you're interested, go for it!

GPU strength is very likely non-blocking for bot-playing-style research; it's a very-nice-to-have rather than something critical. Even if you lose a factor of 20 in GPU strength relative to what you wanted, that just means you train each run 7x less far and do 3x fewer samples, and voila, you've made up the factor of 20. And given that Go bots already massively overshoot top human strength and are all so converged that they arrive at very similar high-level opening preferences and tactics and agree with each other on most positions, and also given that strength gains are logarithmic (i.e., highly diminishing-returns) in compute anyway, a factor of 5x-10x on a run is probably okay to give up for this kind of purpose, even if it's very nice to have.
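Spelled out, that tradeoff is just:

```python
# Sanity check on the factor-of-20 tradeoff described above.
train_less_far = 7     # train each run ~7x less far
fewer_samples = 3      # run ~3x fewer independent samples
print(train_less_far * fewer_samples)  # 21, recovering the lost ~20x
```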