lightvector / KataGo

GTP engine and self-play learning in Go
https://katagotraining.org/

Why does a superhuman-level AI become incredibly weak against an AI with more compute? #356

Closed · HackYardo closed this issue 3 years ago

HackYardo commented 3 years ago

[attached image: IMG_20201126_005838]

In winter, temperatures are low and electronic devices dissipate heat faster, so I tried KataGo 20 blocks 6400 visits vs. Golaxy Tiger Level (equal to KataGo 20b 150 visits). The former (White) won by occupying almost the entire board. It is strange: why does a superhuman-level AI become so vulnerable?

portkata commented 3 years ago

Are you sure the Tiger is equal to the 20b with 150 visits? I beat the Tiger with KataGo 40b with 3 visits (1 thread). I thought it was around 6 dan, if we are thinking about the same bot. The Golaxy free pro 3-star bot? It would be good if someone put that bot on OGS or KGS. I think I beat the 6d KGS HiraBot with the Tiger, if I remember right.

HackYardo commented 3 years ago

@portkata When and how many times did you run the test? The Tiger's Elo (GoRatings) is 4000, which is higher than any person on Earth.

portkata commented 3 years ago

I just played one game with KataGo 40b 3 visits vs the Tiger. Kata was Black and won. I know Golaxy says that its Elo is high, but how do we know it is really 4000? I think it is weaker than any pro, but I hope you are right! That would mean Golaxy offers almost the full range of human play for free. The Ant is probably 17k. I like how Golaxy plays a much less attacking style. Too bad it is not open source.

lightvector commented 3 years ago

Someone ran KataGo with 1 visit for a while on KGS and its rank hovered around 8 dan KGS amateur. So I would somewhat doubt that 3 visits is already superhuman; 3 visits will not improve it that much. If the raw policy is high amateur dan, then KataGo 40b at 3-5 visits would probably be, at best, somewhere from weak to middling pro level.

Saying that the Elo of something "is 4000" is a bit unclear. In what system? Numbers in one system will often have no direct relationship to numbers in another system. Elo ratings are only as reliable as the data and the methodology that were used to compute them, and sometimes, even when computed with good data and methods, they will still not entirely generalize outside of a particular system or opponent pool.

portkata commented 3 years ago

Wow, 8 dan on 1 playout. Probably after a year of distributed training, KataGo with some randomization formula will be able to cover the whole range of amateur human play with 1 playout.

HackYardo commented 3 years ago

When Lee Sedol fought AlphaGo, we could see Lee Sedol's toughness, but when two AIs with different amounts of compute fight each other, the lower-compute one seems as brittle as glass and lacks toughness. Why is that?

HackYardo commented 3 years ago

Saying that the Elo of something "is 4000" is a bit unclear. In what system?

I don't know how Golaxy calculates its Elo, but I guess it uses the same system as https://www.goratings.org/en/ - the method is described here: https://www.remi-coulom.fr/WHR/ In this system, a new pro player will have 3200-3400 Elo, and a top player will have 3600-3800 Elo; these numbers are similar to Golaxy's numbers. However, estimating a superhuman-level AI's Elo is hard; the chess AI Stockfish has many different Elo figures on Wikipedia.

lightvector commented 3 years ago

I don't know how Golaxy calculates its Elo, but I guess it uses the same system as https://www.goratings.org/en/

That's my point - if you have to guess, then the Elo number is already not very meaningful. You can't just guess; the details matter.

It's a common misconception that Elos are some magical absolute numbers that are "out there in the real world" that you can measure. They really aren't - in general they're specific to a set of games. Using the exact same algorithm and method to calculate, player A and player B can be 300 Elo apart in one case and 400 Elo apart in another, purely because the opponents that they played against were different (again - using the same algorithm). Because saying that A and B are "N Elo apart" really just means "on average, against a particular set of opponents or in a particular set of games, taking into account the opponents' own performances, player A's odds of winning were on average X times more favorable than player B's", for some value of X. If you change the set of opponents, or the set of games, "X" can absolutely change, which means that the Elo difference between A and B you report will be different.
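
Here is a minimal sketch of what that odds-ratio-to-Elo relationship looks like numerically, using the standard Elo / Bradley-Terry scaling; the win rates below are made-up illustrative numbers, not measurements of any real bots or rating systems:

```python
import math

def elo_gap_from_odds(odds_ratio):
    """Elo difference implied when player A's odds of winning are
    `odds_ratio` times more favorable than player B's
    (standard Elo / Bradley-Terry scaling: 400 * log10(odds))."""
    return 400.0 * math.log10(odds_ratio)

def elo_gap_from_winrates(p_a, p_b):
    """Elo gap between A and B inferred from their win rates against
    the SAME opponent pool, by comparing their odds of winning."""
    odds_a = p_a / (1.0 - p_a)
    odds_b = p_b / (1.0 - p_b)
    return elo_gap_from_odds(odds_a / odds_b)

# The same two players, measured against two different opponent pools,
# can come out with very different "Elo differences":
print(elo_gap_from_winrates(0.80, 0.55))  # pool 1: ~206 Elo apart
print(elo_gap_from_winrates(0.90, 0.55))  # pool 2: ~347 Elo apart
```

For reference, on this scale a 100 Elo gap corresponds to roughly a 64% expected score and 50 Elo to about 57%, which is the kind of conversion behind the "adds 100 Elo against computers but only 50 against humans" example below.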

What is the "true" Elo difference between A and B, then? There is no such thing. Players A and B each have different winning chances against each other and against different opponents, and these also change over time. If you change the opponents, you will measure a different difference.

This becomes especially pronounced when you add in computer players, because computer players often scale a little differently than human players at different levels. It's not terribly uncommon in computer game programming to make a series of improvements that add 100 Elo against some other computer opponents, but against some real human players you find it only adds 50 Elo, or some other number very different than 100. Because, again, Elo is relative to the opponent you test against, or the data that you use.

Does that make sense?

So if you do not know the system or the data that the "4000" was derived from, or what opponents were involved, then it may well mean something - perhaps it is a very careful and reliable measurement of something meaningful - but you yourself can't be sure what it means. And I'm pretty sure you can't guess that the system is exactly the same. After all, Golaxy is not listed on https://www.goratings.org/en/ - so clearly it is not part of the same rating system, with the same data, that is being used to compute the ratings on that site. :)

lightvector commented 3 years ago

And maybe to answer your original question - why is there sometimes this kind of "fragile" behavior?

If you stick around in the land of computer game playing and rating systems, you'll find that this kind of thing is actually not that surprising. Computers "think" differently than humans do, and often very differently from each other, due to differing algorithms. When you compare two computer AIs against each other, you can sometimes see very different results than when you measure them against humans. And sometimes even the computer results measured against humans will be noticeably different than human differences against other humans. "Nontransitivity" or "nonlinearity" of strength exists to some degree in any practical situation, but it can often be larger in many games when computer AI opponents get involved. There is no reason to expect that algorithms that think very differently than humans will have the same "scaling" of strength in different situations.
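
A toy illustration of that nontransitivity (purely hypothetical win rates, not measurements of any real engines): a single Elo number per player forces the head-to-head gaps to add up along a chain, but actual matchup results, especially with computers involved, often don't.

```python
import math

def implied_gap(p_win):
    """Elo gap implied by a single head-to-head win probability."""
    return 400.0 * math.log10(p_win / (1.0 - p_win))

# Hypothetical head-to-head win rates among three players A, B, C.
# A consistent Elo assignment requires gap(A,C) = gap(A,B) + gap(B,C).
p_ab, p_bc, p_ac = 0.60, 0.60, 0.55

print(implied_gap(p_ab))                      # A over B: ~ +70 Elo
print(implied_gap(p_bc))                      # B over C: ~ +70 Elo
print(implied_gap(p_ab) + implied_gap(p_bc))  # transitive prediction for A over C: ~ +141
print(implied_gap(p_ac))                      # "measured" A over C: only ~ +35
```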

This nontransitivity is also the cause of things like this:

It's not terribly uncommon in computer game programming to make a series of improvements that add 100 Elo against some other computer opponents, but against some real human players you find it only adds 50 Elo, or some other number very different than 100. Because, again, Elo is relative to the opponent you test against, or the data that you use.

So this means that when computers get involved, you should be a little cautious about assigning too much significance to any ratings (they are much less meaningful than you might think, especially if you don't account for exactly what data they were computed from), or in general about having strong expectations about how a bot will do relative to other opponents, or in handicap games, or at different time controls. These things do often behave normally and predictably, but sometimes they don't.

HackYardo commented 3 years ago

@lightvector Thank you for your detailed answer! AI defeats humans on the board, but it can't defeat humans in every way; even on the board, humans have something AI doesn't have!