some questions/thoughts on model improvement/progress

hwj-111 commented 2 years ago

The size ( such as 15b, 40b, 60b) is a very important character of a model (network). It seems that when a smaller-size model improvement progress slows down to some level for certain period of time, people conclude that its strength reaches the limit (saturated), and start a new training campaign on larger size model, which supposed to have a higher saturation level.

Another important thing is the visits/move used in the training. Different visits/move definitely will result different models. larger visits/move value result longer training time ( but stronger models, a reasonable assumption?), The visits/move value may depend on the model size. Naturally, larger model size (more complex, higher saturation level, can store much more complicated "strategy of playing go") requires larger visits/move to see" wider horizon and distant future", to match its capability. smaller visits/move would suffer from too many random (unlucky) reading errors, and lead to reach a pre-matured saturation level likely. However, large network size + higher visits/move would make the training data accumulation much slower ...

What visits/move is the optimal value for model training?

Another question is: what is the minimum visits/move value to use when use a model to reflect its true strength ? The question sounds stupid as strength of a model has to be evaluated only at a given visits/move. Based on some data from Leela Zero models, model strength increase ~ 120 ELO by doubling visits/move value. I found Katago's model following this rule too in my own limited tests. But I also noticed that when visits/move low, the ELO difference per doubling-visit/move becomes larger, especially for large size model. In other words, when lower visits/move, large size model becomes weak faster than small size model. This may be explained as models fulfill its trained strength only when visits/s is equal or larger its training visits/move value. It learned all best strategies when it has a view with that visits/move. After visit/move is higher than its training visit/move value, the model strength increase rate is fixed at ~120 ELO every doubling visits/move (a rate of benefit purely due to doubling the model vision) ...

I saw many 50 and 100visits/move katago 40b, 60b models playing on CGOS to measure the modeling progress. Are the visits/move too low to reflect the true strength of these large size model and their progress?

hwj-111 commented 2 years ago

I am thinking a way to judge if a type of model (a given block size) reaches its saturation level (SL) or not, like below: Say the latest trained model is "A" (trained @ m visits/move), and we think it probably reaches the SL. Then we should start to accumulate the training data by doubling the visits/move (i.e. @ 2m visits/move) and continue train this type of models for certain period of time. With enough time, we obtained model "B" ( after generating some 2m models already). Now we let the model "A" play with model "B" at 2m visits. If we see model "B" statistically stronger than "A", then it indicates that the SL of this block size is not reached yet. We need to continue train this block size with 2m visits/move, no need to start to train a larger size model type yet. By doing this, we may also determine what is the proper visit/move value for training a model with a given block size ... Of course, if model "A" strength is statically same as model "B", then we know that doubling the visits/move does not add any noticeable benefits to model (only increasing visits/move helps here), we are confident to move to larger block size ...

HackYardo commented 2 years ago

I'm not major in Computer, Statistics or Math, but I recommend you to read KataGo's paper first, playout cap randomization is a method used by KG, which means there are no a special value of visits.

hwj-111 commented 2 years ago

@HackYardo ， Agree that my assumption of a fixed visits/move value applied to KG is not correct/accurate here ( which was true for leelazero and alphago/zero only). Let me step back a little bit. KG training on a block-size-fixed model uses some method that is very flexible on the visits/move value. Does this mean KG training can find the best visits/move value for the models with that block-size somehow? Visits/move value may play a significant role in preparing training data. "Go" is highly unpredictable when we try to do extrmely-huge-parameter-set model fitting ( optimization). "seeing" 10-move-ahead may give a significant different probability view from 9-move-ahead ... larger model (large block size) may need significant larger visits/move values to match the needs of fitting larger parameter size. A model show stagnation may be caused by two factors: 1. the model's parameter set indeed reaches its optimal (with enough data accumulated) or 2. the data generated with a too-small-visits-per-move value cannot provide enough complexity to match the model parameter potential ( low visits/move value give more random reading errors which is not good in fitting huge parameter set, IMO)

Chenvincentkevin commented 2 years ago

@hwj-111 "Now we let the model "A" play with model "B" at 2m visits." what does that mean? I think the po should be the same

Chenvincentkevin commented 2 years ago

and I think it's proper to increase po in training in the early nets such as B6, B10, etc. But I'm not sure whether or not author did this kind of experiment before.

hwj-111 commented 2 years ago

@hwj-111 "Now we let the model "A" play with model "B" at 2m visits." what does that mean? I think the po should be the same

"A" is obtained by the data of roughly m visits/move while "B" by the data of ~2m visit/move as I mentioned above. When these two models match, they play at 2m visits/move. The results will tell us if doubling the (~2m) visits/move during training improves the model strength more than the benefit of merely doubling the visits/move at play.

This is based on an assumption (I think it is reasonable) that the larger (in term of block size) the model, the more average training visits/move is needed to fulfill the model potential (before we call it saturated)

I have some data collected from cgos which is reported at https://github.com/lightvector/KataGo/issues/576 (updated once a while). The data show one thing clearly: with 50 and 100 visits/move, we can still see some progress in 40b models. However, we almost cannot detect any significant progress in 60b models( especially for the v50 group). It does not necessarily mean 60b models did not get improved. It might be due to v50 and v100 is too low to detect its true progress ( it suffers from low visits/move "randomly" reading errors much more than models with smaller block size)

lightvector / KataGo

some questions/thoughts on model improvement/progress #657