official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

New discussion regarding testing bounds, testing book, etc. #2283

Closed adentong closed 4 years ago

adentong commented 4 years ago

There has been some discussion in https://github.com/official-stockfish/Stockfish/pull/2260 about the test bounds, how to optimize for resource usage and how to verify scaling and whatnot. So I opened a new issue for people to discuss. Also relevant is potentially switching to pentanomial statistics, which increases throughput by 10 to 15%, or so I've read in one of the discussions here.

adentong commented 4 years ago

Also since we're mostly done with the tuning patch now I propose we close https://github.com/official-stockfish/Stockfish/pull/2260.

xoto10 commented 4 years ago

I was hopeful about the new test bounds, but with hindsight:

Overall, it seems the tests take longer with no gain.

Various tests since the change indicate that ltc is very different from stc in some areas, to such a degree that stc is not predicting the ltc performance. If stc was slightly longer we would benefit from stc being more representative of ltc performance, e.g. see these figures from the patch from the first big search tune:

10+0.1 th 1: (-3.7 Elo +/- 5.3) LLR: -2.96 (-2.94,2.94) [0.00,4.00] Total: 10152 W: 2237 L: 2362 D: 5553

20+0.2 th 1: (+1.8 Elo +/- 3.0) LLR: 0.57 (-2.94,2.94) [0.00,4.00] Total: 21250 W: 4359 L: 4249 D: 12642

60+0.6 th 1: (+5.1 Elo +/- 4.0) LLR: 2.95 (-2.94,2.94) [0.50,4.50] Total: 12954 W: 2280 L: 2074 D: 8600

180+1.8 th 1: ELO: 12.52 +-2.6 (95%) LOS: 100.0% Total: 19291 W: 3119 L: 2424 D: 13748

Requiring a pass at stc would have stopped us testing this at ltc, but it was positive at 20+0.2.
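Figures like the ones above can be rechecked from the W/L/D counts. The following is a minimal sketch, assuming the logistic Elo model and a normal approximation for the error bar; the function name `elo_from_wld` is hypothetical and this is not fishtest's own code, so small differences from some of the quoted figures are possible.

```python
import math

def elo_from_wld(wins, losses, draws):
    """Return (elo, error95) estimated from win/loss/draw counts.

    Assumes the logistic Elo model and a normal approximation for the
    error bar; an illustration, not fishtest's implementation.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games            # mean points per game
    # Per-game variance of the result (1, 0.5 or 0 points).
    variance = (wins + 0.25 * draws) / games - score ** 2
    stderr = math.sqrt(variance / games)

    def to_elo(s):
        return 400.0 * math.log10(s / (1.0 - s))

    elo = to_elo(score)
    # Half-width of the 95% interval, converted to Elo.
    error95 = (to_elo(score + 1.96 * stderr) - to_elo(score - 1.96 * stderr)) / 2.0
    return elo, error95


# The 180+1.8 run quoted above: W 3119, L 2424, D 13748.
elo, err = elo_from_wld(3119, 2424, 13748)
print(f"{elo:+.2f} Elo +/- {err:.1f}")   # prints: +12.52 Elo +/- 2.6
```

On the 180+1.8 counts this agrees with the quoted 12.52 +-2.6.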

I suggest we relax the bounds in some way to make stc tests quicker again, and use the extra time made available to increase stc a little.

Note: Perhaps for now the focus should be on Elo gaining patches for TCEC. We could discuss this for a while but leave any testing until the Premier Division is under way (or even the Superfinal, if we make it this time).

Alayan-stk-2 commented 4 years ago

> the tests take much longer (obviously)

Yes.

> it seems there is an increased chance of merging patches with poor scaling with time (not sure how significant this issue is, and it will always be possible, but I think it is more likely now than under the old scheme)

I agree, I made the same observation months ago.

> I'm not sure we have gained anything with the new bounds

With the STC ones, I'm not sure either.

But here are Elo gainers that have been merged into SF since June with an LTC test performance below what the old bounds would have required:

#2273, #2266, #2252, #2246, #2233, #2207, #2205, #2199, #2192, #2185, #2183

And there are many more from earlier months since the new bounds were introduced, of course.

The new LTC bounds often made LTC tests last longer, but overall they helped get a bunch of Elo gainers in.

> Various tests since the change indicate that ltc is very different from stc in some areas, to such a degree that stc is not predicting the ltc performance. If stc was slightly longer we would benefit from stc being more representative of ltc performance

Yes, I agree.

> I suggest we relax the bounds in some way to make stc tests quicker again, and use the extra time made available to increase stc a little.

15+0.15 is probably a more realistic increase for STC than 20+0.2; but even then, I don't think you can avoid a resource usage hit.

I doubt you can reduce the number of games enough to compensate for a 50% longer TC (the old [0, 5] bounds wouldn't be enough).

If you also wish to address the issue of STC being the main obstacle to clear instead of LTC, you also need something less tough to get a green from, so there will be the resource usage of the associated LTC tests to account for (but as many of those are already frequently run as speculative LTC, only in an inconsistent way, I'm not sure it really qualifies as a resource usage increase).

One theoretical idea I had a few months ago was to use 3 stages instead of the current 2, using laxer bounds and overall dropping resource usage (simulations using vondele's tool with some tweaks showed an increase in elo gain per resource used), but the impact on scaling behavior was dubious.

snicolet commented 4 years ago

I also think we could try to relax the STC bounds (and maybe compensate by increasing the STC a bit to keep some part of the filtering quality). When I arrived in the project they were [-1.5, 4.5]. It was great: everybody got STC greens all the time and was super happy, and it gave momentum to the community.

Alayan-stk-2 commented 4 years ago

I was not aware that STC bounds were like that previously!

This actually makes a lot of sense when considering that the core role of STC is to filter promising patches to save resources, with adding confidence only a moderate part of its job.

Maybe STC bounds like [-1, 4] or something similar could work out well.

With noobpwnftw periodically able to contribute thousands of cores, we could maybe bump STC to 15+0.15 to address the scaling concerns; in normal periods with a few hundred cores, limit testing to STC only (LTC, tunes and regression tests could still be submitted, but at low priority); then periodically let the core surge clear the queue.

For information, here is how SF from early May scaled with TC on my computer (I didn't standardize the TC to fishtest's 1.6Mnps, but it's not too far off):

   # PLAYER            :  RATING  POINTS  PLAYED   (%)
   1 Stockfish_90+0.9    :  3147.5   886.5    1200    74
   2 Stockfish_60+0.6    :  3109.1  1682.0    2400    70
   3 Stockfish_40+0.4    :  3053.6   760.0    1200    63
   4 Stockfish_25+0.25   :  2981.5  1205.5    2400    50
   5 Stockfish_15+0.15   :  2891.7   864.5    2400    36
   6 Stockfish_10+0.1    :  2816.7   601.5    2400    25
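One way to read this table is as Elo gained per doubling of the time control. The snippet below is a quick sketch of that arithmetic, assuming the increment scales with the base time so that only the base time matters; the helper name `elo_per_doubling` is purely illustrative.

```python
import math

# (base time in seconds, rating) copied from the table above.
ratings = [
    (10, 2816.7),
    (15, 2891.7),
    (25, 2981.5),
    (40, 3053.6),
    (60, 3109.1),
    (90, 3147.5),
]

def elo_per_doubling(rows):
    """Elo gained per doubling of the time control between adjacent rows."""
    gains = []
    for (t_lo, r_lo), (t_hi, r_hi) in zip(rows, rows[1:]):
        doublings = math.log2(t_hi / t_lo)
        gains.append((t_lo, t_hi, (r_hi - r_lo) / doublings))
    return gains

for t_lo, t_hi, gain in elo_per_doubling(ratings):
    print(f"{t_lo:>2}s -> {t_hi:>2}s: {gain:6.1f} Elo per doubling")
```

On these numbers the gain per doubling shrinks steadily as the TC grows.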

MJZ1977 commented 4 years ago

There are several distinct discussions on this topic, and it is hard to give a sound opinion without separating them:

1- using STC as the first test filter for patches: in most cases, I think it is OK. For special patches with a depth dependency, we can accept that LTC and STC behave differently. If it is justified in these cases, I think the best choice is to run a speculative LTC and cut the run if there is no improvement after 10k or 20k games.

2- current bounds and the difficulty of getting green tests: in general STC passes with +1.8 Elo and LTC with +1 Elo. That seems OK to me, but if we cannot find a patch for several weeks, perhaps it is a good idea to change the bounds... One idea anyway: the current parameter tweak bounds [0..4] are too high for me. If the parameters are better and we are quite sure of it, we should change them even for +0.5 Elo.

3- long test queue: this issue is tied to the amount of resources we have. If we have >3000 cores, the current bounds are good. If we have <1000 cores, they are too strict and we are losing time while probably good ideas wait behind almost surely bad ideas. In this last case, [0..5] STC seems better to me...

noobpwnftw commented 4 years ago

@Alayan-stk-2 Here are my selections of test positions: https://www.chessdb.cn/downloads/2moves.zip https://www.chessdb.cn/downloads/3moves.zip https://www.chessdb.cn/downloads/4moves.zip

The number of moves is in fact irrelevant, and I have decided not to remove "drawish" lines since I think doing so compromises diversity.

Some tests may have shown that my books are more sensitive than the improved 2moves book, but they contained fewer positions and had a higher draw rate. So now there are more positions, and I don't think a higher draw rate is a bad thing: engines need to be tested on not losing those drawn positions, as long as sensitivity also improves.

Those books need to be thoroughly re-tested though. I suggest that after local verification, we put them on fishtest and run a few regression tests.

EDIT: Links updated.

Alayan-stk-2 commented 4 years ago

Here is a concrete suggestion concerning bounds and TC (leaving the book topic out for now) :

TC

STC duration : 15s + 0.15 instead of 10s+0.1.

We've seen on multiple occasions, especially with xoto's tune, that changing parameters, and even more so adding search or eval features, can produce significantly different results depending on the TC.

While this doesn't apply equally to all patches, it is often hard to predict how much it matters, hence the point of a general increase.

Raw nps is still a major concern at 15s+0.15, but behavior is closer to LTC, which in turn should reduce false positives and negatives.

LTC duration : 60s +0.6

It would be great to be able to start tuning and tweaking SF with a longer LTC, but this would be a bigger commitment in resources. If we can keep those sweet 5000 cores at all times, this may be a possibility to explore, but it is better to proceed progressively; being able to test tons of ideas without clogging the queue is nice too.

1. Elo gainers

For the bounds, I suggest this :

STC : [-1, 4]

LTC : [0, 3.5]

This makes no distinction between param tweaks and patches adding code: param tweaks are less risky and don't add code complexity, but it's also easier to come up with param changes that are roughly Elo neutral and would run as extremely long LTCs with tighter SPRT bounds.

LTC tune results should always have a shot at LTC SPRT. This is already the unofficial practice because it only makes sense, but it should be acknowledged as legitimate.

2. Simplifications

Same STC and LTC duration as elo-gaining patches.

Two types of bounds :

STC : [-2.5, 1.5] LTC : [-1.5, 2.5]

(If this is too complicated, just [-2, 2])

STC : [-3.5, 1] LTC : [-2.5, 1]

(If this is too complicated, just [-3, 1])

3. Non-functional changes

These don't need as much care, as it's only a matter of code clarity and speed. In many cases, testing is not needed. When required :

noobpwnftw commented 4 years ago

I just have one question: why are accurate STC tests necessary when people still run "speculative" LTC and VLTC tests anyways?

noobpwnftw commented 4 years ago

It is all about how long people must wait to see their test results. Previously things were doing fine; then the bounds changed and overall progress more or less stalled. Now, given 5x the resources, there is some progress, and it doesn't look like the queue will run dry very often. Why is that? Basically it discouraged people from queuing up their tests, and the extra accuracy doesn't justify the resources it takes. I wonder how many CPU hours must be wasted in order to prove that good-looking paper theory is an utter failure in practice.

Spirit of fishtest: "I think my change makes perfect sense!" while most of them failed their tests.

Alayan-stk-2 commented 4 years ago

> I just have one question: why are accurate STC tests necessary when people still run "speculative" LTC and VLTC tests anyways?

The only accurate STC in this proposal is for non-functional speedups. Regular Elo-gainer STCs go from [0.5, 4.5] to [-1, 4], which is less precise and demanding, and should altogether eliminate the regular use of spec. LTC (except when people want to try some high-depth-only stuff). It's too hard to get a green STC right now.

I agree that we don't want test results taking forever to come, or the queue filling up and discouraging new attempts.

Anyway, one major thing which would help is flexibility.

noobpwnftw commented 4 years ago

> It's too hard to get a green STC right now.

Then we should start from right here; the rest will solve itself.

NKONSTANTAKIS commented 4 years ago

The STC is meant to save resources but it seems to be wasting them: not only is it too hard to pass, but the patches that pass it also need to pass LTC, which is a much different environment (we saw how a negative-Elo STC can be a +12 Elo VLTC). Passing both has become very rare, to the point that I think we might be better off testing at LTC only. On the other hand there are many promising ideas, but it's very difficult to guess the right values or formulas, which requires many tries. It's unrealistic to think we have the computing power to support LTC only.

Most of the progress came from tuning; SPSA with multiple parameters is doing wonders (obviously human intuition is inferior to result-derived values). And we see that only LTC tuning is working; STC tuning always fails. If we could afford VLTC tuning it would surely be awesome for the TCEC TC.

So instead of keeping this nitpicking scheme of thoroughly testing 1 change at a time, why not go grand scale: introduce holistic strategy design models (or just a combo of promising ideas), VLTC tune & test them, and simplify them afterwards.

NKONSTANTAKIS commented 4 years ago

For improvement of the current scheme I am on the same page with Alayan and noobpwn. For code adders STC [-1,4] 15" sounds good but why not take it one more step and go [-2,5] 20", transferring accuracy to quality. This will reduce the misfortune of filtering out good scalers and the confidence will be derived solely from LTC anyway.

MichaelB7 commented 4 years ago

The conventional, and potentially outdated, wisdom was that evaluation changes that do not impact search could be tested at shorter time controls (say STC and LTC), and anything that touches search should ultimately be tested at VLTC or something longer than LTC. Perhaps bifurcate the testing parameters so that eval changes are tested at STC and LTC, and search changes (which we would need to define) are tested at slightly longer time controls; I would think 30+0.3 sec and 120+1.2 sec, or just something a little longer than standard. We need to stop running speculative patch runs, but that also requires us to get more greens on the first pass so we don't miss opportunities. A general observation: no test should run more than 100K or 150K games max (pick a number, but I have seen 200K runs, and that is simply not efficient). If we loosen the bounds a tad, then we can add a requirement that a test almost pass within a certain number of games. I'm not an expert in this stuff, but some people here are, and it definitely looks like it needs some tuning: make it easier to pass the first test, reduce max games to xxxx, and test some changes at longer than 60+0.6, especially those dealing with search extensions, LMR reductions etc., so we don't miss when the search explodes at depth 35 or fail a test that does really great at VLTC. And I say all this, but I'm not saying it is worse than what it was before; as others have mentioned, it just hasn't quite lived up to expectations. Our 10+ year history of making steady Elo gain is still intact!

MJZ1977 commented 4 years ago

As I said before, it all depends on fishtest resources. If we have 3000+ cores we can try a longer TC. With fewer than 1000 cores, it will be very hard.

Just another observation: currently, to pass STC we need 0.4% more wins than losses. This means the patch must be triggered at least 1% of the time or more to have an effect. So all patches for special positions (like the French Defense, shuffling or special passed pawn configurations etc.) will have a very hard time passing. So lowering the STC criteria is not a bad idea, since we need to improve in these special positions.
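The link between an Elo gain and the excess of wins over losses can be made concrete with a few lines of arithmetic. This is a rough sketch assuming the logistic Elo model; the function names are hypothetical, and the bounds discussed in the thread may be expressed on a different Elo scale, so the numbers should be treated as approximate.

```python
import math

def win_loss_margin(elo):
    """Fraction of games by which wins must exceed losses for a given Elo.

    Logistic model: score = 1 / (1 + 10^(-elo/400)), and since
    score = 0.5 + (W - L) / (2 N), we get (W - L) / N = 2 * score - 1.
    """
    score = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    return 2.0 * score - 1.0

def elo_from_margin(margin):
    """Inverse of win_loss_margin: Elo for a given (W - L) / N."""
    score = 0.5 + margin / 2.0
    return 400.0 * math.log10(score / (1.0 - score))

for elo in (0.5, 1.0, 1.8):
    print(f"{elo:.1f} Elo -> W-L margin {100 * win_loss_margin(elo):.2f}% of games")
print(f"0.40% margin -> {elo_from_margin(0.004):.2f} Elo")
```

Under this model a margin of a few tenths of a percent corresponds to roughly one to two Elo, which is why a patch that only matters in rare positions needs a large effect in those positions to show up.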

I also agree with MichaelB7 that running 150k+ games is not so useful. Perhaps there are parameters in SPRT (other than the bounds) that avoid these situations.

xoto10 commented 4 years ago

There seem to be a number of people wanting to ease the stc elo gainer bounds, either to the previous [0,5] or, as Alayan even suggests, [-1,4]. Does anyone disagree? Can we just change the stc bounds alone, or should we adjust ltc at the same time?

Alayan-stk-2 commented 4 years ago

I think LTC bounds are fine, except for the incoherence between param tweaks and code-adders.

MJZ1977 commented 4 years ago

LTC bounds are fine for me too. We can try STC [-1,4] as suggested, and for parameter tweak also.

NKONSTANTAKIS commented 4 years ago

I agree with all, and also propose LTC [-1,4] for parameter tweaks. They require periodic retuning, are harmless, and smaller gains help too. I can't find any reason to set as high a bar as for code adders, or to channel a lot of resources into accuracy. Also, any objection to 15" STC?

Alayan-stk-2 commented 4 years ago

[0, 3.5] bounds are quite resource-intensive, and retuning must be done periodically, so any worse-than-expected change would most likely get corrected the next time the value is tuned. There is therefore an argument for going to [-0.5, 4], if not [-1, 4], for param tweaks. However, such bounds are much more vulnerable to someone testing 10 minor variants and getting one through by pure luck even though it brings nothing.

For the "not quite good enough" param tweaks, there is already the possibility of using combo patches. A mindful use of combo patches should work well enough. By mindful I mean e.g. for two tune results; or for a tune result and one of a series of promising hand tweaks; but not for random manual tweaks that got a yellow but were accompanied by several very similar tweaks that all failed and are most likely lucky results.

Now, if two param tweaks go to 150K LTC while they would have passed much quicker in a combo or with laxer bounds, we still get a waste of fishtest resources... So there is room for improvement.

With noob's regular 9000 cores hardware injection, it may be useful to have a policy of putting LTCs on low prio so that the regular ~1000-1200 cores can be used for still getting quick STCs result.

NKONSTANTAKIS commented 4 years ago

I can see no drawback in occasionally accepting a param tweak which brings nothing. Why would it hurt? Code complexity is the reason we do resource-intensive, high-confidence LTC, and we are willing to sacrifice tiny Elo to simplify it. This should be reflected in our bounds; using the same [0,3.5] makes no sense whatsoever. On the other hand, with [-1,4] we will both catch more small Elo gainers and save resources. The faster resolution will allow more tests. Combos are currently a necessity, but they are also like admitting we are using bounds that are too hard: 3 long tests are used when 2 shorter ones would suffice. Also, it's statistical sloppiness to use yellows as a precondition for combos; it's like asking an expensive question when we care about a different answer. Some reds would pass the question we care about, and all yellows and greens would pass it much earlier.

The only risk I see with [-1,4] is if people, sweetened by the taste of greens and yellows, get addicted to gambling with param tweaks, thus diverting the focus from the more essential code category. More self-discipline would be required; it's gonna be fun.

Alayan-stk-2 commented 4 years ago

@snicolet So, what do we do from here ?

snicolet commented 4 years ago

• I have created a new repository in the official-stockfish github site, so that we can store new books there for testing purposes: https://github.com/official-stockfish

• about the new bounds, I was curious enough to count the speculative LTCs among the last 100 LTC tests with Elo-gaining bounds submitted to fishtest: as of October 19th, the percentage of speculative LTCs was 70%, from all active Stockfish developers.

I think that this gives some feedback after a few months for https://github.com/official-stockfish/Stockfish/pull/1804#issuecomment-445429885, https://github.com/official-stockfish/Stockfish/issues/1859, https://github.com/glinscott/fishtest/pull/342, and shows that the STC new bounds were too strict for our community since even the most motivated members bypass them.

I shall open a pull request for [-1..5] bounds for STC.

snicolet commented 4 years ago

Pull request opened here: https://github.com/glinscott/fishtest/pull/417

Alayan-stk-2 commented 4 years ago

I'm happy about this initiative.

The only slight fear I have is that [-1, 5] proves too wide (I've seen my share of very lucky & unlucky runs), but as the lowerbound gets lower the actual risk of a solid gainer being rejected doesn't really increase. Vondele's tool indicates that [-0.5, 4.5] wouldn't change much in the end.

Some stats with vondele's tool and its assumed patch elo distribution compared to strict rules now (i.e. no spec LTC) :

In practice, as spec LTC are already very common, there probably won't be a testing cost increase, and the number of applied patches won't increase nearly as much either, but it should be more consistent.

NKONSTANTAKIS commented 4 years ago

A very positive change which saves a lot of wasted STC resources (lengthy STC + spec LTC). The framework will operate much faster, and I suspect that humans instead of hardware will be the bottleneck, enabling a natural transition to higher TC STC. But let's first see that in practice.

snicolet commented 4 years ago

Note that I would be fine with either [-1 , 4] , [-1 , 4.5], [-1 , 5] or even [-1.5 , 4.5] bounds.

ppigazzini commented 4 years ago

cc @mcostalba @vondele @vdbergh @Chess13234 @Vizvezdenec

Vizvezdenec commented 4 years ago

I'm okay with whatever bounds you like tbh.

vondele commented 4 years ago

I think we should indeed reduce the threshold for passing a patch, both stc and ltc, but I think it is a mistake to make the interval wider. Making the interval wider just says that we like to have more noise. [-1, 5] is like [0, 4] with low confidence.

@snicolet I'd rather shift the current STC bounds by 1 elo, e.g. [-0.5, 3.5] and the LTC upper bound by 0.5 to [0, 3].
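To make the role of the interval concrete, here is a minimal sketch of how an SPRT log-likelihood ratio can be approximated from W/L/D counts. It assumes the logistic Elo model and a normal approximation of the game results; the function names and the sample counts are hypothetical, and it is not fishtest's implementation, so it is not expected to reproduce the LLR values quoted earlier in the thread. With alpha = beta = 0.05 the stopping thresholds come out at the familiar (-2.94, 2.94).

```python
import math

def expected_score(elo):
    """Expected score for an Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_thresholds(alpha=0.05, beta=0.05):
    """Lower and upper LLR stopping thresholds (Wald)."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

def llr(wins, losses, draws, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Normal approximation: a game with result x contributes
    (s1 - s0) * (2 * x - s0 - s1) / (2 * variance).
    """
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    variance = (wins + 0.25 * draws) / games - mean ** 2
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)

lower, upper = sprt_thresholds()
print(f"stop when LLR <= {lower:.2f} or LLR >= {upper:.2f}")

# Hypothetical counts, judged with a narrow and a wide interval.
for bounds in ((0.0, 4.0), (-1.0, 5.0)):
    print(bounds, round(llr(4400, 4200, 12400, *bounds), 2))
```

In this approximation the per-game LLR step is proportional to (s1 - s0), so a wider interval reaches a threshold in fewer games, at the price of a noisier decision for patches whose true Elo lies inside the interval.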

MJZ1977 commented 4 years ago

@snicolet Lets make a poll with only active users participating and 2 or 3 choices and then decide :-) For example poll STC [-0.5,3.5] or [-1,5] or [-1,4]

NKONSTANTAKIS commented 4 years ago

As we have seen, a lot of stuff behaves differently at different TCs. This is why I consider STC confidence untrustworthy, expensive and slowing down the tempo. Our high LTC confidence ensures quality. On the other hand we should not miss unlucky good patches due to wide STC intervals. By reducing the threshold in parallel we accomplish 3 things:

  1. Ensuring that even unlucky good patches will make it
  2. Saving a lot of STC resources
  3. Allowing a random selection of lucky STC runs to be tested at LTC

With the strategy to eliminate spec LTC, we save a lot of resources but lose its most valuable asset, to catch scalers which are weak at STC. Point 3. will allow some without extra cost.

@snicolet Hence, out of the suggested bounds, [-1.5, 4.5] seems to be the most suitable for eliminating spec LTC. There is also fishtest experience with those from the early years. [-1, 4] is also good.

vondele commented 4 years ago

I'll post again the link to the SPRT optimization tool, so one can experiment a bit:

https://mybinder.org/v2/gh/vondele/Stockfish/tools?filepath=tools%2FFishtest_SPRT_opimization.ipynb

concerning speculative LTC, it is really for the patch authors to show some discipline. In a few cases, there are good reasons to assume some TC dependence, but actually this is less common than what is often claimed.

MJZ1977 commented 4 years ago

@vondele: your tool seems very interesting. Unfortunately I'm not familiar with it and can't find out how to use it.

After following some links, I have found a comparison that Alayan made some months ago using the tool https://github.com/official-stockfish/Stockfish/issues/1859#issuecomment-453751997

We can see that changing only STC to [-0.5,4.5] significantly increases the chance of +1 and +1.5 Elo patches passing, while the pass probability of +0 Elo patches stays almost null. The average test cost is higher, probably because more LTCs are run, but I think the simulator didn't take the spec LTCs into account :-) There are also proposals for 3-stage or 4-stage tests, but that is perhaps too complicated, at least compared to the current state.

In any case, if the final goal is to keep SF progressing, it seems obvious to me that we should accept more +1 Elo and +1.5 Elo patches, because +3 and +4 patches are more and more rare. We have to make SF progress in special positions which occur only 2 or 3 times in hundreds of games.

vondele commented 4 years ago

@MJZ1977 to use it, you can input bounds to be used for STC and LTC in input cell 8 (_proposed), and evaluate the full notebook (see kernel menu). The pass probabilities are shown in the graphs as a function of Elo of the patch, and various other related quantities are computed as well. The notebook still refers to the old [0,5] bounds for 'now' (this could be fixed editing input cell 5).

I do agree that we need to make sure that 1 Elo patches have a reasonable passing rate.
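For readers who cannot run the notebook, the shape of such pass-probability curves can be approximated in a few lines. The sketch below is not the notebook's code: it assumes the logistic Elo model, Wald's continuous (Brownian motion) approximation of the SPRT, and that a patch has the same true Elo at STC and LTC. The function names are hypothetical, and the output is not expected to match the tool's numbers.

```python
import math

def expected_score(elo):
    """Expected score for an Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_pass_probability(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Approximate probability that an SPRT with bounds [elo0, elo1] passes."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    s0, s1 = expected_score(elo0), expected_score(elo1)
    mu = expected_score(true_elo)
    # Ratio 2 * drift / variance of the LLR random walk (normal approximation).
    theta = (2.0 * mu - s0 - s1) / (s1 - s0)
    if abs(theta) < 1e-12:
        # Driftless walk: hitting probability is linear in the distances.
        return -lower / (upper - lower)
    return (1.0 - math.exp(-theta * lower)) / (
        math.exp(-theta * upper) - math.exp(-theta * lower))

def two_stage_pass_probability(true_elo, stc_bounds, ltc_bounds):
    """Probability of passing STC and then LTC, assuming the same Elo at both."""
    return (sprt_pass_probability(true_elo, *stc_bounds)
            * sprt_pass_probability(true_elo, *ltc_bounds))

stc_now, stc_new, ltc = (0.5, 4.5), (-1.0, 4.0), (0.0, 3.5)
for elo in (0.0, 1.0, 2.0, 3.0):
    now = two_stage_pass_probability(elo, stc_now, ltc)
    new = two_stage_pass_probability(elo, stc_new, ltc)
    print(f"{elo:.1f} Elo: {100 * now:6.2f}% -> {100 * new:6.2f}%")
```

By construction a patch whose true Elo equals the lower bound passes with probability alpha, and one at the upper bound passes with probability 1 - beta.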

Alayan-stk-2 commented 4 years ago

As @vondele just said, the tool is rather easy to use. Edit the bounds in the relevant input cells, and you're done. You can also add additional data points in the cells towards the end, I did so when I did my table.

Here are results using [0.5, 4.5] + [0, 3.5] as the reference.

Assuming 1 STC try per patch:

| Limits (STC + LTC) | [0.5, 4.5] + [0, 3.5] | [0, 5] + [0, 5] | [-0.5, 4.5] + [0, 3.5] | [-1, 4] + [0, 3.5] | [-0.5, 3.5] + [0, 3] | [-1, 5] + [0, 3.5] |
|---|---|---|---|---|---|---|
| -0.5 ELO pass prob | 0.0091% | 0.037% | 0.041% | 0.072% | 0.030% | 0.069% |
| 0 ELO pass prob | 0.123% | 0.25% | 0.433% | 0.729% | 0.495% | 0.616% |
| 0.5 ELO pass prob | 1.407% | 1.524% | 3.726% | 5.846% | 6.082% | 4.534% |
| 1 ELO pass prob | 10.02% | 7.439% | 19.28% | 27.08% | 33.14% | 20.56% |
| 1.5 ELO pass prob | 34.22% | 24.60% | 48.31% | 59.84% | 69.04% | 47.37% |
| 2 ELO pass prob | 64.38% | 51.35% | 73.20% | 82.01% | 88.60% | 69.67% |
| 2.5 ELO pass prob | 85.04% | 74.77% | 87.63% | 92.44% | 96.11% | 83.70% |
| 3 ELO pass prob | 94.56% | 88.59% | 94.61% | 96.87% | 98.70% | 91.64% |
| total ELO gain ratio | 1.0 | 0.758 | 1.570 | 2.054 | 2.391 | 1.611 |
| -0 ELO acceptance ratio | 0.0091% | 0.025% | 0.036% | 0.061% | 0.034% | 0.054% |
| Avg. STC cost | 24456 | 18431 | 20586 | 22836 | 31590 | 15897 |
| Avg. total Cost (in STC games) | 38039 | 27931 | 50640 | 67677 | 83276 | 51844 |

What this table lacks: a proper simulation of how speculative LTCs affect Elo gain and resource usage at fishtest right now. I wouldn't be surprised if the Elo gain ratio with spec. LTC is around 1.5 or 1.6 already, but with a worse resource efficiency for this result.

Something that isn't taken into account in all those computations is how valid Elo transitivity is for very small gains. I suspect (but can't prove) that 100 patches scoring each +0.1 against the previous master (assuming real +0.1, in practice we can't know the real value precisely enough) would usually give less Elo than 10 patches scoring +1.0 against the previous master.

All of this also assumes that a patch performs identically at all TCs, which is an incorrect approximation. This is the main reason I'm skeptical about my own suggested 3-stage testing, besides the fact that complicating the process may create some difficulties. Otherwise, multi-stage testing is unbeatable in the simulations for Elo/resource.

Another element that isn't evaluated: the STC results often guide further attempts. Hence, an inaccurate lucky result means more resources will be put into variants of a bad idea, while an inaccurate unlucky result may discourage further attempts to find a gainer. So, generally speaking, of two solutions with similar Elo passing and resource usage characteristics, the one with tighter STC bounds will be better in practice.

Also, the total costs ratio shouldn't be considered as the total resource usage of fishtest : LTC tunes, and less importantly regression tests, are also a significant part of the load.
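The cost rows of such a table come from the expected length of an SPRT. As a rough illustration, the sketch below estimates the average number of games with Wald's continuous approximation. It assumes the logistic Elo model and a fixed draw ratio of 0.6, which is a made-up value; the function names are hypothetical, and the output is not expected to reproduce the figures in the table.

```python
import math

def expected_score(elo):
    """Expected score for an Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def expected_games(true_elo, elo0, elo1, draw_ratio=0.6, alpha=0.05, beta=0.05):
    """Approximate average number of games before an SPRT stops."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    s0, s1 = expected_score(elo0), expected_score(elo1)
    mu = expected_score(true_elo)
    # Per-game variance of the score: win prob + draw prob / 4 - mean^2.
    variance = (mu - draw_ratio / 2.0) + draw_ratio / 4.0 - mu ** 2
    drift = (s1 - s0) * (2.0 * mu - s0 - s1) / (2.0 * variance)  # mean LLR step
    spread = (s1 - s0) ** 2 / variance                           # variance of LLR step
    theta = 2.0 * drift / spread
    if abs(theta) < 1e-9:
        # Driftless case: expected exit time of a symmetric random walk.
        return -lower * upper / spread
    p_pass = (1.0 - math.exp(-theta * lower)) / (
        math.exp(-theta * upper) - math.exp(-theta * lower))
    # Wald's identity: E[LLR at stop] = drift * E[number of games].
    return (p_pass * upper + (1.0 - p_pass) * lower) / drift

for bounds in ((0.5, 4.5), (-1.0, 4.0), (-1.5, 4.5)):
    games = [round(expected_games(elo, *bounds)) for elo in (0.0, 1.5, 3.0)]
    print(bounds, games)
```

Multiplying the LTC figure by the TC ratio and by the STC pass probability gives a total cost in STC games, under the same assumption of identical Elo at both time controls.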

MJZ1977 commented 4 years ago

@Alayan-stk-2 : very nice and clear table ! can you please add {STC [-0.5,4.5] + LTC [-0.5,3]} or something like this to see the effect of making a negative bound @LTC ?

NKONSTANTAKIS commented 4 years ago

Long STC runs will be awfully slow. We want to speed up the tempo, not lower it. Many versions of the same idea are required to find the sweet spot, and many people have many ideas at the same time. That's just impossible with 150K runs; it's common sense. I reckon @snicolet is fully in this direction, and @noobpwnftw also expressed his strong dissatisfaction with tests taking forever to terminate. Many long yellows were retested with [-1.5, 4.5]; none passed, and the longest one was a 57K yellow. This indicates it's working well. Progress will come through the sheer number of tries; it's impossible and naive to judge the quality of an idea from an STC result, as there is no way of knowing whether the idea is good or bad or whether the guessed parameters for it are off.

Also the quality of 10+0.1 with such low depths is awful, and so often unrelated to LTC. Many tests pass the difficult [0.5, 4.5] and fail the easier [0, 3.5]. By upping the STC the results will be more meaningful. With a higher level of chess it's more likely that fixes for the spotted strategic flaws of SF will show. It's a very resource-hungry environment, hostile to quality enhancements that require some extra computation.

The change to 15+0.15 is not expensive; the increased quality is more than worth it. Performance will be much more indicative of LTC performance. The likes of [-1.5, 4.5], [-2, 4.5], [-2, 5], [-1.5, 4] will enable a cheap, swift and continuous selection of high class positives out of a much wider spectrum of tries. In the end it does not matter if a few good ones get lost or a few "bad" ones get promoted. As long as the flow to LTC is kept steady the Elo will be rising.

Regarding the LTC, at [0, 3.5] it feels very solid and secure. It's very expensive, but it's our highest quality confidence and final call, so it's money well spent. It makes sense to make it a tad easier, as everyone noted that progress now comes from small gains, but [0, 3] would be the heaviest thing we ever saw; no way it could support an increased number of promotions. [-0.5, 3.5] and [-0.5, 4] are sensible options.

NKONSTANTAKIS commented 4 years ago

http://tests.stockfishchess.org/tests/view/5dac6b920ebc590eca43e237

0.03 LLR at 70K games with [-1.5 , 4.5]

Question: why is it so important to spend possibly another 150K games (with narrower bounds) to identify whether it's closer to -1.5 or 4.5 Elo? Obviously it's very near the middle, 1.5 Elo. It's a close call, hence promoting it to LTC or not is utterly trivial. So both options are fine, and the biggest misfortune is extending the test.

I don't want to be at 70K games and still undecided; this makes me think that [-1.5, 4.5] is not wide enough.

Vizvezdenec commented 4 years ago

it's 30% wider than current test so it's obviously converging faster.

Alayan-stk-2 commented 4 years ago

> Question: why is it so important to spend possibly another 150K games (with narrower bounds) to identify if it's closer to -1.5 or 4.5 elo? Obviously it's very near the middle, 1.5 elo.

Not that I like 200K STC runs, but your assumption about the test being "obviously" near the middle is wrong. Plain wrong.

For the linked test which got a +0.96 perf when the test stopped yellow : according to the error bars, there is a 17% chance of it being below -0.25 elo and a 2.5% one of it being below 1.5 elo.

But in practice, we know that when a test has positive results, if these are wrong, it's much more often an overestimated bad test than an underestimated good test. This is because most tested patches are bad/neutral. So actually, the above odds are too optimistic for a random patch showing this performance.

A poor patch is significantly less likely to sustain a +1 elo performance over 150K or 200K games than 70K games.
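The "most tested patches are bad/neutral" argument is a statement about priors, and it can be illustrated with a simple normal-normal update. This is only a sketch: the prior (mean -1 Elo, standard deviation 1.5 Elo) and the two error bars are made-up values chosen for illustration, and `posterior_elo` is a hypothetical helper, not part of any fishtest tool.

```python
def posterior_elo(observed_elo, error95, prior_mean, prior_sd):
    """Combine a normal prior on patch Elo with a normal measurement.

    Returns (posterior_mean, posterior_sd).
    """
    obs_sd = error95 / 1.96
    prior_precision = 1.0 / prior_sd ** 2
    obs_precision = 1.0 / obs_sd ** 2
    post_var = 1.0 / (prior_precision + obs_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + observed_elo * obs_precision)
    return post_mean, post_var ** 0.5

# Hypothetical prior: the typical submitted patch is slightly negative.
PRIOR_MEAN, PRIOR_SD = -1.0, 1.5

# Hypothetical 95% error bars for a shorter and a longer run.
for label, error95 in (("shorter run", 1.7), ("longer run", 1.0)):
    mean, sd = posterior_elo(0.96, error95, PRIOR_MEAN, PRIOR_SD)
    print(f"{label}: observed +0.96 -> posterior {mean:+.2f} +/- {1.96 * sd:.2f}")
```

The observed value is pulled toward the prior, and the pull is weaker when the measurement is more precise, which matches the point that a longer run sustaining the same performance is stronger evidence.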

Wide bounds also make it more likely for an actual good patch to fail, because wide bounds make it more noisy.

That said, because of different behaviors at different TC, I'd prefer not being too strict on STCs, this is the issue the change is supposed to fix after all. I'd like to see [-1, 4] @ 15+0.15. The higher STC would increase testing cost by 15%, which considering the amount of tunes is probably less than 10% resource usage.

NKONSTANTAKIS commented 4 years ago

@Vizvezdenec Yes obviously, and I am happy to see that it terminated at 75K while being 0LLR at 70K, which is a very good sign for (-1.5 , 4.5) tempo.

@Alayan-stk-2 Yes, I agree with the data of your analysis but not so much with the interpretation. You say that "it's much more often an overestimated bad test than an underestimated good test". In this context "good" and "bad" are relative to the target. We could for instance consider any patch which is above -0.5 elo at STC as a good candidate for LTC. In our strategy, by setting the middle ground at +1.5 elo STC, it is a given that most of the tested samples that pass will be lower than that, due to them being more common. But that is perfectly ok and desired. The purpose of STC in this strategy is to be used as a filter, for saving resources, and not to draw any confidence from it. If we could afford to test everything straight at LTC it would be better, but as it is we opt to do STC selection to cut out useless stuff. Based on how hard it is to make improvements, I would not categorize a +0.5, a neutral, or even a -0.25 elo STC patch as obviously useless. I am fine with any elo range I get by setting a practical bar with regard to resource economy and a reasonable % (I'd say between 1/5 and 1/10 of all STC tests) of LTC promotions. This can be adjusted as desired; raising or lowering the STC filter bar means moving the bottleneck from humans to hardware. A higher STC bar is more work for humans and a lower one is more work for CPUs.

I also realize how important it is, in group projects, for everyone to be happy and in the same boat. Displeased people can disappear, like atumanian who could not digest contempt. Often I ask myself if I should stay silent instead, but when I have a strong opinion I have trouble keeping it in. I hope this process is helpful.

NKONSTANTAKIS commented 4 years ago

I have an idea, what if we become more liberal and allow some flexibility of choice to the users?

Suitability of bounds depends a lot on the usage. For example, someone who wants to test many versions of guessed values could use wider bounds so as to not deprive others of resources. This also suits people with limited time and a lot of creative ideas. Similarly, someone who is reserved in testing, with a temperament for few and well-studied tries, could opt for more certainty.

I don't see why the optimal methodology of one strategy should limit the optimal methodology of another.

This also means bestowing trust and responsibility on the users. As long as everyone gets his share of resources, he can be free to use it as he considers best among similar options. This could make everyone happy.

@snicolet What do you think of this option? [-1.5 , 4.5] , [-1 , 4] , [-0.5 , 3.5] are essentially the same regarding ease of promotion to LTC.

Alayan-stk-2 commented 4 years ago

For example, someone who wants to test many versions of guessed values could use wider bounds so as to not deprive others of resources.

If you try many guessed values, and have wide bounds, you get garbage data, because the real elo variations between the different guesses will be typically significantly smaller than the error bars of the test with wide bounds.

If one changes values by very small amounts, this is similar to just trying the same STC test several times to get one passing, which goes right against the point of having testing bounds. If one changes values by bigger amounts, the big error margin will defeat the attempt to pinpoint the parameter interval where the attempted eval term does best (scaling concerns also hurt here).
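The "trying the same test several times" concern is simple arithmetic: if a single run passes with probability p, the chance that at least one of k independent runs passes is 1 - (1 - p)^k. A minimal sketch, assuming independent runs of effectively the same patch (the function name is hypothetical):

```python
def prob_at_least_one_pass(p_single, attempts):
    """Chance that at least one of `attempts` independent runs passes."""
    return 1.0 - (1.0 - p_single) ** attempts

# A patch sitting exactly at the lower bound passes 5% of the time per run
# when alpha = 0.05; repeated near-identical submissions erode that guarantee.
for k in (1, 3, 5, 10):
    print(f"{k:2d} attempts: {prob_at_least_one_pass(0.05, k):.3f}")
```

For a neutral patch tested against wide bounds, the single-run pass probability is higher than 5%, so the effect grows correspondingly faster.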

NKONSTANTAKIS commented 4 years ago

@Alayan-stk-2 The STC data is not very credible anyway. If the values are too close to each other, I agree that it's futile to test many of the same. But the likelihood of the better patch passing is still higher. Adding variance to STC means cutting STC resources and allowing more LTCs. Compare rows 3 and 6 of your table: STC [-0.5 , 4.5] and [-1 , 5], both paired with [0 , 3.5] LTC. Similar average total resource cost, but much less STC cost for the wider bounds. LTC data are much more credible. In other words, I find that up to an extreme point of randomness, the transfer of resources to LTC is beneficial for the overall quality. I don't think that this point of variance is nearly reached with [-1.5 , 4.5] or even [-2, 5].

Having said that, I understand that [-1 , 4] is a very positive change compared to [0.5 , 4.5] and a middle-ground compromise point. But it may not be resolving fast enough to allow a raise of STC. Between [-1 , 4] 10" and [-1.5 , 4.5] 15", I would consider the latter as more promising for LTC success.
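The trade-off between STC cost and total cost can be sketched with the usual Brownian-motion approximations for SPRT pass probability and expected length. This does not reproduce the table referred to above; it makes several simplifying assumptions: bounds are logistic Elo, per-game score variance is 0.1 at both time controls, the patch has the same true Elo at STC and LTC, and one LTC game costs 6 STC games (60+0.6 versus 10+0.1). All function names are hypothetical.

```python
import math

def elo_to_score(elo):
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_stage(true_elo, elo0, elo1, var=0.1, alpha=0.05, beta=0.05):
    """Return (pass probability, expected games) for one SPRT stage."""
    s, s0, s1 = (elo_to_score(e) for e in (true_elo, elo0, elo1))
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    c = (2.0 * s - s0 - s1) / (s1 - s0)
    drift = (s1 - s0) * (2.0 * s - s0 - s1) / (2.0 * var)  # mean LLR step
    if abs(c) < 1e-9:
        p_pass = -lower / (upper - lower)
        games = -lower * upper * var / (s1 - s0) ** 2
    else:
        p_pass = (1.0 - math.exp(-c * lower)) / (
            math.exp(-c * upper) - math.exp(-c * lower))
        # Wald's identity: E[final LLR] = drift * E[games].
        games = (p_pass * upper + (1.0 - p_pass) * lower) / drift
    return p_pass, games

def two_stage_cost(true_elo, stc_bounds, ltc_bounds, ltc_cost_factor=6.0):
    """Return (STC games, total cost in STC-game equivalents)."""
    p_stc, n_stc = sprt_stage(true_elo, *stc_bounds)
    _, n_ltc = sprt_stage(true_elo, *ltc_bounds)
    return n_stc, n_stc + p_stc * n_ltc * ltc_cost_factor

for stc in [(-0.5, 4.5), (-1.0, 5.0)]:
    n_stc, total = two_stage_cost(1.0, stc, (0.0, 3.5))
    print(f"STC {stc}: {n_stc:.0f} STC games, total cost {total:.0f}")
```

A fuller comparison would average these costs over an assumed distribution of patch strengths rather than a single true Elo, and would allow the true Elo to differ between time controls.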

vondele commented 4 years ago

@NKONSTANTAKIS can you try to write concise comments backed up with data? Otherwise, your contribution is not so helpful, at least to me. Statements like 'The STC data is not very credible anyway' really are pointless unless they are made precise and backed up with data.

NKONSTANTAKIS commented 4 years ago

@Alayan-stk-2 Seeing your analysis of the drawbacks of the 3-staged system, with which I agree, I am thinking that it partially also extends to the 2-staged. If we were to think of an ideal middle ground 1-staged system, how would that compare resource wise, confidence wise and scaling wise? I think a study on this is interesting.

@vondele Off the top of my head: are the data of a patch that was STC-negative and +12 elo at VLTC, and the recent V1 patches, not credible? With such an extreme elo-gaining example, is it so hard to imagine that we miss many scalers due to STC suppression? If STC data were very credible we wouldn't need LTCs at all, just extreme-confidence STCs. This was actually the common methodology of development in the past: running extreme numbers of games at very short TCs. But now they all turn to higher TC testing. Also in SF there were many representatives of this dogma, advocating that STCs are adequate for everything and LTCs a waste. Luckily mcostalba was a believer in quality of games and raised the LTC from 40" to 60". His words were "The sole purpose of STC is to act as a filter, for saving resources".

Now let's also use chess-related logic. I don't know if you closely watch SF games and if you are adept enough at chess to understand that SF terribly misplays certain positions. Numerous attempts were made to solve them, but how is it possible for them to show in games of average depth 12, when the flaws in SF's play are very deep?

I understand your theoretical/academic/scientific approach and I admire it; it is to step only on solid, well-tested and proven foundations. My personality lies on the other side of the spectrum: all my life I have taken calculated risks in everything I was doing. Experimenting instead of studying, doing things my own way, rebellious and anti-conformist. Having so often crossed the limits of beneficial risk taking, I developed a feel for it. For me theory is empty, I use the knife to cut the cake. By applying ideas in practice I check their value. I hate conservatism. If I was in charge of SF I would have tried many things, but fortunately I am allergic to tedious work. Due to my love of chess and chess engines, but most of all for my own enjoyment, I am watching and contemplating for long hours. Also the stuff that I write takes me many hours.

I am fully aware that some are annoyed, probably mostly by my temperament, but I present my views anyway as I have received positive feedback from people whom I value. It's a close call for me; I tend to expose myself emotionally too much, and I have often taken long breaks from commenting. All in all, I think that if someone doesn't like it he can ignore it.

MichaelB7 commented 4 years ago

I love your passion @nkonstantakis ... no need to change a thing ... "... For me theory is empty ..." Remember: "In theory, practice and theory are the same; in practice they are different" 😊

Vizvezdenec commented 4 years ago

Sorry, I understand your opinions etc., but honestly I would like to hear more from people who actually write patches. It's pretty easy to give advice when you have neither responsibility nor experience nor data to justify it.