official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

Simplifications and reintroductions #3073

Closed · unaiic closed this 4 years ago

unaiic commented 4 years ago

There is a concern about simplifications: before the changes in bounds, simplifications were rather easy to pass, so even good patches could be simplified away, and when they were retried as Elo gainers they passed again. Now, even with the stricter simplification bounds, there is some discussion regarding this. As happened in #3069, the tests passed the simplification bounds very easily. Despite that, it is not clear whether it is a good simplification, as the term was reintroduced (after having been simplified away) ~20 days ago (IMO it doesn't matter when it was introduced/simplified). The opposite happened in #3072: the term was simplified away only some days ago, yet when its reintroduction was tested it passed quite easily.

As a result, it is unclear whether the simplification bounds are strict/trustworthy enough. On the one hand, some Elo gainers can be simplified away easily (maybe rightly so, if after other changes such an Elo gainer no longer performs as it did); on the other hand, some recent simplifications can be reverted and gain Elo again, which suggests the removal wasn't a good one in the first place.

Could we arrange something to make these simplifications/reintroductions less "noisy"? We should be able to test something and, if it passes, merge it without wondering whether it is a good removal/reintroduction, regardless of the time that has passed; we should be confident enough that such a change will benefit SF.

mstembera commented 4 years ago

Would it make sense to make the Elo-gaining bounds and the simplification bounds one and the same? What I mean is that when simplifying something away, you would run master as an Elo gainer against the patch, to prove that master is still an Elo gain over the simplification.

unaiic commented 4 years ago

@mstembera So your idea consists of running the test with the simplification as the base and master as the new version, and if it passes it would mean that the simplification is not good because it has lost Elo, right? Thus a failed test would be a "good" result for the patch to be merged. IMO it's a good idea.

vondele commented 4 years ago

I'm somewhat puzzled by the results of these tests; the current bounds would make a successful addition followed by a successful removal quite unlikely.

Basically, a feature that has about a 35% chance of being added (0.75 Elo) has only a 2% chance of being removed. Even a feature that has only a 2% chance of being added (0.25 Elo) still has only a 35% chance of being removed. At 0.5 Elo a feature can be added or removed with the same likelihood (10%).

This assumes, however, that a single test of that feature is made at both STC and LTC, which does not quite reflect reality.

For simplifications it is important that the likelihood of passing for a 0 Elo change is large (currently 70%).
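
For reference, a minimal Monte Carlo sketch of how such pass probabilities can be estimated, assuming a simple trinomial BayesElo model with a fixed draw_elo (fishtest's SPRT calculator uses the pentanomial GSPRT, so it will not reproduce the numbers above exactly; the bounds in the usage example are purely hypothetical):

```python
import math
import random

def bayeselo_probs(elo, draw_elo):
    """Single-game win/draw/loss probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + draw_elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + draw_elo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_pass_probability(true_elo, elo0, elo1, alpha=0.05, beta=0.05,
                          draw_elo=250.0, max_games=400_000, runs=1_000):
    """Estimate the chance that SPRT(elo0, elo1) passes for a patch worth true_elo.

    All Elo values are treated on the BayesElo scale for simplicity.
    """
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    # Per-game log-likelihood-ratio increments of H1 (elo1) against H0 (elo0).
    w1, d1, l1 = bayeselo_probs(elo1, draw_elo)
    w0, d0, l0 = bayeselo_probs(elo0, draw_elo)
    inc = (math.log(w1 / w0), math.log(d1 / d0), math.log(l1 / l0))
    w, d, _ = bayeselo_probs(true_elo, draw_elo)
    passes = 0
    for _ in range(runs):
        llr = 0.0
        for _ in range(max_games):
            r = random.random()
            llr += inc[0] if r < w else inc[1] if r < w + d else inc[2]
            if llr >= upper:      # H1 accepted: the test passes
                passes += 1
                break
            if llr <= lower:      # H0 accepted: the test fails
                break
    return passes / runs

# Hypothetical example: chance that a 0.5 Elo feature passes simplification
# bounds of {-3, 1} (the actual fishtest bounds of the time may differ).
# print(sprt_pass_probability(0.5, -3.0, 1.0))
```

Plugging in the bounds actually in use should give the same ballpark as the percentages quoted above, though this simplified model will not match the calculator exactly.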

unaiic commented 4 years ago

@vondele True, it is somewhat strange. IMO a test such as #3072 should raise no problem or discussion: a test passed both STC and LTC bounds, and even though the term was removed some weeks ago, recent patches may have changed its importance. Regarding #3069, however, I don't get the point at all. It passed the simplification bounds very quickly with positive Elo (+0.5 and +0.3), but only one patch was merged after its reintroduction (I don't know what the probability is of a term being simplified "by luck" with such Elo finishes, but I guess it is very low). Maybe that patch was just "lucky", or maybe the other patch made this one less relevant and easier to remove. If that wasn't the case, it would be either just a coincidence or a sign that the bounds are not that good for simplifications, which, after the numbers you've given, I strongly doubt.

MJZ1977 commented 4 years ago

I have the feeling that the SPRT simulator (https://tests.stockfishchess.org/html/SPRTcalculator.html?) is somehow biased in comparison with fishtest reality, in particular regarding the probability of having several lucky/unlucky runs in a row. There should be some simple tests to verify the pass probability given by the simulator.

vondele commented 4 years ago

The simulator has been extensively verified by @vdbergh; I don't doubt it is right for its underlying assumptions.

I wonder if there are underlying assumptions that are not valid on fishtest right now, i.e. that the confidence is lower than we expect for whatever reason. At least in the past, we have measured quite carefully that fishtest is not biased (i.e. a master1 vs master2 test gives 0 Elo).

Edit: one more bias test: https://tests.stockfishchess.org/tests/view/5f4d05c7ba100690c5cc5d5a

vdbergh commented 4 years ago

Let's see what the test does. Of course there is always Murphy's law. In the past there have been tests of master against master which gave a significant result (the example I remember is from a very long time ago).

I wonder if Elo has not become slightly fuzzy due to hardware differences. If a patch is worth 0.5 Elo for one hardware mix and 0 Elo for another, then one may expect counter-intuitive results.

vondele commented 4 years ago

I doubt 'the hardware mix' is responsible for what we see for most of the patches; the mix is mostly important for SF11 vs SFdev testing. However, I wonder what would happen if, e.g., frequency scaling happens on a per-core level and is not the same for each core of the CPU. This would (randomly) affect the executables on those cores (a 10% frequency change, as is common, would be a very significant impact). Can you speculate what the effect of that would be?

vdbergh commented 4 years ago

I am relatively sure this would be harmless from the point of view of confidence. This kind of variation would create (unmodeled) noise. Thus the measured variance would increase and hence tests would become harder to pass. This is good for confidence (but bad for the efficiency of fishtest).

MJZ1977 commented 4 years ago

Yes, my doubts are more about assumptions that are not so easy to model in theory. That's why tests can be useful, for example 100 master vs master tests with large bounds [-5,0] at VSTC. It will not be a demonstration, but at least we would have an idea of the reality.

vondele commented 4 years ago

> I am relatively sure this would be harmless from the point of view of confidence. This kind of variation would create (unmodeled) noise. Thus the measured variance would increase and hence tests would become harder to pass. This is good for confidence (but bad for the efficiency of fishtest).

But what if this noise is not uncorrelated in time, e.g. if the same process is bound by the OS to the same core?

vondele commented 4 years ago

@MJZ1977 can you summarize the results of the tests from your point of view?

For me, we have 10/10 tests with the expected outcome (consistent with a 98.8% pass probability, for 0 Elo and {-5,-1}). At TC 2+0.02 the draw rate is 0.614 and the RMS bias 47, so we expect 10.6K games on average, and we have 8.9K. I would consider the latter to be close to the expected result (I'm not sure what the uncertainty of the mean should be; the distribution of #games is probably not really Gaussian).
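
For a rough cross-check of those two numbers, a Gaussian (Brownian-motion) approximation of the SPRT random walk can be worked out in closed form. The per-game-pair score variance used below (~0.046) is an assumption, roughly what the pentanomial counts of the bias test further down imply; fishtest's calculator instead derives it from the measured draw rate and RMS bias, so this is only an approximation:

```python
import math

def elo_to_score(elo):
    """Expected game score (0..1) on the logistic Elo scale."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_gaussian(true_elo, elo0, elo1, alpha=0.05, beta=0.05, pair_var=0.046):
    """Approximate pass probability and expected number of games (2 * pairs)."""
    lower = math.log(beta / (1.0 - alpha))      # ~ -2.944
    upper = math.log((1.0 - beta) / alpha)      # ~ +2.944
    s, s0, s1 = (elo_to_score(e) for e in (true_elo, elo0, elo1))
    # Per-game-pair mean (drift) and variance of the LLR random walk.
    drift = (s - 0.5 * (s0 + s1)) * (s1 - s0) / pair_var
    var = (s1 - s0) ** 2 / pair_var
    if abs(drift) < 1e-12:
        p_pass = -lower / (upper - lower)
        pairs = -lower * upper / var
    else:
        x = 2.0 * drift / var
        p_pass = (1.0 - math.exp(-x * lower)) / (math.exp(-x * upper) - math.exp(-x * lower))
        pairs = (p_pass * upper + (1.0 - p_pass) * lower) / drift   # Wald's identity
    return p_pass, 2.0 * pairs

p, games = sprt_gaussian(0.0, -5.0, -1.0)
print(f"pass probability ~ {p:.3f}, expected games ~ {games:.0f}")
# -> roughly 0.988 and ~10-11K games, in line with the figures quoted above.
```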

MJZ1977 commented 4 years ago

I planned to make a summary, but it is exactly what you wrote :-)

The most important thing for me is that the tests had a pass probability of 98.8% and that 10 out of 10 tests passed. So even if we are not sure of the 98.8%, the probability of a non-pass is probably very low.

My conclusion is that the tests are consistent with the SPRT simulator. It is not a perfect demonstration, but nothing shows a contradiction. If a test passes with a 1% chance, it is probably just very lucky...

vondele commented 4 years ago

Also the bias test is nicely consistent with the expected 0 Elo:

ELO: 0.07 +-0.9 (95%) LOS: 55.9% Total: 200000 W: 38624 L: 38584 D: 122792 Ptnml(0-2): 3502, 22898, 47109, 23040, 3451
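
For reference, that summary line can be reproduced from the pentanomial counts with the usual logistic-Elo and normal-approximation formulas; this is a standalone sketch, not fishtest's own code, so details such as rounding may differ:

```python
import math

def pentanomial_summary(ptnml):
    """Elo estimate, 95% error bar and LOS from game-pair counts (0, 0.5, ..., 2 points)."""
    pairs = sum(ptnml)
    scores = (0.0, 0.25, 0.5, 0.75, 1.0)         # pair score normalized to [0, 1]
    mean = sum(s * n for s, n in zip(scores, ptnml)) / pairs
    var = sum((s - mean) ** 2 * n for s, n in zip(scores, ptnml)) / pairs
    se = math.sqrt(var / pairs)                  # standard error of the mean score
    elo = -400.0 * math.log10(1.0 / mean - 1.0)
    # Convert the +-1.96*se interval on the score to an Elo error bar.
    elo_lo = -400.0 * math.log10(1.0 / (mean - 1.96 * se) - 1.0)
    elo_hi = -400.0 * math.log10(1.0 / (mean + 1.96 * se) - 1.0)
    los = 0.5 * (1.0 + math.erf((mean - 0.5) / (se * math.sqrt(2.0))))
    return elo, (elo_hi - elo_lo) / 2.0, los

elo, err, los = pentanomial_summary([3502, 22898, 47109, 23040, 3451])
print(f"ELO: {elo:.2f} +-{err:.1f} (95%)  LOS: {los:.1%}")
# -> approximately "ELO: 0.07 +-0.9 (95%)  LOS: 55.9%"
```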

vondele commented 4 years ago

Since this issue was opened, we have adjusted the SPRT bounds, and I assume that has made this less likely to occur. I'll close the issue.