official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0
11.3k stars 2.25k forks source link

adjusting throughput on number of tests #3234

Closed mstembera closed 3 years ago

mstembera commented 3 years ago

Since you locked the conversation here https://github.com/Vizvezdenec/Stockfish/commit/b5fc79f8adf0aaf2b3172c4a7221b1697d52eb32 I simply reply here: Unfortunately you confirm you think everything is about you.

Vizvezdenec commented 3 years ago

I simply will take an extended break and will see how you will substitute. Feel free to make PRs if anything passes and submit as much tests as you want (1 per day is what you usually do on average). I will be doing smth more interesting than arguing with almost inactive developers that don't even participate in any meaningful discussions like SPRT bounds and stuff despite being tagged there a few times. Feel free to do whatever you want w/o me "abusing system" and other stuff.

mstembera commented 3 years ago

Contributing to a project doesn't give you the right to put yourself ahead of others. Rest assured I won't do this

image

MaximMolchanov commented 3 years ago

So who will do the work @Vizvezdenec did then? And what are the advantages if this is not done?

Vizvezdenec commented 3 years ago

don't worry. According to fisherman it's better to not test anything than "abusing" system submitting 8 tests at idle time when no one will submit any test in next 5 hours. Cause you know idling fishtest is how stockfish improves. More idle time = more elo. Seems logical and I'm accepting this.

ianfab commented 3 years ago

Can't we just have a very pragmatic solution and significantly reduce throughput when multiple very similar variations are tested at the same time, e.g., something like throughput of variants of the same idea should add up to max 100% (or whatever number). When fishtest is idle it does not matter how low throughput is, the tests will anyway be picked up, and otherwise nobody else is blocked by the parallel attemps. It might not be a perfect solution, but it should be something that addresses the valid concerns of both of you (overuse of resources vs. idle workers).

gekkehenker commented 3 years ago

I don't see the problem with Viz applying many patch variations.

If you think Viz is "stealing" resources you can set the priority of his patches lower. Now Viz's patches will just be replaced by more net tunes. Unless you think so lowly of Viz, I don't see how that's a net gain.

noobpwnftw commented 3 years ago

Shouldn't worker suppliers have the ability to refuse running tests of which they considered crap?

I've seen worse, but there is a priority system and you never actually managed to overload the test queue except when the server is dead in recent years?

xoto10 commented 3 years ago

Maybe a limit on concurrent tests from one user would help? Extra tests automatically put to prio -1 until one of previous tests finishes? Or just automatically downgrade the throughput if more than 4 (*) tests from one user? e.g. 8 jobs at 50% throughput.

On Thu, Nov 26, 2020 at 3:28 PM noobpwnftw notifications@github.com wrote:

Shouldn't worker suppliers have the ability to refuse running tests of which they considered crap?

I've seen worse, but there is a priority system and you never actually managed to overload the test queue except when the server is dead in recent years?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/official-stockfish/Stockfish/issues/3234#issuecomment-734360402, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFTEM7DFXT3VUCWB2AFJMR3SRZX2XANCNFSM4UCK7ROQ .

mstembera commented 3 years ago

@ianfab @gekkehenker You both suggest lower throughput/priority just as I have suggested here https://github.com/Vizvezdenec/Stockfish/commit/b5fc79f8adf0aaf2b3172c4a7221b1697d52eb32#commitcomment-44524165 but you can read the response below.

@noobpwnftw We are all extremely grateful for the valuable resources you provide and hope we don't take them for granted or waste them. As such I automatically support any suggestions you have. The queue doesn't get overloaded thanks to you. The issues isn't that there are too few resources but that one user is selfishly grabbing more than 50% for himself. As a result all other users have to wait twice as long for their results hindering their ability to contribute.

An easy automatic solution is to assign fishtest resources per user instead of per test.

adentong commented 3 years ago

This is honestly such a pointless argument... STCs usually fail quickly anyways and it's better to keep fishtest running at full throttle than leaving it half empty. I'm sorry but I really fail to see your point of view @mstembera What you're saying would make sense if fishtest is super overloaded with a crapton of tests already, and then Viz submits a dozen tests on top of that, then maybe I can kinda see where you're coming from, but tbh fishtest almost never gets overloaded to such an extent...

adentong commented 3 years ago

But instead of directing the argument at Viz, maybe it's more productive to have an actual discussion on prio and throughput. @vondele What do you think?

MaximMolchanov commented 3 years ago

I agree with @adentong . Any test (even failed one) is a contribution to Stockfish. Someone has to do such job as launching a lot of tests. I don't think it's constructive to focus on who exactly run the test.

mstembera commented 3 years ago

@adentong I think you fail to understand that fishtest is always running at full throttle. The issue is that a majority of the resources get diverted to one selfish user who schedules 6 versions of the same patch. This is hurting the productivity of everyone else overall.

MaximMolchanov commented 3 years ago

@mstembera Weren't these tests launched at night when fishtest was free enough?

Vizvezdenec commented 3 years ago

All you need to know that this were not 6 versions of the same patch.
It were 4 versions + 2 versions of completely different patch which I was too lazy to make separate name for so made 5 6 instead of #patchname 1 2 - it was almost in the same place of search.cpp and I'm usually reluctant to make new patch names if I'm not making full reset of code. But because some approvers tend to approve a lot of stuff not even making a slight glance on what is written (and then I need to stop this and inform author that either this logic is corrupted / it was recently tried / that it's not good to try 22nd patch on the same idea that never produced a single green stc, etc) they think that it's 6 versions of the same patch. And then they try to teach everyone how to operate fishtest telling about some mythical "full throttle" while every person who actually speaks with noob reasonably frequently knows he has like 10-20k cores he can launch on fishtest at least for like 8 hours/day effectively increasing it calculating power x5 - but there are not enough tasks to load them so he does it only by personal demand (for example, in discord). And he did this multiple times when we actually had flurry of patches loading fishtest up to 50 tasks - which this cores managed to completely clean in 3 hours (and people were submitting new and new tests!). If anything we can write 3x patches, but I understand why noob wants to not give his cores on some tests that can't be committed or useless tuning sessions that produce nothing (especially for classical eval which is kinda obviously a dead end nowadays). Before making some claims at least get some knowledge or you will look stupid.

mstembera commented 3 years ago

@MaximMolchanov What is at night? We are all in different time zones but you can see the status from the screen shot above. Anyway he does this regularly.

@Vizvezdenec "2 versions of completely different patch"? Please stop lying. Anyone can see for themselves these were all versions of the same patch. All comparing lmrDepth, (contHist[0])[movedPiece][to_sq(move)], and (contHist[1])[movedPiece][to_sq(move)] just to different constants. https://tests.stockfishchess.org/tests/view/5fbdca4267cbf42301d6b268 https://tests.stockfishchess.org/tests/view/5fbdca7767cbf42301d6b26a https://tests.stockfishchess.org/tests/view/5fbdcab967cbf42301d6b26c https://tests.stockfishchess.org/tests/view/5fbdcaf667cbf42301d6b26e https://tests.stockfishchess.org/tests/view/5fbdcb5367cbf42301d6b270 https://tests.stockfishchess.org/tests/view/5fbdcb8967cbf42301d6b272

adentong commented 3 years ago

@mstembera I'm sorry if what I'm about to say may offend you, but at this point I think you either have a personal vendetta against Viz, despite your claim otherwise, or you're too stubborn and too hung up on technicalities. Six variations of the same idea that either passes or fails in 20k or 30k games is seriously not even going to put a dent in fishtest's efficiency and capability, and that's not even considering the fact that noob is willing to temporarily inject his fleet if ever fishtest doesn't have enough cores to resolve the patches quickly enough. I still think that this whole thread is completely pointless, and as such I will not be making any further comments because I've said all I wanted to say.

gvreuls commented 3 years ago

I think this is a bit fishy, I have nothing against @Vizvezdenec and do appreciate his extensive contributions to the project, but I do agree with @mstembera that running 12 tests simultaneously is a bit much. My point isn't that the framework can't handle one dev doing this but if one dev is allowed to do this you can't stop other devs from doing the same thing and also run a dozen tests simultaneously, and if more devs start doing this it would affect the framework. This very thread proves that this can lead to conflict among the devs, so there probably should be some rule/maximum.

vondele commented 3 years ago

I was hoping that this kind of discussions would be unnecessary, and that we can be self-organized based on being reasonable and flexible, for example by self-regulating and setting the priority on our own tests manually. I usually don't like strict rules, as they get in the way of making progress. Additionally, in the past years, in large part thanks to @noobpwnftw generous support of the project, waiting/running times have always been relatively short.

Yet, I also observe that that some of our most productive devs have many more tests in the queue. On the one hand, this is a good thing, it is how we make progress. On the other hand, we should make sure that those that have a different development style, submitting only a few patches per week, have also a satisfying testing experience. Ultimately, fishtest crunches through all patches submitted, so each dev essentially gets whatever 'share' he wants. I definitely don't think fishtest should be 'developer' centric, right now it is patch centric, each patch gets its fair share.

So, I don't think we should adjust priorities out of 'fairness' but rather as a tool on our side to make testing attractive for new/irregular devs, and provide an incentive for those to be active, i.e. grow our pool of developers. That is an important long term strategy.

As such I could consider adjust priority based on the number of patch in the queue by a given user. Some options could be

SFisGOD commented 3 years ago
  • a boost for a test coming from a user with a single active test

@vondele There was a proposal back then by @MJZ1977 here https://github.com/official-stockfish/Stockfish/issues/3066 but you rejected that idea. Anyway, many devs would be happy with that proposal since I think devs who prefer to submit only 1 or 2 outnumber those who submit greater than 4. They just need to come back from their long vacation. @31m059 I hope you are doing well :)

Vizvezdenec commented 3 years ago

Sorry but "pruning moves with clean countermove history" (idea taken from laser chess code) and "pruning earlier moves with really bad countermove history" (idea created by me) are different ideas, at least in my opinion. At least if you add THEM BOTH they wouldn't intersect. So I see it as 2 different ideas on refinement of countermove based pruning.
Also what I can say - you can approve or not approve as much as you want. Or approve 5 tests and then approve 5 more test within 4 more hours. I will keep submitting tests as I do because all your logic is based on claim that "fishtest lacks resources" - which is, as I said, just false - and I still see you silently ignoring this part of my speech twice (nice technique in arguing, but you are not going to fool anyone with it). If anything it lacks active developers. Needless to say that when I hit 10 elo gainers in span of 40 days for some reason no one told me that I'm doing smth wrong (and I submitted like 3x of tests on average compared to current days) and now when I'm on a cold streak you happily appear to say smth. Anyway, I don't want and wouldn't continue discussion on this topic and will remain with my opinion. So I will mute this thread and mute it author personally to not receive any meaningless input. Stay tuned to my next attempts to improve stockfish play after I finish my break. PS currently we have 14 active tests, 5 of which are about classical eval (dead end which should not really be 100% priority ever because it's not really improving stockfish play in main mode), 2 tunings (net that was tuned and retuned 55 times and another classical eval term) and 2 simplifications - so basically 9 / 14 tests are not really going to bring us any elo, thus 70% of fishtest is loaded with stuff... Not really useless, but smth that wouldn't really improve stockfish. At least it's not filled with my useless tests I guess.

MJZ1977 commented 3 years ago

I agree all what @vondele said. It is better to make some new rules, flexible, and without necessary changing fishtest source code. Another possibility to those listed by vondele is to limit each user to 3 or 4 tests, all his other tests are set to priority = -1. By the way, I took a long vacation from SF programming because of new baby growing and overcharge at work. But I should be back soon.

SFisGOD commented 3 years ago

By the way, I took a long vacation from SF programming because of new baby growing

Congratulations on your new baby!!

SFisGOD commented 3 years ago

net that was tuned and retuned 55 times

I will try to explore other softwares for tuning. The 16k parameters of the 512 x 32 layer are not tuned yet so I'm targeting that.

Vizvezdenec commented 3 years ago

Also if anything problem with classical eval tests is that they are THE most heavy tests in terms of games needed to converge. Because they have much lower drawrate but the same SPRT bounds - thus needing 3x times to converge at LTC (!!!). So we have absolutely paradox of a situation - tests that tend to improve not the mode in which stockfish plays the strongest consume x3 resources at LTC (and x1.8 at STC) - and are running with the same prio. If anything this should be the thing to resolve first.

Vizvezdenec commented 3 years ago

I think this is a bit fishy, I have nothing against @Vizvezdenec and do appreciate his extensive contributions to the project, but I do agree with @mstembera that running 12 tests simultaneously is a bit much. My point isn't that the framework can't handle one dev doing this but if one dev is allowed to do this you can't stop other devs from doing the same thing and also run a dozen tests simultaneously, and if more devs start doing this it would affect the framework. This very thread proves that this can lead to conflict among the devs, so there probably should be some rule/maximum.

if every dev would submit 12 patches then we probably will be finally convince noob to land fleet on fishtest on a regular basis. You are thinking by categories of 2018 where our progress was hindered by lack of hardware - nowadays our progress is mostly hindered by lack of ideas. Trust me I could've submitted 3x more patches and fishtest with noob targeted help would crunch this easily, just that submitting 700 reasonably meaningful tests / month is gigantic load even for my passion. If anything I'm pretty exhausted even now, I needed this weekly break. I repeat - we are not lacking hardware. We are lacking ideas. No matter how you try to twist it this is reality.

mstembera commented 3 years ago

@Vizvezdenec I see you continue with your strategy of lying. I never once claimed "fishtest lacks resources". I claim you use them selfishly at the expense of other devs. I find it fascinating that you care about fishtest resources when it comes to everything else but your own actions.

@vondele I would have also preferred people being reasonable but even now the person in question continues to say he won't change his behavior. Your priority proposals are the best solution IMO. The second scheme seems the better of the two as it takes N into account. Do you have any thoughts on people submitting many versions of the same patch simultaneously?

Sopel97 commented 3 years ago

Is this really an issue with testers, or maybe an issue with approvers? Why can't there be multiple tests on the same idea in the waiting queue? It does make sense to schedule multiple tests and let approvers decide what should run in parallel and what should run in sequence. I definitely agree that running multiple tests in parallel that touch the same code is bad, even if only because it's hard to decide what to do if both pass, but testers have no say in whether the tests are run in parallel or not, so let's not blame them.

mstembera commented 3 years ago

@Sopel97 Historically the only purpose for approvers is to prevent someone from submitting malicious code. If it wasn't for that we wouldn't even have approvers. Approvers have done other things like help people catch bugs etc. but that's beyond the job mandate. I think users can do a better job deciding in which order they want their patches to run. Often people schedule with priority -1 (for example LTC when STC looks like it will pass) so when other related things finish they are already preapproved and can either set to priority 0 or cancel the test.

Vizvezdenec commented 3 years ago

@Vizvezdenec I see you continue with your strategy of lying. I never once claimed "fishtest lacks resources". I claim you use them selfishly at the expense of other devs. I find it fascinating that you care about fishtest resources when it comes to everything else but your own actions.

@vondele I would have also preferred people being reasonable but even now the person in question continues to say he won't change his behavior. Your priority proposals are the best solution IMO. The second scheme seems the better of the two as it takes N into account. Do you have any thoughts on people submitting many versions of the same patch simultaneously?

Funny joke, but the only reason to not use resources selfishly at the expense of other devs is if we actually lack resources. Otherwise I'm not using them at the expense of other devs by definition, because if we have surplus of resources (which I point you for the 3rd time) you can't use them "at the expense". Also "fishtest is always running at full throttle" is not my wording. I see you continue your strategy of trying to twist meaning of what you say so you can sound less of a [censored word]. Meanwhile we against have 70-80% of fishtest being classical eval patches and someone even submitted speculative LTC for failed {-0,25, 1,25} test. Interesting that this person then calls me selfish.

SFisGOD commented 3 years ago

I think we should change SPRT bounds for classical tests. These tests take too many games to resolve. Submitted a fishtest issue here https://github.com/glinscott/fishtest/issues/870

MJZ1977 commented 3 years ago

@mstembera @Vizvezdenec : I think we need to stay positive and look forward. If the rules are modified, the problem shoud be solved.

mstembera commented 3 years ago

@Vizvezdenec Not true again. No matter how many resources fishtest has if you schedule over 50% to yourself everybody else's patches will take twice as long to finish.

@MJZ1977 Thanks.