regro / cf-scripts

Flagship repo for cf-regro-autotick-bot

make bot-share proportional to "Awaiting PR" #2612

Open h-vetinari opened 1 month ago

h-vetinari commented 1 month ago

Just noticed the following in the most recent bot-bot run:

      MigrationYaml-snappy120: 1 - gets 141.620771 seconds (1.966955 percent)
      MigrationYaml-numpy2: 20 - gets 226.593234 seconds (3.147128 percent)
      [...]
      MigrationYaml-python312: 120 - gets 1359.559402 seconds (18.882769 percent)

I don't know exactly how the bot shares are calculated, but this seems really wasteful to me, and also not very responsive to new migrations. In particular, snappy120 is an old migration where only one feedstock doesn't have a PR yet ("Not solvable" for tensorflow), and yet it receives about two thirds of the resources of the numpy2 migration, where we have 300 PRs awaiting action.

Also, old migrations like python312 constantly hogging ~20% of our resources despite very little movement seems like a waste to me as well. I'm not sure how easy it is to square all the circles here (e.g. maybe migrator age needs to be factored in), but if we were using a score per migration à la[^1]:

    |{Awaiting PR}| + 0.2 * (|{not solvable}| + |{bot error}|)

to determine bot-share, then IMO we'd be directing resources to where PRs are actually awaiting action, instead of letting near-finished or stalled migrations hog them.
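
As a rough sketch of how such a score could be turned into shares (illustrative only -- the status keys and example counts below are made up, and this is not how cf-scripts currently does it):

```python
# Illustrative only: status keys and example counts are made up, and this is
# not the bot's actual scheduling code.
def migration_score(counts: dict[str, int]) -> float:
    """Score = |Awaiting PR| + 0.2 * (|not solvable| + |bot error|)."""
    return counts.get("awaiting-pr", 0) + 0.2 * (
        counts.get("not-solvable", 0) + counts.get("bot-error", 0)
    )

def bot_shares(migrations: dict[str, dict[str, int]]) -> dict[str, float]:
    """Turn per-migration scores into fractions of the total time budget."""
    scores = {name: migration_score(counts) for name, counts in migrations.items()}
    total = sum(scores.values()) or 1.0
    return {name: score / total for name, score in scores.items()}

# With counts roughly matching the two migrations above:
print(bot_shares({
    "snappy120": {"awaiting-pr": 1, "not-solvable": 1},
    "numpy2": {"awaiting-pr": 300},
}))
# snappy120 would get well under 1% of the budget instead of ~2%.
```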

Within a migration I wouldn't want to deprioritize the error cases though; IMO the ordering within a migrator should be driven by the total number of children. What I want to avoid is that an early failure of a big blocker doesn't get retried until all the other small-fry feedstocks have made it through the queue.

[^1]: |{...}| is the cardinality of the set, i.e. how many feedstocks per status

CC @beckermr

beckermr commented 1 month ago

Yeah, I've messed with this a lot over the years. I've definitely made things proportional to waiting PRs, and that was a problem for other reasons. (The issue is that new migrations take over and block things if you make the share proportional to the number of waiting PRs.)
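
To illustrate the failure mode with made-up numbers (not from any actual run):

```python
# Made-up numbers, purely to illustrate why a share that is strictly
# proportional to waiting PRs lets a brand-new, large migration grab almost
# all of the cycles while the older migrations effectively stall.
waiting_prs = {
    "old-migration-a": 5,
    "old-migration-b": 12,
    "brand-new-migration": 800,
}

total = sum(waiting_prs.values())
shares = {name: count / total for name, count in waiting_prs.items()}
print(shares)
# brand-new-migration ends up with ~98% of the time budget.
```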

Also, the bot ramps up migrations automatically, so if you've just merged or restarted one, you might want to wait about six to ten hours and see what happens.

I'm honestly reluctant to adjust things now but open to ideas. Most of the time nobody notices, which is basically the point.

beckermr commented 1 month ago

One more comment: we used to not retry old migrations very much, and then we got complaints that once something was fixed, the bot didn't get back to it fast enough. It's kind of a crap shoot no matter what.

h-vetinari commented 1 month ago

> It's kind of a crap shoot no matter what.

Yeah, I get that optimizing for all cases is very tricky, which is why I'm so adamant about at least not wasting the scarce cycles we have on stuff that will 100% fail (#2613).

h-vetinari commented 1 month ago

One thing that would IMO be quite an easy win would be to open PRs for "not solvable" feedstocks after a given amount of time or number of retries. That way, the feedstock drops out of the bot's retry queue, and the problem is brought to the attention of the maintainers (rather than them having to go dig through the status page entrails). To me that'd be a double win.

The same reasoning can probably be applied to "bot error" as well.
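
A minimal sketch of the rule I have in mind (all names and thresholds here are hypothetical, not existing bot APIs):

```python
# Hypothetical sketch: none of these names exist in cf-scripts, and the
# threshold is a placeholder.
MAX_AUTOMATIC_RETRIES = 5

def should_open_informational_pr(status: str, retries: int) -> bool:
    """Escalate persistently failing feedstocks to a PR instead of retrying forever."""
    return status in {"not solvable", "bot error"} and retries >= MAX_AUTOMATIC_RETRIES

# Intended effect: once a feedstock trips this check, the bot would open a PR
# carrying the solver/bot error output (so maintainers see it in their repo)
# and drop the feedstock from the automatic retry queue.
```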

beckermr commented 1 month ago

Yes, this last thing we should do for sure. There are some edge cases around bot automerge, but those are easy to handle.

h-vetinari commented 1 month ago

Just realized that if we do that, it automatically also solves #2613 for free 🙃