riverqueue / river

Fast and reliable background jobs in Go
https://riverqueue.com

uniqueness does not work at scale #446

elee1766 closed this issue 1 month ago

elee1766 commented 2 months ago

This issue is a follow-up to https://github.com/riverqueue/river/discussions/346

@brandur gave me this recommendation:

> In your case, an alternative: drop the uniqueness checks and then implement your job such that it checks on start up the last time its data was updated. If the update was very recent, it falls through with a no op. So you'd still be inserting lots of jobs, but most of them wouldn't be doing any work, and you wouldn't suffer the unique performance penalty.
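roughly, that pattern looks like the sketch below - `RefreshArgs`, `lookupLastUpdated`, and `doExpensiveRefresh` are hypothetical stand-ins for our app code, not river api:

```go
package example

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

// RefreshArgs is an illustrative job args type, not anything from this repo.
type RefreshArgs struct {
	ResourceID int64 `json:"resource_id"`
}

func (RefreshArgs) Kind() string { return "refresh" }

type RefreshWorker struct {
	river.WorkerDefaults[RefreshArgs]
}

func (w *RefreshWorker) Work(ctx context.Context, job *river.Job[RefreshArgs]) error {
	// hypothetical lookup of when this resource's data was last updated,
	// e.g. a SELECT of the row's updated_at column.
	updatedAt, err := lookupLastUpdated(ctx, job.Args.ResourceID)
	if err != nil {
		return err
	}
	// if the update was very recent, fall through as a no-op so duplicate
	// jobs finish cheaply without repeating the expensive work.
	if time.Since(updatedAt) < 15*time.Minute {
		return nil
	}
	return doExpensiveRefresh(ctx, job.Args.ResourceID)
}

// stubs standing in for application code.
func lookupLastUpdated(ctx context.Context, id int64) (time.Time, error) {
	return time.Time{}, nil
}

func doExpensiveRefresh(ctx context.Context, id int64) error { return nil }
```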

however, this solution currently schedules hundreds of inserts per second across our clusters, which causes a lot of extra load across all the job logic plus the notifier.

more importantly, we have ~200-400k unique units of work every hour or so, and we would really like them to be done every 15 minutes. without a uniqueness filter, river schedules millions of units of work every hour. those do end up getting deduplicated at work time, but at the expense of a large amount of db work that slows down other calculations and routines, which causes a vicious cycle: more jobs don't get completed, and more jobs pile up.

a side effect is that the few places where we do schedule unique jobs become very slow, so we basically can't use the unique feature in any job without fear of those scheduling operations taking multiple seconds because of all the other activity in the jobs table.

we could move river to a separate postgres cluster, but at that point we would migrate away from river entirely, because the advantage of it running in the same database as our data would be gone.

for now we are likely going to implement our own hooks on top of the existing river client, using InsertTx to avoid scheduling tasks when we don't need to - but it really feels like a weakness of river's unique insert feature. i'm still not really sure who it's for, since it can't scale to any reasonable throughput, and it's also missing a good number of features that come standard in other work queues (the most obvious that comes to mind is uniqueness on a subset of args).
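the hook would be roughly this shape - `recentlyScheduled` is a hypothetical app-side check (e.g. against a small "last scheduled" tracking table), not a river api:

```go
package example

import (
	"context"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// insertIfNeeded skips InsertTx entirely when the app already knows the
// work was scheduled recently, so no row ever touches the jobs table.
func insertIfNeeded[T river.JobArgs](
	ctx context.Context,
	client *river.Client[pgx.Tx],
	tx pgx.Tx,
	args T,
) error {
	if recentlyScheduled(ctx, tx, args.Kind()) {
		return nil
	}
	_, err := client.InsertTx(ctx, tx, args, nil)
	return err
}

// stub standing in for an application-side check.
func recentlyScheduled(ctx context.Context, tx pgx.Tx, kind string) bool {
	return false
}
```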

it would be really nice if there were some sort of uniqueness mechanism that didn't use advisory locks. for instance, a nullable unique column on the jobs table with a user-definable id supplied at insert time immediately comes to mind. that would let me de-duplicate tasks by a subset of arguments plus a time interval/sequence id, which is more than enough for me.
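to illustrate (purely hypothetical, not a river feature today), the key could be derived from a subset of args plus a 15-minute bucket:

```go
package example

import (
	"fmt"
	"time"
)

// dedupeKey builds an illustrative user-definable uniqueness key from a
// subset of job args plus a 15-minute time bucket.
func dedupeKey(resourceID int64, now time.Time) string {
	bucket := now.Unix() / int64((15 * time.Minute).Seconds())
	return fmt.Sprintf("refresh:%d:%d", resourceID, bucket)
}
```

two inserts with the same key in the same window would then collide on the unique column, and the second becomes a cheap no-op (e.g. ON CONFLICT DO NOTHING), with no advisory lock involved.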

brandur commented 2 months ago

I'm going to look into this, but although we can speed it up, I'm a bit worried that it'll be hard to get to something that works well for you — it sounds like your app is fundamentally churning through so much work that a very busy DB will be somewhat inevitable.

brandur commented 2 months ago

Opened https://github.com/riverqueue/river/pull/451. Should make unique insertions something like 20-45x faster as long as you stay within the default set of unique states.

bgentry commented 1 month ago

I think the changes in #451 (shipped in v0.10.0) are a massive improvement in unique job performance, if you can stay within the bounds of that happy path. Let us know how it goes if you give it a try! :pray:
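For reference, a unique insert on that happy path looks something like this sketch: `ByArgs` plus `ByPeriod`, leaving `ByState` at the default set of unique states (`RefreshArgs` is the illustrative type from the sketch above):

```go
package example

import (
	"context"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// insertUniqueRefresh stays on the happy path: ByArgs plus ByPeriod with the
// default set of unique states (no custom ByState).
func insertUniqueRefresh(ctx context.Context, client *river.Client[pgx.Tx], id int64) error {
	_, err := client.Insert(ctx, RefreshArgs{ResourceID: id}, &river.InsertOpts{
		UniqueOpts: river.UniqueOpts{
			ByArgs:   true,
			ByPeriod: 15 * time.Minute,
		},
	})
	return err
}
```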

elee1766 commented 1 month ago

super excited. we are within this happy path, so i expect it to speed up our scheduling by a lot.

@bgentry this is a little off topic maybe, but how do you recommend people do long-term job metrics?

do you think we should write something that scans the river jobs table and exports prometheus metrics (like river-prometheus-exporter), or instrument our workers similar to how we instrument tracing (wrapping work functions with tracing instrumentation)?

i'm not too sure which matches the vision you had for river, so we haven't made a move here yet.

bgentry commented 1 month ago

My 100% recommendation is to instrument the workers or use the client subscriptions to do this kind of metrics work. As your job table grows, scanning it in any way other than with the exact queries used by River is going to have severe performance impacts.
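As a sketch of the subscription approach, assuming the pgx driver (the metric name here is illustrative, not a River convention):

```go
package example

import (
	"github.com/jackc/pgx/v5"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/riverqueue/river"
)

// jobsFinished counts finished jobs, labeled by kind and final state.
var jobsFinished = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "river_jobs_finished_total",
	Help: "Jobs finished, labeled by kind and final state.",
}, []string{"kind", "state"})

// watchJobEvents consumes the client's subscription channel and increments
// Prometheus counters for completed and failed jobs.
func watchJobEvents(client *river.Client[pgx.Tx]) {
	events, cancel := client.Subscribe(
		river.EventKindJobCompleted,
		river.EventKindJobFailed,
	)
	defer cancel()

	for event := range events {
		jobsFinished.WithLabelValues(event.Job.Kind, string(event.Job.State)).Inc()
	}
}
```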