graphile / worker

High performance Node.js/PostgreSQL job queue (also suitable for getting jobs generated by PostgreSQL triggers/functions out into a different work queue)
http://worker.graphile.org/
MIT License

Option to keep completed jobs for a while #228

Closed stephenh closed 3 years ago

stephenh commented 3 years ago

Feature description

Add an option to keep a completed job for a certain time period, e.g. ~24 hours.

Motivating example

We'd like to use job keys to de-dup incoming web hooks from external vendors, i.e.:

Breaking changes

Should be none, b/c it can be opt-in.

Supporting development

I:

benjie commented 3 years ago

Please use a separate table to track this; keeping jobs around unnecessarily compromises worker performance.

stephenh commented 3 years ago

Ah yeah, that's fair. Pedantically, from the overall system's perspective, I think this just moves where the performance compromise lives: instead of contention on the jobs table itself, the same de-duping + N-hours-of-history logic moves to some other table that exists solely as a mini re-implementation of graphile-worker's job-key support.

But I will grant that, for use cases that allow/facilitate it, tracking idempotency is ideally done on the existing business/domain tables directly. It just seems like that is not always possible, and I've seen other job queues support this, e.g. pg-boss iirc does, maybe as an optional feature or maybe even by default? Not sure.

But yeah, that's fine; we don't need this immediately; and if we ever do, maybe we'll try doing it as a small mini-fork b/c I don't think the performance concerns would be a problem for our volume/use cases.

Thanks!

benjie commented 3 years ago

Yeah; sorry I didn't mean it'd compromise performance for you (though I would guess that you'd probably have better performance with this split solution anyway due to your deduping table being much smaller and much less in demand than the worker table), more that I don't want to do anything in worker that encourages the jobs table to not be empty because a lot of people think they need this but it's normally better solved a different way. I also don't want to add complexity to the worker where it has to check to see whether it should keep jobs around or not, and I don't want to add a system that checks how long the completed jobs have been around for and then cleans them up. Postgres is not the ideal location for a job queue, so we have to be pretty strict around what we do/don't add to make sure we're squeezing as much performance out of it as we can - every branch matters!

You are of course welcome to fork worker to achieve your use case, but I'd personally just make a separate table and an alternative function (create_job or similar) that calls add_job under the hood but with your additional requirements - this'll reduce your maintenance burden over time.
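A minimal TypeScript sketch of the wrapper idea described above, under stated assumptions: `DedupStore`, `InMemoryDedupStore`, `createJob`, and the `AddJob` callback type are hypothetical names for illustration and are not part of graphile-worker. The in-memory map stands in for the separate de-duping table. With a real database, the claim would more likely be an insert that ignores conflicts, run in the same transaction as the call to `add_job`, so that a failed enqueue does not leave a claimed key behind.

```typescript
// Hypothetical callback standing in for whatever enqueues the job
// (for example, a thin wrapper around graphile-worker's add_job).
type AddJob = (identifier: string, payload: unknown) => Promise<void>;

// Hypothetical abstraction over the separate de-duping table.
interface DedupStore {
  // Records the key; resolves to true only if the key was not already
  // seen within the retention window.
  claim(key: string, now: number): Promise<boolean>;
  // Deletes entries older than the retention window; resolves to the count removed.
  prune(now: number): Promise<number>;
}

// In-memory stand-in for the de-duping table (assumption: a real
// implementation would use a small Postgres table keyed by the dedup key).
class InMemoryDedupStore implements DedupStore {
  private seen = new Map<string, number>();

  constructor(private readonly retentionMs: number) {}

  async claim(key: string, now: number): Promise<boolean> {
    const seenAt = this.seen.get(key);
    if (seenAt !== undefined && now - seenAt < this.retentionMs) {
      return false; // duplicate within the retention window
    }
    this.seen.set(key, now);
    return true;
  }

  async prune(now: number): Promise<number> {
    let removed = 0;
    this.seen.forEach((seenAt, key) => {
      if (now - seenAt >= this.retentionMs) {
        this.seen.delete(key);
        removed += 1;
      }
    });
    return removed;
  }
}

// The "alternative function": checks the de-duping store first and only
// enqueues when the key is new. Resolves to true if a job was enqueued.
async function createJob(
  store: DedupStore,
  addJob: AddJob,
  identifier: string,
  dedupKey: string,
  payload: unknown,
  now: number = Date.now()
): Promise<boolean> {
  const isNew = await store.claim(dedupKey, now);
  if (!isNew) {
    return false; // webhook already seen recently; skip enqueueing
  }
  await addJob(identifier, payload);
  return true;
}
```

Because the history lives outside the jobs table, the worker can still delete completed jobs immediately, and `prune` can run on whatever schedule suits the retention period (e.g. the ~24 hours from the original request).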

stephenh commented 3 years ago

make a separate table and an alternative function (create_job or similar) that calls add_job under the hood but with your additional requirements

Ah yeah, that makes sense. Thanks for the suggestion!