Closed · elmigranto closed this issue 4 years ago
If I am reading everything correctly (instead of trying to remember things from the last time I dug in :), just `await pgboss.start(); await pgboss.stop()` would do the trick. Looks like `supervise` (with those upkeep queries) runs right away! Not sure if it's meant to be public API (it's not in the usage docs at the moment), but I'm wondering if that could be useful to expose?
> Looks like `supervise` (with those upkeep queries) runs right away!

Though that promise does not seem to get returned, so in order to be sure it is done, maybe I can provide some kind of wrapper on top of the regular `executeSql`… OTOH, I think I could just run `purge` / `archive` / `expire` directly. What's their status on being "Official Public API"? :)
🤔
I think exporting the monitoring funcs from boss.js directly would be ideal for your use case. You'd still need to be concerned about running them concurrently, so extra caution would be warranted.
> I think exporting the monitoring funcs from boss.js directly would be ideal for your use case.
I agree, yeah, and decided to call `Boss#archive()`, `Boss#expire()` and `Boss#purge()` manually. I've also found `Boss#countStates()` useful to call myself instead of on a timer. In fact, I ended up not using any of the `Boss` methods (except explicitly) this time :)
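As an illustration of that approach, here is a sketch of calling the maintenance methods explicitly, assuming access to the internal `boss` instance (the `PgBoss#boss#<method>` access mentioned in the next comment); this is undocumented and version-dependent:

```js
// Sketch only: call the internal maintenance methods explicitly instead of
// relying on the timer-based supervise loop. `pgboss.boss` is an internal,
// undocumented property here, so this may break between versions.
async function runMaintenance(pgboss) {
  await pgboss.boss.expire();  // fail jobs past their expiration
  await pgboss.boss.archive(); // move completed jobs to the archive
  await pgboss.boss.purge();   // drop old archived jobs
  const counts = await pgboss.boss.countStates();
  console.log('queue state counts:', counts);
}
```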
Would a PR exporting them directly on the `PgBoss` instance be a welcome addition, or is accessing `PgBoss#boss#<method>` okay? In any case, would you say treating either as public API is a good idea on my part? (Those are not explicitly documented, but, say, event names are.)
Yes, a PR is welcome. This would likely be just like how the manager api is promoted, right?
I would think so, yeah. `promoteFunction` is already there, so it makes sense to use that. (I think it would be helpful to keep a map of things already promoted in there, so we don't accidentally overwrite something, and maybe add some kind of check to not export privates, e.g. skip names starting with an underscore, if you do that kind of thing. But other than that, yeah, absolutely!)
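Purely illustrative: the kind of guard described above might look something like this (the signature and names are hypothetical, not the actual `promoteFunction` in pg-boss):

```js
// Hypothetical sketch of guarding promotion against overwrites and against
// exposing private methods. Not the actual pg-boss implementation.
const promoted = new Set();

function promoteFunction(target, source, name) {
  if (name.startsWith('_')) {
    throw new Error(`refusing to promote private method: ${name}`);
  }
  if (promoted.has(name) || name in target) {
    throw new Error(`method already promoted or present: ${name}`);
  }
  target[name] = (...args) => source[name](...args);
  promoted.add(name);
}
```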
Having a failing test would be better. ;)
I came here with the same question as we have some auto-scaling instances and would prefer to avoid having duplicative `pgboss.start()`s and monitoring queries running.
What would happen if `pgboss.connect()` was called (and used to create subscriptions and enqueue jobs) before `pgboss.start()`?
The only potential race condition problem in this setup is when you decide to upgrade pg-boss to a new version which contains an auto-migrated schema change. If you try and `connect()` before `start()` has had a chance to migrate the database, `connect()` will bail out with a version mismatch error.
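Illustrative only: one way a worker instance could cope with that at boot, assuming the 3.x behavior where `connect()` rejects when the schema version doesn't match. The retry helper below is an assumption, not part of the pg-boss API:

```js
// Sketch: keep retrying connect() until a supervisor's start() has finished
// migrating the schema. The backoff values and error handling are arbitrary.
async function connectWhenReady(boss, attempts = 30, delayMs = 2000) {
  for (let i = 0; i < attempts; i++) {
    try {
      await boss.connect();
      return;
    } catch (err) {
      console.warn(`connect() failed (${err.message}); retrying...`);
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
  throw new Error('gave up waiting for the pg-boss schema to be ready');
}
```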
If we tried the recommendation of calling `await pgboss.start(); await pgboss.stop()` first, would that mitigate that race condition?
I'm thinking each instance could call this (to ensure upgrades happen) and then only a single instance would be responsible for calling the "real" `await pgboss.start()`. I can't guarantee the order these instances spin up though.
What would you recommend for dealing with this?
I would recommend designating a supervisor process responsible for monitoring pg-boss expiration and archiving operations via `start()` (the "real" one, as you mentioned). You should feel free to have any number of instances use `connect()` without worrying about if and when `start()` is called. These are not dependent on each other.
When a new version of pg-boss is released which involves a schema change, you should stop all instances, run the supervisor with `start()`, wait until it has finished upgrading, then patch and restart all other instances with `connect()`.
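A minimal sketch of that split, assuming the pg-boss 3.x API where `start()` runs migrations and maintenance while `connect()` only attaches to an existing schema. The queue name and handler are placeholders:

```js
const PgBoss = require('pg-boss');

// Supervisor process: owns schema migrations and the maintenance loop.
async function runSupervisor(connectionString) {
  const boss = new PgBoss(connectionString);
  await boss.start(); // migrates the schema and starts supervise
  return boss;
}

// Worker/API process: only connects, subscribes, and publishes.
async function runWorker(connectionString) {
  const boss = new PgBoss(connectionString);
  await boss.connect(); // bails out with a version mismatch if the schema is stale
  await boss.subscribe('some-queue', async job => {
    // handle job.data here
  });
  await boss.publish('some-queue', { hello: 'world' });
  return boss;
}
```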
3.2.1 has the boss management functions (shown below) exported in the root module now. You would use these with `connect()`, not `start()`.

- `expire()`
- `archive()`
- `purge()`
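A short sketch of using those from a single maintenance node, assuming the 3.2.1 API where they are promoted onto the `PgBoss` instance and `connect()`/`disconnect()` manage the connection:

```js
// Sketch only: run the maintenance operations explicitly from one node,
// using connect() rather than start() so no supervise timers are created.
async function maintain(boss) {
  await boss.connect();
  await boss.expire();  // fail jobs that have exceeded their expiration
  await boss.archive(); // move completed jobs into the archive
  await boss.purge();   // delete old rows from the archive
  await boss.disconnect();
}
```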
Can you elaborate on (or document) how these functions are intended to work? At first glance at the code, it seems like they don't provide much benefit for managing them yourself vs. letting the lib run supervise on `start()`?
Ultimately, what I'd like to be able to do is define separate archive retention configurations on a queue-by-queue basis. I have some queues which I don't really need any archive for... but then I have other, more important ones that I'd like to keep in the archive for a week or so for debugging.
Any recommendations on an approach that might work there? I'm also happy to open a PR if you can point me in the right direction to add this kind of support (if it sounds like something you'd like to include).
I've made recent changes to the maintenance operations in 4.0 (currently released in beta) which should resolve what @elmigranto originally requested here, where multiple master nodes are started at different times and you don't want to worry about which instance ends up running monitoring commands. The only remaining race condition is for schema migrations, which I'm in the process of limiting concurrency to 1 instance at a time.
In regards to retention, I was thinking we could add a new config option (ttl, pg interval, date) on `publish()` which we could use in place of the default timestamp used for the retention policy. This would allow some jobs to survive longer in the archive. Does that sound like it would resolve your archive case?
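Purely hypothetical sketch of what that could look like at publish time; the `retention` option name and value are invented for illustration and were not part of pg-boss at the time of this thread:

```js
// Hypothetical: a per-job retention hint passed on publish(), used in place of
// the default timestamp for the archive retention policy.
await boss.publish('important-queue', { orderId: 123 }, {
  retention: '7 days' // e.g. a pg interval applied when the job is archived
});

// Less important jobs fall back to the library-wide default retention.
await boss.publish('throwaway-queue', { ping: true });
```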
You're suggesting adding a new column to the job and archive table that would customize retention on a per-job basis? I think that would solve my use case.
I was thinking in terms of queues/topics, but setting it in the job configuration on publish makes sense too.
A pg interval sounds nice, but would require two columns to interpret (`archivedOn` + retention), unless the interval was used on insert to create a date value?
Yes, I think it would end up being a calculated value that would result in a timestamp column to use instead of archivedOn.
The primary reason I don’t want to make anything queue/topic-based is because they are all virtual and I would have to introduce a new state persistence mechanism to track it, along with its own archive and retention policies.
Also, I published a 4.0.0-beta2 release with multi-master support for schema migrations, which should finally address all the race conditions and complexity involved with running multiple instances simultaneously.
I am looking into integrating this awesome lib (thanks, Tim and everyone involved!) into yet another project, but I have some concerns about the way I'm planning to set up table monitoring.
The Problem
No good way to select a master server which would be running `pgboss.start()`. All our instances are the same and we would prefer to keep it that way. Obviously, running N-1 "extra" monitoring queries and everything related to that is not ideal, and coming up with a "master-selecting" protocol and monitoring that makes me dizzy :)

Solution (?)

Have a periodic cronjob fire via an internal HTTP call that's guaranteed to hit just one node, run `pgboss.start()` on it, wait for the thing to do its job, and `pgboss.stop()` it after (see the sketch at the end of this post).

Concerns

- `setTimeout(someMinutes, guessItIsProbablyDoneByNow)`? My understanding is that `await pgboss.start()` only takes care of schema?
- `pgboss.connect()` would take care of all the actual scheduling/processing, which is good to go on all the nodes at the same time without any additional coordination?

Once again, Tim, thanks for working on pg-boss and making it available to everyone ♥️ @timgit
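A minimal sketch of the cronjob-over-HTTP idea described under "Solution (?)" above, assuming an Express app and the pg-boss 3.x API. The route path, port, and environment variable are placeholders, and, as the concerns note, `start()` does not await the upkeep queries themselves, so this is best-effort:

```js
const express = require('express');
const PgBoss = require('pg-boss');

const app = express();

// Hit by an internal cron scheduler that is guaranteed to target one node.
app.post('/internal/pgboss-upkeep', async (req, res) => {
  const boss = new PgBoss(process.env.DATABASE_URL);
  try {
    await boss.start(); // ensures the schema and kicks off the supervise queries
    // start() does not return the supervise promise, so stopping immediately
    // may interrupt in-flight upkeep; a short delay is a crude workaround.
    await new Promise(resolve => setTimeout(resolve, 30 * 1000));
    await boss.stop();
    res.sendStatus(204);
  } catch (err) {
    res.status(500).send(err.message);
  }
});

app.listen(3000);
```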