msolli / proletarian

A durable job queuing and worker system for Clojure backed by PostgreSQL.
MIT License
161 stars 7 forks source link

workers in different timezones #19

Open jsn opened 1 year ago

jsn commented 1 year ago

Running workers on machines in different timezones confuses proletarian because it stores process_at in the db as a local timestamp w/o timezone. Perhaps it would be better if the timestamps were all TZ-normalized (stored as UTC datetimes or something).

msolli commented 1 year ago

How so? Jobs are written and queried using Java Instants, which are oblivious of timezones. Have you observed confusing behavior, and in what way?

jsn commented 1 year ago

select * from proletarian.job shows localized timestamps. Wrt "How so?": the problem is, Instant's are converted to java.sql.Timestamp, and java.sql.Timestamp is later stringified using .toString or something (when the query parameters are applied), and that .toString on Timestamp seems to apply local timezone. (-> (java.time.Instant/now) java.sql.Timestamp/from .toString) to reproduce.

and in what way?

Obviously, process_at behavior is all messed up. If at 12:00 a worker in UTC schedules a retry in 2 hours, it is stored in the db as 14:00. At the same time, for a worker in UTC-2 it's already 14:00, so it will run that retry immediately. And vice versa, if a worker in UTC-2 schedules a retry in 5 minutes, it's stored as 14:05. Then, in 5 minutes, only workers in UTC-2 will see that retry as due; workers in UTC will still think it's too early to run this (by 2 hours). And so on, so forth.

msolli commented 1 year ago

Thanks for the detailed analysis! It really is more complicated than I first thought, precisely because of the Timestamp/toString method you mention. Or rather, it's the org.postgresql.jdbc.TimestampUtils/toString method that is used (by PgPreparedStatement/setTimestamp that Proletarian calls).

I think the right way to handle timestamps in applications and supporting infrastructure is to always assume UTC, and handle timezones at the edges. It is this principle I've failed to apply here, not realising that JDBC is one such edge. JDBC applies the JVM's default timezone if none is supplied. That is why we can observe non-UTC timestamps in the job table.

Fortunately, there is an overload for PreparedStatement/setTimestamp where we can provide a timezone (via a Calendar):

    void setTimestamp(int parameterIndex, java.sql.Timestamp x, Calendar cal)
            throws SQLException;

I'll replace calls to .setTimestamp in proletarian.db with this signature, adding a calendar with the UTC timezone. This will ensure all timestamps are written as in UTC, as will the query in get-next-job.

This solves the original problem that you describe, where workers will pick up jobs too early or too late, depending on their local timezone.

However, users might run into this exact problem during the period when they're upgrading Proletarian to a version that has this fix, even when all their machines are running in the same timezone:

Let's say an application server running current Proletarian in the CEST (UTC+2) timezone enqueues a job at 12:00 UTC. The local time is 14:00, and process_at will be 14:00. At the same time, or just before, the workers are upgraded to a new Proletarian with my proposed fix. The workers will query for jobs with a process_at <= 12:00, while jobs just written by old Proletarian will have 14:00 in process_at. Jobs will be picked up with a two hour delay.

The opposite problem would occur when upgrading Proletarian on machines running in a UTC-minus-something timezone: Retries would be run immediately (like you describe above).

This would be an operational problem for deployments outside UTC only. And the problem would only last the amount of hours that the local timezone of the workers are offset from UTC. Still, it is likely to cause significant problems for affected applications. I'll have to come up with mitigations for these problems during an upgrade before releasing a new version. Any ideas are welcome.

jsn commented 1 year ago

I don't have much to add to your analysis, seems like it's gonna be a somewhat bumpy update. Perhaps the best you can do is to add a big warning to the CHANGELOG (and probably to README), explaining the effects and probably recommending an SQL update statement to move all timestamps from whatever their local timezone is to UTC.