mbuhot / ecto_job

Transactional job queue with Ecto, PostgreSQL and GenStage
https://hexdocs.pm/ecto_job/
MIT License

Base expiry constant is not clear #16

Closed mmartinson closed 6 years ago

mmartinson commented 6 years ago

Hey. So I've been reading through your library to determine if it's appropriate to use in a production codebase that needs to handle fairly modest load but with strong safety guarantees. There's one piece of behaviour that I didn't understand exactly: the 300s base expiry value, where it comes from, whether it makes sense to have it configurable, and how it interacts with the rest of the processing lifecycle.

I understand that the 300s RESERVED expiry is the maximum time between when the job is dispatched from the Producer and when the Worker must begin processing it. For this use case it seems like an arbitrary but reasonable value, if a bit high, though it's not obvious what the failure conditions would be in GenStage or when it would make sense to retry earlier.

For the IN_PROGRESS expiry during attempt 1, which would be 300s, is it correct that the job would only be retried on the next poll after the expiry has passed, assuming it had failed?

The reason I ask is that 300s seems too long for a first retry attempt. As far as I can understand, the effective delay before a retry is governed by whichever is larger: the expiry or the wait until the next poll after it passes. Does it make sense to you @mbuhot to make this configurable as well? Can you think of any unintended consequences of setting it to, say, 10s, other than burning through the retry attempts more quickly?
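
To make the timing question concrete, here's a rough back-of-the-envelope sketch of how I understand the expiry and the poll interval interacting (all names and values below are illustrative only, not ecto_job internals):

```elixir
# Back-of-the-envelope sketch of the retry timing described above.
# All names and values here are illustrative assumptions, not ecto_job internals.
in_progress_expiry = 300  # seconds until an IN_PROGRESS job is presumed lost
poll_interval = 60        # seconds between producer polls

# A job whose worker dies right after moving it to IN_PROGRESS becomes
# eligible for retry once the expiry has passed...
eligible_after = in_progress_expiry

# ...but it is only picked up on the first poll at or after that moment,
# so the effective first-retry delay is the expiry rounded up to the next poll.
effective_retry_delay = Float.ceil(eligible_after / poll_interval) * poll_interval
# => 300.0 seconds here; with a 10s expiry it would be 60.0 seconds.
```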

mmartinson commented 6 years ago

17

mbuhot commented 6 years ago

Hi @mmartinson thanks for taking the time to look over the code.

The RESERVED expiry can certainly be shorter. It appears to be safe to let a reservation expire: the job will only be moved to the IN_PROGRESS state by a single worker.
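
To illustrate that guarantee, a compare-and-swap style update in the spirit of the handoff might look something like the sketch below; the schema module, field names and state strings are assumptions for this example, not necessarily the exact ones used internally:

```elixir
# Illustrative sketch only: claim a RESERVED job for exactly one worker.
# MyApp.JobQueue, the :state/:expires fields and the state strings are
# assumptions for this example, not guaranteed to match ecto_job's schema.
import Ecto.Query

# `job` is the row a worker is attempting to claim.
now = DateTime.utc_now()

{claimed, _} =
  MyApp.Repo.update_all(
    from(j in MyApp.JobQueue,
      where: j.id == ^job.id and j.state == "RESERVED" and j.expires >= ^now
    ),
    set: [state: "IN_PROGRESS", expires: DateTime.add(now, 300, :second)]
  )

# claimed == 1 for exactly one worker; any competing worker sees 0 and skips the job.
```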

> For the IN_PROGRESS expiry during attempt 1, which would be 300s, is it correct that the job would only be retried on the next poll after the expiry has passed, assuming it had failed?

Yes this is correct.

> The reason I ask is that 300s seems too long for a first retry attempt. Can you think of any unintended consequences of setting it to, say, 10s, other than burning through the retry attempts more quickly?

The 300s IN_PROGRESS expiry is a conservative value because the clock starts as soon as the worker moves the job to IN_PROGRESS. If it were as small as 10s, it would become more likely that the job is retried on another node while a worker is still executing it.

It might be useful to add a separate concept of retry_delay that applies when we know that a job has failed, e.g. because an exception is thrown from perform, leaving the IN_PROGRESS expiry to apply only if the process/BEAM crashes without a chance to update the job record in the database.
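
A minimal sketch of how those two timeouts could play different roles (purely hypothetical names, nothing here exists in ecto_job today):

```elixir
defmodule RetryDelaySketch do
  # Hypothetical illustration of the retry_delay idea above; none of these
  # names or values are part of ecto_job.

  @retry_delay_seconds 10
  @in_progress_expiry_seconds 300

  # perform raised an exception, so we *know* the job is no longer running:
  # it can safely become eligible again after a short retry_delay.
  def next_attempt_at(:known_failure, now),
    do: DateTime.add(now, @retry_delay_seconds, :second)

  # The worker process or the whole BEAM crashed without updating the record:
  # the job might still be running elsewhere, so wait out the full expiry.
  def next_attempt_at(:worker_lost, now),
    do: DateTime.add(now, @in_progress_expiry_seconds, :second)
end
```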

Either way, the hardcoded 300s should definitely be configurable. 👍
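
If it does become configurable, it might end up looking something like the sketch below; the option names (reservation_expiry, in_progress_expiry, poll_interval) are placeholders to show the shape, not an existing API:

```elixir
# Hypothetical configuration sketch: these option names do not exist in
# ecto_job at the time of writing; they only show where configurable
# expiries could plug into the usual supervision-tree setup.
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {MyApp.JobQueue,
       repo: MyApp.Repo,
       max_demand: 100,
       # how long a RESERVED job may wait before the reservation lapses
       reservation_expiry: :timer.seconds(60),
       # how long an IN_PROGRESS job may run before it is presumed lost
       in_progress_expiry: :timer.seconds(300),
       # how often the producer re-polls for expired or available jobs
       poll_interval: :timer.seconds(30)}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```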

mbuhot commented 6 years ago

Closing this one as it's been inactive for a while.

mmartinson commented 6 years ago

Oops, forgot to close! Thanks.