martinos opened this issue 4 years ago. Status: Open.
Hey Martin,
Can I see your Honeydew.start_queue/2?
Here it is:
Honeydew.start_queue(login,
  queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
  failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
)
I've just changed the times param to 15 and restarted the server. It was 10 prior to that.
Huh, that's quite strange. It's either that the job is delayed by a large amount of time for some reason, or that the queue's :finally option didn't abandon the job correctly.
Can you show me the result of:
:mnesia.activity(:async_dirty, fn ->
  :mnesia.foldl(fn wrapped_job_record, list ->
    [wrapped_job_record | list]
  end, [], :"honeydew_\"my_queue\"")
end)
That should show us when the job is meant to run next; it includes the delay information.
Oh, and also, the result of System.system_time(:millisecond) when you run the above, please.
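For convenience, something like this should print both at once (it's just the two calls above combined):
:mnesia.activity(:async_dirty, fn ->
  :mnesia.foldl(fn wrapped_job_record, list ->
    [wrapped_job_record | list]
  end, [], :"honeydew_\"my_queue\"")
end)
|> IO.inspect(label: "wrapped jobs")

System.system_time(:millisecond)
|> IO.inspect(label: "system time (ms)")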
@koudelka, the job got processed at 17h47, about 3 hours after the last retry. So I can't give you more information, since the command you asked me to run returned an empty result. I am planning to test my app in production on Monday; I will probably hit the issue again, so I will be able to give you more information then.
I am running the app on my dev computer and I am having the issue. When I run the :mnesia.foldl snippet, I get the following:
[
{:wrapped_job, {-576413803, -576460752303419643},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 8191,
enqueued_at: 1581829711489,
failure_private: 14,
from: nil,
job_monitor: nil,
private: -576460752303419643,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576415451, -576460752303419515},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829707732,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303419515,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576418152, -576460752303420214},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829872436,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303420214,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576418152, -576460752303421375},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 4095,
enqueued_at: 1581829868556,
failure_private: 13,
from: nil,
job_monitor: nil,
private: -576460752303421375,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}},
{:wrapped_job, {-576419177, -576460752303419579},
%Honeydew.Job{
by: nil,
completed_at: nil,
delay_secs: 2047,
enqueued_at: 1581829714651,
failure_private: 12,
from: nil,
job_monitor: nil,
private: -576460752303419579,
queue: "my_queue",
result: nil,
started_at: nil,
task: {"send_msg",
[
%Smsc.Mt.Payload{
delivery_method: nil,
delivery_url: nil,
dest: nil,
login: nil,
msg: nil,
msg_id: nil,
orig: nil,
smsc_id: nil
}
]}
}}
]
I think that if a job cannot be processed at the specified time, it will never be processed.
I configured a delayed job that would run in 1 minute, then shut the application down for 2 minutes right afterwards. The job never got picked up, so it stayed in the queue.
Maybe my problem is related to this. My theory is that a job that failed was postponed to a later time; at that later time the workers were busy, so the failed job missed the moment it was supposed to be processed and never got processed.
@koudelka, I still have a few kinds of stuck-job issues. I am just wondering in what order jobs are processed. Does the system process them in enqueued_at + delay_secs order?
How can I list the retried jobs?
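I guess I could filter the fold output myself. A rough sketch, assuming failure_private holds the retry count (as it appears in the dump above):
retried_jobs =
  :mnesia.activity(:async_dirty, fn ->
    :mnesia.foldl(fn {:wrapped_job, _run_at_key, %Honeydew.Job{} = job}, acc ->
      # keep only jobs that have already failed at least once
      if job.failure_private, do: [job | acc], else: acc
    end, [], :"honeydew_\"my_queue\"")
  end)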
I have 18 jobs stuck in the queue. I don't understand why, because the queue is still processing jobs even though I have only one worker configured, so it doesn't seem to be related to the worker.
I am pretty sure the issue was caused by a shutdown.
Are you able to create a stand-alone reproduction of the issue?
I can't understand how the delay works. If I run the following code in the iex console:
defmodule Worker do
  @behaviour Honeydew.Worker

  def hello(thing) do
    IO.puts("Processing #{thing}")
    :timer.sleep(5000)
    IO.puts("Completed #{thing |> inspect}!")
  end
end

defmodule App do
  def start do
    nodes = [node()]

    :ok =
      Honeydew.start_queue(:my_queue,
        queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
        failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2}
      )

    :ok = Honeydew.start_workers(:my_queue, Worker, num: 1)
  end
end

App.start()

{:hello, [1]} |> Honeydew.async(:my_queue, delay_secs: 30)
{:hello, [2]} |> Honeydew.async(:my_queue, delay_secs: 40)
{:hello, [3]} |> Honeydew.async(:my_queue, delay_secs: 50)
And if I shut down the app right after the last command and restart it 2 minutes later, it takes a while before the jobs get processed.
When you say that you "shut down the app", do you mean that you terminated the entire BEAM, or you shut down the honeydew application?
The entire BEAM, ctrl-c twice in the iex console.
Honeydew uses the BEAM's monotonic clock (an arbitrary time since the OS process was started); it doesn't use wall clock time, since that's prone to skews. The monotonic clock keeps ticking when the BEAM time-warps, and hence keeps delays working properly while the system is running, rather than optimizing for the case where the BEAM is restarted often.
I believe you're seeing this behaviour because the monotonic clock value + the delay value is persisted in mnesia and then read again when the system restarts (even though that clock value is meaningless across restarts).
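Roughly, the difference looks like this (the values are just illustrative):
# wall clock: comparable across restarts, but can be skewed/corrected (e.g. by NTP)
System.system_time(:millisecond)
#=> 1581829711489

# monotonic clock: an offset from an arbitrary point chosen at VM start,
# so a persisted value is meaningless to the next BEAM instance
System.monotonic_time(:second)
#=> -576460750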
I'm going to dig in a bit more and get back to you.
Just for reference, since you mentioned in a previous issue that you were new to elixir: http://erlang.org/doc/apps/erts/time_correction.html
:)
I've just pushed a fix to the reset-mnesia-run-at branch, want to give that a spin and let me know how it goes?
Thanks @koudelka, job execution is now repeatable; however, it seems that job delays are now counted from the time the application starts.
I am not sure that's the best solution, since in my case some retries can have a 32738-second delay. In those cases, if I shut down the application 1 second before the end of that delay and restart the app, the job will only get executed after another 32738 seconds.
Does this make sense to you?
Yep, that makes sense, thanks for pointing that out.
I've just pushed a commit to reset their delay times. Give 'er a spin. :)
If I run the following in the console:
enqueue = fn data, delay ->
  opts =
    case delay do
      0 -> []
      a -> [delay_secs: a]
    end

  {:hello, ["#{data |> inspect} enqueued at #{NaiveDateTime.local_now()}, delay: #{delay}"]}
  |> Honeydew.async(:my_queue, opts)
end

1 |> enqueue.(30)
2 |> enqueue.(40)
3 |> enqueue.(50)
And then I restart iex. I get the following:
Application started at ~N[2020-02-20 16:21:12], queue count 3 before starting the workers
Worker started
Processing "3 enqueued at 2020-02-20 16:20:57, delay: 50", processing at: ~N[2020-02-20 16:21:12]
Completed "3 enqueued at 2020-02-20 16:20:57, delay: 50"!
Processing "1 enqueued at 2020-02-20 16:20:57, delay: 30", processing at: ~N[2020-02-20 16:21:17]
Completed "1 enqueued at 2020-02-20 16:20:57, delay: 30"!
Processing "2 enqueued at 2020-02-20 16:20:57, delay: 40", processing at: ~N[2020-02-20 16:21:22]
Completed "2 enqueued at 2020-02-20 16:20:57, delay: 40"!
If you look at Job 3, it was started at 16:21:17. I think it should have been run at 16:21:47.
Note that I have only one worker.
That looks about right to me. Since Honeydew isn't using wall clock time, only the monotonic clock, it has no concept of the time between BEAM instances.
That means Honeydew doesn't know how much time passed between when the BEAM was shut down and when it was started again, so it can't know whether the delay interval has already lapsed, so it just assumes that all jobs need to be run.
Does that make sense?
I'm wondering if we can use the enqueued_at and delay_secs fields to recalculate the remaining time. If the time has lapsed, we can put 0; if not, we can recalculate the delta. There's probably something that I don't understand. I need to read more about the monotonic clock.
Yeah check out the docs I linked above, especially the bit about monotonic clock freezes and absolute time warps.
The choice here is relative time (since the BEAM started) vs absolute wall clock time.
My thinking is that the queue is spending the majority of its time operational, rather than the majority of the time switched off, so that's the case I'd prefer to optimize for.
With absolute wall clock time the worst case scenario is that a machine's clock leaps far into the future, the user enqueues a job, and then the clock corrects back. The job wouldn't run until it reaches that point in time, which would lead to a large service degradation.
With the relative monotonic clock, the worst case scenario is that a delayed job runs too soon after a queue restart.
From my perspective, in the pathological worst case, I'd much rather run a job too soon than not at all. The former is sub-optimal (job ran too early), the latter is a constraint violation (job will never run).
If you do something like:
def reset_run_at(
      %__MODULE__{job: %Job{delay_secs: delay_secs, enqueued_at: enqueued_at}} = wrapped_job
    ) do
  # How many milliseconds remain until the originally scheduled run, by the wall clock.
  # enqueued_at is in milliseconds, delay_secs in seconds.
  remaining_ms = enqueued_at + delay_secs * 1_000 - System.system_time(:millisecond)

  # If the scheduled time has already passed, run immediately; otherwise wait out the rest.
  new_delay_secs = if remaining_ms > 0, do: div(remaining_ms, 1_000), else: 0

  # Re-anchor on the monotonic clock for normal operation.
  run_at = now() + new_delay_secs

  %__MODULE__{wrapped_job | run_at: run_at}
end
It would still use the relative monotonic clock for normal operation, but use the OS clock just at boot time when recalculating the run_at field. I think this is reasonable since the calculation occurs only at boot. I understand that the OS clock can be adjusted by NTP or drift while the application is down, but that's not very likely to cause an issue (I think). It would only affect delayed jobs that survive a reboot.
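For example, taking the first job in the dump above (enqueued_at: 1581829711489, delay_secs: 8191) and an example wall-clock time at boot, the recalculation would go roughly like this:
enqueued_at = 1581829711489               # ms, wall clock, from the dump above
delay_secs  = 8191
boot_ms     = 1581838000000               # example wall-clock time at boot

remaining_ms = enqueued_at + delay_secs * 1_000 - boot_ms
#=> -97_511, i.e. the scheduled time has already passed

new_delay_secs = if remaining_ms > 0, do: div(remaining_ms, 1_000), else: 0
#=> 0, so the job runs right away; a positive value would just wait out the leftover time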
I'd much prefer to just run delayed jobs immediately when the application starts, so honeydew can guarantee that a job will run regardless of bad behaviour by the system clock. If the user would prefer to rely on the system clock, they can easily write that into their jobs.
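e.g., a rough sketch of what "writing it into the job" could look like (the send_at/2 worker function and queue name are made up for illustration): pass the wall-clock target along with the job and have the job itself wait out any remaining time:
defmodule MyWorker do
  @behaviour Honeydew.Worker

  # run_at_ms is a wall-clock timestamp the caller computed when enqueuing;
  # note this ties up a worker while it waits, so it's only a sketch
  def send_at(run_at_ms, payload) do
    remaining = run_at_ms - System.system_time(:millisecond)
    if remaining > 0, do: Process.sleep(remaining)
    IO.inspect(payload, label: "sending")
  end
end

# enqueue a job meant to run ~5 minutes from now, by the wall clock
run_at = System.system_time(:millisecond) + 5 * 60 * 1_000
{:send_at, [run_at, %{msg: "hello"}]} |> Honeydew.async(:my_queue)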
It's not exactly what I hoped for, because sometimes I have hundreds of retries. When the app boots, the jobs will all be retried at the same time. If they fail a second time (which happens in my case), some of them might again run at the same time, disrupting the normal flow.
Personally, I think clocks don't get that far out of sync, and clock issues are mostly due to NTP adjustments, which, if they occur during a shutdown, won't affect the delivery of jobs much after a boot. The big problem with the wall clock is mostly while the app is running.
Also, it's confusing for newcomers to see that jobs are not executed at enqueued_at + delay_secs.
I definitely hear you, I don't think that either end of the spectrum is ideal for all use cases. I do suspect that massively damaging NTP skews should be rare, but I would also like to provide a guarantee rather than a probability.
That being said, what do you think about allowing the user to choose if they want to use the monotonic clock or the wall clock for their queue? That way we can leave it up to the user to decide if the risk tradeoff makes sense for their use case.
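Something like this, maybe (a purely hypothetical :clock option, just to sketch the idea, not an existing API):
Honeydew.start_queue(:my_queue,
  queue: {Honeydew.Queue.Mnesia, [disc_copies: nodes]},
  failure_mode: {Honeydew.FailureMode.ExponentialRetry, times: 15, base: 2},
  # hypothetical: :monotonic (current behaviour) or :wall
  clock: :wall
)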
I do like your hybrid approach, too. Using the monotonic clock during normal operations, but trusting the system clock between BEAM restarts. That'd keep the complexity for the user to a minimum, too.
I started using Honeydew because it was a queue with persistence. I wanted an easy way to restart the application at any time for continuous deployment. But every time I deployed, all failed jobs were retried at the same time on boot. This made the queue unusable in my case, since hundreds of jobs restarting at the same time caused issues with my clients.
It was also surprising to me, since I was not expecting that behavior at all.
The other point I would like to make is that most Elixir systems run 24/7, which is a good argument for using a monotonic clock. If there is a time change (from standard time to daylight time), a scheduled job will still run at the proper time.
The time between restarts is mostly short, which makes a wall-clock shift during that window unlikely. And if it does happen during a time change (from daylight time to standard time), the person will probably make the link between the events. Even then, the scheduled job will be executed anyway, maybe an hour too soon or too late, but it will run.
I have made a fork of the project that uses the wall clock to recalculate the run_at field on restart, and everything has worked well so far; I have redeployed at least 20 times since then without any issue. This solution works well for me, and I think it would be a good one because people won't need to know about the monotonic clock and the wall clock (which was my case before using this project).
So I think it would be better to use the monotonic clock for job scheduling, but the wall clock on restart.
Hi,
What is the status of this issue? I run version 1.5.0 and I still have this problem. I have an Mnesia queue, and when I shut down the application (^C^C in iex) with entries in the queue, they remain in the queue when I restart the application. New entries are processed immediately, but the ones that came from disk are not, until after some seemingly random time. At some point (after more than 30 minutes) they get processed, but I have no clue what finally triggers them.
Is there perhaps a way to instruct the queue to force processing of the enqueued entries?
UPDATE: after applying the changes from reset-mnesia-run-at the jobs from disk get processed after the queue is resumed. Maybe someone can make a new release so this can be used by everyone?
@heinsaris, I have created a pull request that fixes the issues I was having. Note that if you want to use this branch (git@github.com:ISPT/honeydew.git), you'll need to drain your queues before shutdown, since I have modified the mnesia job schema. The approach of the reset-mnesia-run-at branch does not correctly estimate the time at which the original job was scheduled.
@martinos Thanks for your reply. I have forked the master branch and applied the changes from the reset-mnesia-run-at branch, which seems to work for my case. I don't need delays or scheduled jobs. I use the queue to send packets to an MQTT broker. As soon as the connection to the broker disappears, I suspend the queue, and I "unsuspend" it when the connection is back. I need all packets to arrive at the broker in order, so I only use one worker.
You can find my fork here: github.com/heinsaris/honeydew
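For reference, the suspend/resume part is just Honeydew's queue suspension, roughly like this in my connection handler (the queue name here is made up):
# when the connection to the MQTT broker drops
Honeydew.suspend(:mqtt_queue)

# when the connection comes back
Honeydew.resume(:mqtt_queue)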
@heinsaris, I think my patch is irrelevant to your case, but with exponential retries the reset-mnesia-run-at branch was not working as expected.
When a retry is made, a delay is added to the job. When the application restarts (with that branch), that delay is added to the current time, moving the execution to a later time. If you have multiple jobs with the same delay (that failed the same number of times), they will all run simultaneously, which in my case made the workers run tons of failed jobs at once, which in turn was blasting my customers' APIs.
My branch recalculates the original retry time on restart. That change spreads out the execution of the jobs in a more natural way.
@martinos Thanks for the clarification. I will have a look at your solution, but as you said it is not relevant in my case. I suspend the queue as soon as the connection to the MQTT broker drops, and normally my app never shuts down. It is only in the rare case that something goes wrong and the app stops while the MQTT connection is down that I need the packets saved on disk. But in normal circumstances the queue is always empty.
I have an issue with a Flow/stream and a hanging job, with symptoms like the ones above. I get no results from :mnesia.activity/2 calls. I can see the worker with the job in a pool of 16 workers, but I can't even use a filter to surface any job the way @martinos demonstrated earlier. It simply shows up in the status output as the only job :in_progress. I even ran the job in isolation and it still has the issue. My hope is that setting the +C multi_time_warp flag in our vm.args file for Distillery will solve the issue. I will report my findings today.
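(In vm.args that's just the emulator flag mentioned above:)
## vm.args
+C multi_time_warp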
UPDATE: What I found is that during a refactor I missed moving a File.open/2 call to the Flow.reduce/3 start_fun call. So in my case this just means I didn't realize how many "empty file events" were occurring with this new "reporting date cutoff". Post-refactor, I can see that inotifywait (now that I open/close the file on each chunk I stream into it after it gets piped to our file-reduction Flow) is still processing these events, which is seemingly creating an I/O block somewhere in our file watch. I think it's just a by-product of using Honeydew.yield/2 in our file-watch GenServer implementation. Hmmm... I really appreciate you, @koudelka, for spending time on this issue, as it has created transparency into my own.
I have a job that was retried multiple times, but it got stuck in the queue.
Here is the job:
My workers seem to continue processing other jobs on that queue.
Here is my queue status: