matiasgarciaisaia opened 2 years ago
It looks like a communication issue between Verboice and Surveda: the former failed to capture the MP3 file, timing out. If I understand the code correctly, the Twilio PBX crashed, the session handler noticed it and logged an internal error, but the call state was never closed?
I forgot to mention that I reviewed the issues logged in Sentry and found nothing remotely related, hence my suspicion of a network issue.
Errata: the `capture` function has nothing to do with downloading the MP3 file from Surveda! It's instead trying to capture the respondent's reply through Twilio (for a given question), but it times out.
I believe this is this specific call (which in turn calls this in the actor):
Either Twilio never responds, there was a network error, and/or the `twilio_pbx` actor crashed. In the end, the Broker eventually reaches the 5-minute timeout while waiting for a reply or hangup. It crashes the actor, and this bubbles up to the session.
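For context, the bounded-wait pattern described above can be sketched like this. This is a minimal Python sketch, not the broker's Erlang; the function name, the event queue, and the exception are all assumptions made for illustration:

```python
import queue

CAPTURE_TIMEOUT_S = 5 * 60  # the 5-minute wait mentioned above

class CaptureTimeout(Exception):
    """Raised when no reply/hangup event arrives in time."""

def wait_for_reply(events: "queue.Queue[str]", timeout: float = CAPTURE_TIMEOUT_S) -> str:
    # Block until the PBX delivers a digit press or a hangup event.
    # If nothing arrives, fail loudly so the caller (the session) can
    # finalize the call instead of leaving it "active" forever.
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        raise CaptureTimeout(f"no reply or hangup after {timeout}s")
```

The key point is that the timeout raises rather than returning silently, which is what makes the failure bubble up to the session.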
It ends up calling `session:finalize` (with `{failed, Reason}`), where we properly report the error (`Call failed with reason ~p`), reschedule the call, and update the call log (though I'm not sure what the state value is):
It eventually calls `session:terminate` (I don't know how), where we report the error again (`session (~p) terminated with reason ~p`), but it may also enqueue a delayed job (`CallFlow::FusionTablesPush::Pusher`)? The logic seems identical to `CallLog#finish` in the Ruby model.
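Roughly, the finalize path described above should amount to something like this. A Python sketch of the intended invariant only, not the actual Erlang/Ruby code; the field and function names are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class CallLog:
    # Minimal stand-in for a call_logs row; field names are assumptions.
    state: str = "active"
    finished_at: Optional[datetime] = None
    fail_reason: Optional[str] = None

def finalize(log: CallLog, reason: str) -> None:
    """Rough analogue of session:finalize({failed, Reason}): report the
    error, mark the log finished, and record why. The invariant this bug
    violates is that a failed call must NOT remain in state 'active'."""
    print(f"Call failed with reason {reason!r}")  # mirrors the ~p log line
    log.state = "failed"
    log.fail_reason = reason
    log.finished_at = datetime.now(timezone.utc)
```

If the real update fails partway, the call log keeps `state = "active"` with a `NULL` `finished_at`, which matches the symptom below.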
Now, it seems Verboice considers the call to still be open, even though the Broker properly failed and reported it. Maybe the CallLog state is incorrectly updated in `session:finalize`? What about the delayed job?
I tried hard to reproduce this bug, with no success. I did reproduce the timeout error and got the same logs as in this issue.
My strongest hypothesis is very simple: this update fails.
As a consequence of this failure, in the Mumbai instance, 0.01% of started calls remained "active" when they had actually failed.
```sql
SELECT sum(case when finished_at is null then 1 else 0 end) `unfinished`,
       count(1) `started`,
       sum(case when finished_at is null then 1 else 0 end) / count(1) * 100 as `percentage`
FROM `call_logs`
WHERE started_at is not null
```

| unfinished | started | percentage |
|------------|---------|------------|
| 237        | 1725747 | 0.0137     |
Huh... is it possible that we retain a MySQL connection which gets closed during the X-minute timeout? And that this is only noticed when we try to update (EPIPE)? Could the connection be missing an auto-reconnect or something?
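If that hypothesis holds, a reconnect-and-retry wrapper around the update would mask the stale connection. A minimal Python sketch of the pattern only; all names are hypothetical and the real fix would live in the broker's DB layer:

```python
class ConnectionLost(Exception):
    """Stand-in for the EPIPE / 'server has gone away' error."""

def with_reconnect(conn_factory, operation, retries: int = 1):
    """Run operation(conn); if the (possibly stale) connection fails,
    reconnect and retry. Without something like this, the call_logs
    UPDATE issued after a long idle wait fails once and the call stays
    'active'."""
    conn = conn_factory()
    for attempt in range(retries + 1):
        try:
            return operation(conn)
        except ConnectionLost:
            if attempt == retries:
                raise
            conn = conn_factory()  # drop the dead socket, open a fresh one
```

The alternative is tuning the server's `wait_timeout` above the broker's longest wait, but an explicit retry is more robust.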
Today we seem to have had a couple of (consistent) occurrences of this issue in STG, triggered by hanging up calls using Callcentric. I'm sharing the logs: verboice_error_2.log verboice_error.log
As outlined by @ggiraldez, as soon as we get the hangup during the Gather operation, we try to update a row of `call_logs`, but the update fails, and the `js_context` column goes from `null` to an Erlang error:
Logged at 9/22/2022 9:53:16 AM:

```erlang
{error,
 {unrecognized_value,
  {#Ref<0.0.931.17542>,
   {dict,1,16,16,8,80,48,
    {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
    {{[],[],[],[],[],[],[],[],[],[],
      [[#Ref<0.0.931.17542>|
        {dict,5,16,16,8,80,48,
         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
         {{[[<<"hub_url">>]],
           [],[],
           [[<<"_get_var">>|#Fun<session.17.60946945>],
            [<<"split_digits">>|#Fun<session.18.60946945>]],
           [],[],
           [[<<"phone_number">>,49,48,49]],
           [[<<"record_url">>|#Fun<session.16.60946945>]],
           [],[],[],[],[],[],[],[]}}}]],
      [],[],[],[],[]}}}}}},
```
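The dump above shows the session context carrying opaque `#Ref` and `#Fun` terms, which can't be meaningfully serialized into a text column, hence the `unrecognized_value` error. For illustration, a persistence guard along these lines would avoid writing such values. A Python sketch; `persistable_context` is a hypothetical helper, not broker code:

```python
import json

def persistable_context(ctx: dict) -> str:
    """Keep only the JSON-serializable entries of a session context.
    Function values and opaque references (the #Fun / #Ref terms in the
    dump above) are dropped rather than written into the column."""
    clean = {}
    for key, value in ctx.items():
        try:
            json.dumps(value)  # probe: can this value be serialized?
        except TypeError:
            continue           # skip funs/refs/other opaque values
        clean[key] = value
    return json.dumps(clean)
```

Whether the right fix is to sanitize on write or to never let funs/refs reach the persisted context is a separate question.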
I reproduced this with the Twilio simulator:
The simulator decided that a bunch of respondents wouldn't pick up the phone (i.e. no reply came in), and reported it by sending the `no-answer` status to Verboice. But maybe Twilio doesn't actually do that, because Verboice didn't care and sent the first question, expecting an answer.
Maybe this is how the "capture timeout" above gets triggered: Twilio reports an error status to the `Url` or `<Redirect>` callback, which Verboice overlooks; Verboice sends the next question in response (which Twilio discards, since it got a 200 status when delivering the status update and is happy); then Verboice waits for an answer that will never come -> timeout during capture.
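A sketch of the guard this hypothesis suggests is missing. `CallStatus` and the status values below are standard Twilio callback parameters, but how Verboice routes them (and the hypothetical handler name) is assumed:

```python
# Terminal call statuses Twilio can report on a callback; once one of
# these arrives, there is no live call to send the next question to.
TERMINAL_STATUSES = {"completed", "busy", "failed", "no-answer", "canceled"}

def handle_status_callback(params: dict) -> str:
    """Inspect CallStatus before emitting the next question. Responding
    with more TwiML to a terminal-status callback is exactly the overlooked
    case hypothesized above."""
    status = params.get("CallStatus", "")
    if status in TERMINAL_STATUSES:
        return f"finalize:{status}"  # close the call log, don't send TwiML
    return "continue-flow"           # call is still live; send next question
```

With this guard, a `no-answer` callback would finalize the call log instead of queuing a question nobody will hear.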
See #778 for a similar issue.
We've observed some calls that stay with state `active` for weeks in Verboice, even though the call finished/was cancelled/errored. The one we've just seen (call in Verboice, broker's logs below) was sent via Twilio (call SID `CA381882b5e5f24f714d0ba28eee084f3c` in the `InSTEDD 4 NCD` project). The error was internal to the broker (maybe a communication issue with Surveda?); there are no errors on the Twilio side. We should check what the error is, and how to properly handle it.
CC: @ggiraldez