ysbaddaden opened 1 year ago
Note: I only noticed this crash for the specific `GenServer.call` for `:has_delivery_confirmation?`. I'd assume the same would happen for other calls (casts aren't affected).
Maybe this is related to the channel-broker GC hogging the process while checking for active calls against the remote channel? We may have up to "channel capacity" pending contacts to check, which may hog the channel-broker process for over 5s :thinking:
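To illustrate the hypothesis: a single slow `handle_call` (standing in for the GC loop) keeps the GenServer busy, so every queued caller blows its call timeout and exits with `{:timeout, {GenServer, :call, ...}}`. This is a minimal sketch with made-up module names; the sleep simulates the remote-channel check, and the 300 ms timeout stands in for the 5_000 ms default seen in the crash.

```elixir
defmodule SlowBroker do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(nil), do: {:ok, nil}

  @impl true
  def handle_call(:gc, _from, state) do
    # Simulates checking "channel capacity" pending contacts against the
    # remote channel: while this runs, no other call can be served.
    Process.sleep(1_000)
    {:reply, :ok, state}
  end

  def handle_call(:has_delivery_confirmation?, _from, state) do
    {:reply, true, state}
  end
end

{:ok, _pid} = SlowBroker.start_link(nil)
spawn(fn -> GenServer.call(SlowBroker, :gc, 5_000) end)
Process.sleep(100)

# This call is queued behind :gc; with the default 5_000 ms timeout
# (shortened to 300 ms here) the caller exits, just like in the report.
try do
  GenServer.call(SlowBroker, :has_delivery_confirmation?, 300)
catch
  :exit, {reason, _call} -> IO.puts("caught #{inspect(reason)}")
end
```

The crash surfaces in the *caller* (the Phoenix request process / survey-broker), not in the GenServer itself, which matches the Ranch stacktrace below.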
It happened again a few times right when a time window end got reached for a survey:

```
surveda-app-1 | 2023-06-06T14:00:31.693Z [error] Ranch listener AskWeb.Endpoint.HTTP, connection process #PID<0.25757.5>, stream 1 had its request process #PID<0.25758.5> exit with reason {{:timeout, {GenServer, :call, [#PID<0.9966.1>, {:has_delivery_confirmation?}, 5000]}}, {AskWeb.Endpoint, :call, [%Plug.Conn{adapter: {Plug.Cowboy.Conn, :...}, assigns: %{}, body_params: %Plug.Conn.Unfetched{aspect: :body_params}, cookies: %Plug.Conn.Unfetched{aspect: :cookies}, halted: false, host: "app.surveda.lvh.me", method: "POST", owner: #PID<0.25758.5>, params: %Plug.Conn.Unfetched{aspect: :params}, path_info: ["callbacks", "verboice"], path_params: %{}, port: 80, private: %{}, query_params: %Plug.Conn.Unfetched{aspect: :query_params}, query_string: "respondent=508321", remote_ip: {172, 18, 0, 7}, req_cookies: %Plug.Conn.Unfetched{aspect: :cookies}, req_headers: [{"connection", "close"}, {"content-length", "13"}, {"content-type", "application/x-www-form-urlencoded"}, {"host", "app.surveda.lvh.me"}, {"te", ""}], request_path: "/callbacks/verboice", resp_body: nil, resp_cookies: %{}, resp_headers: [{"cache-control", "max-age=0, private, must-revalidate"}], scheme: :http, script_name: [], secret_key_base: nil, state: :unset, status: nil}, []]}} and stacktrace [{Phoenix.Endpoint.Cowboy2Handler, :exit_on_error, 4, [file: 'lib/phoenix/endpoint/cowboy2_handler.ex', line: 95]}, {Phoenix.Endpoint.Cowboy2Handler, :init, 4, [file: 'lib/phoenix/endpoint/cowboy2_handler.ex', line: 57]}, {:cowboy_handler, :execute, 2, [file: '/deps/cowboy/src/cowboy_handler.erl', line: 37]}, {:cowboy_stream_h, :execute, 3, [file: '/deps/cowboy/src/cowboy_stream_h.erl', line: 306]}, {:cowboy_stream_h, :request_process, 3, [file: '/deps/cowboy/src/cowboy_stream_h.erl', line: 295]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]
```
It happened while there were multiple `/callbacks/verboice/status/.../...` calls. Definitely seems related to expiration in the channel broker.
Oh my, we went from the usual:

```
2023-06-06T14:00:08.222Z request_id=F2X0oNvMYOmvFckAGU_h [info] Sent 200 in 123ms
```

to some:

```
2023-06-06T14:00:13.214Z request_id=F2X0oOIpi37f_OcAB-Rj [info] Sent 200 in 5057ms
```

then lots of:

```
2023-06-06T14:00:26.064Z request_id=F2X0oOeMyOTd2KUAGVAh [info] Sent 200 in 17942ms
```
The issue happened on staging: https://instedd.sentry.io/issues/4265055830/ (`has_delivery_confirmation?`) along with a similar one, https://instedd.sentry.io/issues/4265390854/ (`has_queued_message?`).
While running two surveys locally, one on a Verboice channel and another on a Nuntium channel, I regularly noticed crashes in the survey-broker. These didn't happen when a single survey was running.
Connecting an `iex` shell to the server and running commands manually (e.g. `Ask.Runtime.ChannelBroker.has_delivery_confirmation?(1)`), I noticed that the channel-brokers sometimes replied quickly, but sometimes took seconds to respond, which might explain why the crash happens in the survey-broker while the channel-broker itself is fine (the GenServer simply took too long to reply).