
InSTEDD Surveda
https://instedd.org/technologies/surveda-mobile-surveys/
GNU General Public License v3.0

ChannelBroker: regular GenServer.call timeouts #2285

Open · ysbaddaden opened this issue 1 year ago

ysbaddaden commented 1 year ago

While running two surveys locally, one on a Verboice channel and another on a Nuntium channel, I regularly noticed crashes in the survey-broker. These didn't happen when a single survey was running.

surveda-app-1  | 2023-06-06T09:16:25.015Z [error] GenServer Ask.Runtime.SurveyBroker terminating
surveda-app-1  | ** (stop) exited in: GenServer.call(#PID<0.10920.2>, {:has_delivery_confirmation?}, 5000)
surveda-app-1  |     ** (EXIT) time out
surveda-app-1  |     (elixir 1.10.4) lib/gen_server.ex:1023: GenServer.call/3
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/session.ex:638: Ask.Runtime.Session.log_prompts/6
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/session.ex:423: Ask.Runtime.Session.mode_start/1
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/survey_broker.ex:302: Ask.Runtime.SurveyBroker.start/2
surveda-app-1  |     lib/mut.ex:262: Mutex.apply_with_lock/3
surveda-app-1  |     (elixir 1.10.4) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
surveda-app-1  |     (elixir 1.10.4) lib/enum.ex:783: Enum.each/2
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/survey_broker.ex:177: Ask.Runtime.SurveyBroker.poll_survey/3
surveda-app-1  |     (elixir 1.10.4) lib/enum.ex:783: Enum."-each/2-lists^foreach/1-0-"/2
surveda-app-1  |     (elixir 1.10.4) lib/enum.ex:783: Enum.each/2
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/survey_broker.ex:118: Ask.Runtime.SurveyBroker.poll_active_surveys/1
surveda-app-1  |     (ask 0.31.0) lib/ask/runtime/survey_broker.ex:52: Ask.Runtime.SurveyBroker.handle_info/3
surveda-app-1  |     (stdlib 3.12.1.2) gen_server.erl:637: :gen_server.try_dispatch/4
surveda-app-1  |     (stdlib 3.12.1.2) gen_server.erl:711: :gen_server.handle_msg/6
surveda-app-1  |     (stdlib 3.12.1.2) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
surveda-app-1  | Last message: :poll
surveda-app-1  | State: nil
surveda-app-1  | 2023-06-06T09:16:25.016Z [info] SurveyBroker started. Default batch size=10000. Limit per minute=100.
(...)
surveda-app-1  | 2023-06-06T09:16:25.598Z [info] MyXQL.Connection (#PID<0.483.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.16036.2> exited

Connecting an iex shell to the server and running commands manually (e.g. Ask.Runtime.ChannelBroker.has_delivery_confirmation?(1)), I noticed that the channel-brokers sometimes replied quickly but sometimes took several seconds to respond. That would explain why the crash happens in the survey-broker while the channel-broker itself is fine: the GenServer simply took longer than the 5s call timeout to reply.
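
For reference, those response times can be measured directly from the iex shell with :timer.tc (a quick sketch, assuming channel id 1 as in the example above):

# Time one call to the channel broker; :timer.tc/1 returns
# {elapsed_microseconds, result}.
{elapsed_us, result} =
  :timer.tc(fn -> Ask.Runtime.ChannelBroker.has_delivery_confirmation?(1) end)

IO.puts("replied #{inspect(result)} in #{div(elapsed_us, 1000)}ms")

Anything consistently at or above 5000ms will blow GenServer.call's default timeout.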

ysbaddaden commented 1 year ago

Note: I only noticed crashes for this specific GenServer.call (:has_delivery_confirmation?), but I'd assume the same would happen for any other call. Casts aren't affected, since the caller never waits for a reply.
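
For context, this is inherent to the call/cast distinction: GenServer.call/3 blocks the caller and exits it once the timeout (5000ms by default) elapses, while GenServer.cast/2 returns immediately. A minimal sketch with a hypothetical SlowServer (not Surveda code):

defmodule SlowServer do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(state), do: {:ok, state}

  # A handler that takes longer than the caller's 5s default timeout.
  @impl true
  def handle_call(:slow, _from, state) do
    Process.sleep(6_000)
    {:reply, :ok, state}
  end

  @impl true
  def handle_cast(:slow, state) do
    Process.sleep(6_000)
    {:noreply, state}
  end
end

{:ok, pid} = SlowServer.start_link([])

# The cast returns :ok immediately; the caller never waits.
GenServer.cast(pid, :slow)

# The call blocks the caller and, after 5s, exits *the caller* with
# {:timeout, {GenServer, :call, ...}} -- the server itself keeps running.
GenServer.call(pid, :slow)

Note that after the cast, the server is busy for 6s, so the subsequent call times out while its message is still sitting in the mailbox; that's exactly what a busy channel-broker looks like from the survey-broker's side.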

ysbaddaden commented 1 year ago

Maybe this is related to the channel-broker GC hogging the process while checking for active calls against the remote channel? We may have up to "channel capacity" pending contacts to check, which could keep the channel-broker process busy for well over 5s :thinking:
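
If that's the cause, one possible mitigation (a sketch only, with invented names: Ask.TaskSupervisor, remote_call_active?/1, pending_contacts, :collect_garbage) would be to move the remote checks out of the broker's message loop:

# Hypothetical sketch: run the per-contact remote checks in a Task
# instead of inline, so the broker's mailbox keeps draining and
# GenServer.call requests don't pile up behind the GC pass.
def handle_info(:collect_garbage, state) do
  pending = state.pending_contacts

  Task.Supervisor.async_nolink(Ask.TaskSupervisor, fn ->
    {:gc_result, Enum.filter(pending, &remote_call_active?/1)}
  end)

  {:noreply, state}
end

# The Task reports back as a regular message; fold the result into state.
def handle_info({ref, {:gc_result, still_active}}, state) when is_reference(ref) do
  Process.demonitor(ref, [:flush])
  {:noreply, %{state | pending_contacts: still_active}}
end

# If the Task crashes, just skip this GC pass.
def handle_info({:DOWN, _ref, :process, _pid, _reason}, state), do: {:noreply, state}

The broker would then keep answering calls while the GC pass runs; the trade-off is that the pending list is briefly stale.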

ysbaddaden commented 1 year ago

It happened again a few times, right when a survey's time window ended:

surveda-app-1  | 2023-06-06T14:00:31.693Z [error] Ranch listener AskWeb.Endpoint.HTTP, connection process #PID<0.25757.5>, stream 1 had its request process #PID<0.25758.5> exit with reason
surveda-app-1  |   {{:timeout, {GenServer, :call, [#PID<0.9966.1>, {:has_delivery_confirmation?}, 5000]}},
surveda-app-1  |    {AskWeb.Endpoint, :call, [%Plug.Conn{adapter: {Plug.Cowboy.Conn, :...}, assigns: %{}, body_params: %Plug.Conn.Unfetched{aspect: :body_params}, cookies: %Plug.Conn.Unfetched{aspect: :cookies}, halted: false, host: "app.surveda.lvh.me", method: "POST", owner: #PID<0.25758.5>, params: %Plug.Conn.Unfetched{aspect: :params}, path_info: ["callbacks", "verboice"], path_params: %{}, port: 80, private: %{},
surveda-app-1  |      query_params: %Plug.Conn.Unfetched{aspect: :query_params}, query_string: "respondent=508321", remote_ip: {172, 18, 0, 7}, req_cookies: %Plug.Conn.Unfetched{aspect: :cookies}, req_headers: [{"connection", "close"}, {"content-length", "13"}, {"content-type", "application/x-www-form-urlencoded"}, {"host", "app.surveda.lvh.me"}, {"te", ""}], request_path: "/callbacks/verboice", resp_body: nil, resp_cookies: %{},
surveda-app-1  |      resp_headers: [{"cache-control", "max-age=0, private, must-revalidate"}], scheme: :http, script_name: [], secret_key_base: nil, state: :unset, status: nil}, []]}}
surveda-app-1  |   and stacktrace
surveda-app-1  |     [{Phoenix.Endpoint.Cowboy2Handler, :exit_on_error, 4, [file: 'lib/phoenix/endpoint/cowboy2_handler.ex', line: 95]},
surveda-app-1  |      {Phoenix.Endpoint.Cowboy2Handler, :init, 4, [file: 'lib/phoenix/endpoint/cowboy2_handler.ex', line: 57]},
surveda-app-1  |      {:cowboy_handler, :execute, 2, [file: '/deps/cowboy/src/cowboy_handler.erl', line: 37]},
surveda-app-1  |      {:cowboy_stream_h, :execute, 3, [file: '/deps/cowboy/src/cowboy_stream_h.erl', line: 306]},
surveda-app-1  |      {:cowboy_stream_h, :request_process, 3, [file: '/deps/cowboy/src/cowboy_stream_h.erl', line: 295]},
surveda-app-1  |      {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 249]}]

It happened while there were multiple /callbacks/verboice/status/.../... calls in flight. It definitely seems related to expiration in the channel broker.
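
A side note on that crash: this time it's the web request process (not the survey-broker) that dies, because the Verboice callback handling does the same GenServer.call. One way to shield the endpoint would be to catch the exit (a hypothetical wrapper, not the actual controller code):

# Hypothetical: degrade a busy broker to an error tuple instead of
# letting the Cowboy request process crash on the caller's exit.
defp safe_has_delivery_confirmation?(channel_id) do
  try do
    {:ok, Ask.Runtime.ChannelBroker.has_delivery_confirmation?(channel_id)}
  catch
    :exit, {:timeout, {GenServer, :call, _}} -> {:error, :timeout}
  end
end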

ysbaddaden commented 1 year ago

Oh my, we went from the usual:

2023-06-06T14:00:08.222Z request_id=F2X0oNvMYOmvFckAGU_h [info] Sent 200 in 123ms

to some:

2023-06-06T14:00:13.214Z request_id=F2X0oOIpi37f_OcAB-Rj [info] Sent 200 in 5057ms

then lots of:

2023-06-06T14:00:26.064Z request_id=F2X0oOeMyOTd2KUAGVAh [info] Sent 200 in 17942ms

ysbaddaden commented 1 year ago

The issue happened on staging: https://instedd.sentry.io/issues/4265055830/ (has_delivery_confirmation?) along with a similar one https://instedd.sentry.io/issues/4265390854/ (has_queued_message?).