instedd / surveda

InSTEDD Surveda
https://instedd.org/technologies/surveda-mobile-surveys/
GNU General Public License v3.0
17 stars 6 forks source link

ChannelBroker process will never timeout #2274

Open ysbaddaden opened 1 year ago

ysbaddaden commented 1 year ago

With the GC revamp in #2263 the channel-broker processes will receive a message every some minutes at worst, so GenServer will never send a :timeout and the channel-broker process will never be discontinued.

We should replace the timeout mecanism. There should be a regular call (e.g. every 30 minutes) to verify whether the channel broker has nothing left to do (empty queue and no active calls) and decide to terminate the process, after making sure to delete the state from the channel-broker-agent.

ysbaddaden commented 1 year ago

Reading the description of GenServer, the timeout will indeed never be reached, because of the regular calls to the GC. We must remove the default timeout handling with GenServer, and implement our own solution.

Solution A:

The handle_info({:collect_garbage}, state) callback can detect when there is nothing to do (no active/queued contacts) and record the time it happened, then on subsequent calls if there is still nothing to do, we can compare the recorded time and eventually terminate the process when the timeout has been reached.

But what if we got messages handled in-between calls to :collect_garbage? Maybe that's not a problem. It's not particularly expensive to start/stop an Erlang process.

Solution B:

An alternative (maybe simpler) solution could be to save the time at which we last saved the state (in ChannelBrokerAgent.save_state/1), and have State.process_timeout/1 return a new timeout based on this recorded time. Even if the :collect_garbage message is sent every 2 minutes, we'd eventually reach a point where the timeout is below those 2 minutes and will happen before the :collect_garbage message is sent.

This requires handle_info({:collect_garbage}) to not save the state when there are no active/queued contacts (to not influence the timeout).

manumoreira commented 8 months ago

@ysbaddaden Can you help evaluating the severity of this bug. What are the risks involved on the broker not timeouting?

ysbaddaden commented 8 months ago

Low.

Unless we have lots of channels on a single instance, which might happen if we start grouping instances. It consumes a bit of memory and a few calls every N minutes for each channel.

It shouldn't be hard to fix.