sangoma / switchy

async FreeSWITCH cluster control
https://switchy.readthedocs.io/en/latest/
Mozilla Public License 2.0

Add periodic stale session and call clearing #14

Closed · goodboy closed this issue 8 years ago

goodboy commented 9 years ago

When running load tests there can often be short drops in the ESL connection. The effect is that events pertaining to certain Call/Session objects are never received, leaving stale entries in the EventListener.calls and sessions maps. This in turn skews the concurrent call count and thus the generated load.

We should probably add a periodic task which evaluates such entries by their most recent event's timestamp and clears them if they appear stale.
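A minimal sketch of the kind of periodic sweep being proposed (`listener.calls` and `call.last_event_time` here are hypothetical stand-ins, not switchy's actual attributes):

```python
import time
import threading

STALE_AFTER = 120  # seconds without an event before an entry is considered stale


def clear_stale(listener, stale_after=STALE_AFTER):
    """Drop Call entries whose most recent event is older than `stale_after`.

    `listener.calls` and `call.last_event_time` are hypothetical stand-ins
    for the EventListener's internal state; switchy's real attributes may
    differ.
    """
    now = time.time()
    for uuid, call in list(listener.calls.items()):
        if now - call.last_event_time > stale_after:
            listener.calls.pop(uuid, None)


def start_sweeper(listener, interval=30):
    """Invoke `clear_stale` every `interval` seconds from a daemon thread."""
    def loop():
        while True:
            time.sleep(interval)
            clear_stale(listener)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread
```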

vodik commented 9 years ago

Would it make more sense to have a counter that reflects actual call status instead? This smells like garbage collection, and maybe you'll end up with similar stalling problems? Is the garbage collector going to be able to run at the same time you're establishing more connections?
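A rough sketch of what such an event-driven counter could look like (hypothetical class, not part of switchy; CHANNEL_CREATE and CHANNEL_HANGUP are standard FreeSWITCH event names):

```python
# Hypothetical sketch, not switchy code: a counter driven directly by
# FreeSWITCH channel events instead of inferring load from map sizes.
class CallCounter:
    def __init__(self):
        self.active = 0

    def on_event(self, event_name):
        if event_name == "CHANNEL_CREATE":
            self.active += 1
        elif event_name == "CHANNEL_HANGUP":
            self.active -= 1
```

Note that a lost CHANNEL_HANGUP would leave this counter inflated too, so by itself it has the same failure mode as the maps.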

moises-silva commented 9 years ago

I agree this is something that probably shouldn't be implemented. If we're getting to the point where we lose the TCP connection with the loader, we should first find out why it is going down (what is causing the TCP connection to drop). We probably also shouldn't be testing at load levels where problems like this appear, since you can no longer trust the information the load-tester cluster gives you.

goodboy commented 9 years ago

Would it make more sense to have a counter that reflects actual call status instead?

You can get the current call and session counts by taking the len of each collection: https://github.com/sangoma/switchy/blob/master/switchy/observe.py#L193 and https://github.com/sangoma/switchy/blob/master/switchy/observe.py#L173.

This smells like garbage collection and maybe you'll end up with similar stalling problems?

It is in the sense that I need to drop the references to the Call and Session objects held in each collection or they'll never be removed. Why? Because when the TCP connection dropped, the CHANNEL_HANGUP events that should have triggered their removal were never received. I'm then left with stale entries which skew the load polling done by the call generator. For example, if 500 calls were left stale in the calls map, the call generator will read 1000 calls when only 500 are truly active.
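As a toy illustration of the skew (plain Python, not switchy code):

```python
# Toy illustration, not switchy code: the generator polls load via len(),
# so entries whose CHANNEL_HANGUP was never received inflate the count.
calls = {}  # stands in for EventListener.calls
calls.update({"active-%d" % i: object() for i in range(500)})
calls.update({"stale-%d" % i: object() for i in range(500)})  # hangups lost

print(len(calls))  # reports 1000 calls although only 500 are truly active
```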

If we're getting to the point where we lose the TCP connection with the loader, we should first find out why it is going down (what is causing the TCP connection to drop)

If there are blips due to the network, don't you think it's fine to just discard these entries once in a while? A dropped connection can interrupt a long-running test and completely skew the measurements.

moises-silva commented 9 years ago

@tgoodlet A simple 'blip' in the network wouldn't drop the TCP connection; a dropped TCP connection is not a sign of just a blip. TCP has mechanisms to avoid aborting the connection over a blip. In fact, you can disconnect the cable, connect it again, and chances are it will keep working. If the TCP connection is dropping, I believe it's a problem that needs to be looked at rather than worked around.

goodboy commented 9 years ago

@moises-silva fair enough.

moises-silva commented 9 years ago

Just to be clear: I am all for making this more reliable and fault-tolerant, but it doesn't seem like a priority, since in our test environment we should have pristine network connectivity with the loaders.

vodik commented 9 years ago

Why? Because when the TCP connection dropped, the CHANNEL_HANGUP events that should have triggered their removal were never received. I'm then left with stale entries which skew the load polling done by the call generator.

Can you not tie this together with a socket timeout? But I'm wondering if something more serious is going on here: TCP shouldn't lose messages. Do you have one ESL connection per call? Per load generator? TCP will slow to an unusable crawl before it starts giving up on delivering messages...
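One way to surface a dead connection at the socket layer, roughly along the lines suggested here (plain-socket sketch; switchy's actual ESL client code is not shown):

```python
import socket

ESL_TIMEOUT = 10  # seconds of silence before the connection is presumed dead


def read_event(sock):
    """Read from an ESL-style socket, raising if it goes quiet too long.

    Plain-socket sketch; switchy's actual ESL client is not shown here.
    """
    sock.settimeout(ESL_TIMEOUT)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        raise ConnectionError(
            "no ESL traffic for %ss; connection presumed dead" % ESL_TIMEOUT)
    if not data:  # peer closed the connection cleanly
        raise ConnectionError("ESL peer closed the connection")
    return data
```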

goodboy commented 8 years ago

This turns out more often to be due to lost events from slaves under high load. I added private methods so a user can clear the state manually: https://github.com/sangoma/switchy/blob/port_to_pandas/switchy/apps/call_gen.py#L544