instedd / verboice

Open source toolkit for voice services, with a special focus on the needs of medium- and low-income countries, scalable services, and interaction with vulnerable populations
http://verboice.instedd.org/
GNU General Public License v3.0

Improve logs and ensure update #909

Closed: aeatencio closed this 2 years ago

aeatencio commented 2 years ago

Sadly, this PR doesn't solve #900, but I expect it will help us to find a solution in the future.

Everything seems to indicate that the call state is properly updated to a state other than active when the call fails. I was able to reproduce a very similar error, obtaining very similar logs, but without reproducing the described effect.

With these changes, we'll have even stronger evidence that the error is somewhere else. Where? I don't know, but at least in the future we'll have more certainty and fewer places to look for the bug.

So, I'm sorry I didn't find the bug yet. We'll continue looking for you, little obscure bug!

Below are the logs mentioned above, obtained while debugging by adding a sleep() of 10 secs here and reducing the timeout to 5 secs there:

18:28:23.062 [info] [<0.678.0>|693f9ece] session: Start
18:28:28.117 [error] [<0.678.0>|7c752752] session: Error during session "34ff7c3c-2a60-42ef-b925-563660501091": exit:{timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}}
18:28:28.153 [info] [<0.661.0>|f99f6a8d] session: Call failed with reason {internal_error,{timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}}} and new state failed
18:28:28.153 [warning] [<0.661.0>|f99f6a8d] session: Session ("34ff7c3c-2a60-42ef-b925-563660501091") terminated with reason: {timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}}
18:28:28.153 [error] [<0.661.0>|undefined] unknown_module: gen_fsm {session,[51,52,102,102,55,99,51,99,45,50,97,54,48,45,52,50,101,102,45,98,57,50,53,45,53,54,51,54,54,48,53,48,49,48,57,49]} in state in_progress terminated with reason: {timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}}
18:28:28.154 [error] [<0.661.0>|undefined] unknown_module: CRASH REPORT Process <0.661.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}} in gen_fsm:terminate/7 line 611 
18:28:28.156 [error] [<0.145.0>|undefined] unknown_module: Supervisor session_sup had child "34ff7c3c-2a60-42ef-b925-563660501091" started with {session,start_link,undefined} at <0.661.0> exit with reason {timeout,{gen_server,call,[<0.677.0>,{capture,{text,"en",<<"Foo">>},5,"#",1,1},5000]}} in context child_terminated
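
For reference, here's a minimal self-contained sketch of how that exit:{timeout,{gen_server,call,...}} arises. This is illustrative code, not the actual Verboice session/pbx modules; the slow_capture module is hypothetical, and the 10-sec sleep and 5-sec timeout mirror the debugging setup described above:

```erlang
%% Minimal sketch (not Verboice code): a gen_server whose handler
%% sleeps longer than the caller's timeout, producing the same
%% {timeout,{gen_server,call,...}} exit seen in the logs above.
-module(slow_capture).
-behaviour(gen_server).
-export([start_link/0, capture/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link(?MODULE, [], []).

%% The caller gives up after 5 seconds, as in the session logs.
capture(Pid) ->
    gen_server:call(Pid, {capture, {text, "en", <<"Foo">>}}, 5000).

init([]) ->
    {ok, no_state}.

handle_call({capture, _Prompt}, _From, State) ->
    %% Simulates the injected sleep() of 10 secs: the reply arrives
    %% only after the caller's 5-second timeout has already fired.
    timer:sleep(10000),
    {reply, ok, State};
handle_call(_Other, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Calling slow_capture:capture(Pid) then exits the caller with {timeout,{gen_server,call,[Pid,{capture,...},5000]}}, just like the session error in the second log line.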

Thanks, @ysbaddaden, for your brilliant analysis. It helped me a lot. Indeed, this PR tries to help answer your question:

> Maybe the CallLog state is incorrectly updated in session:finalize?

I don't think so. But I expect these changes will help us to answer your question better in the future.

ysbaddaden commented 2 years ago

That looks good to me.

@matiasgarciaisaia what do you think?

matiasgarciaisaia commented 2 years ago

Sure, we can give it a try šŸ‘

Have you thought about logging the previous call state, too? I.e., "Call failed with reason ~p and new state ~p (was ~p)". That format tends to call out "inconsistencies" better for me (I may be OK with it saying "new state in-progress", but I will definitely notice something is off if it says "new state in-progress (was: finished)" or whatever).
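
In code, that would be something along these lines (a sketch; the variable names are illustrative, and I'm assuming the lager-style logging that the broker output above suggests):

```erlang
%% Sketch: log the previous state next to the new one, so a
%% suspicious transition (e.g. "new state active (was: failed)")
%% stands out immediately in the logs.
lager:info("Call failed with reason ~p and new state ~p (was ~p)",
           [Reason, NewState, PreviousState]).
```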

matiasgarciaisaia commented 2 years ago

Additionally, if we think these logs will only be useful for this specific bug, we can try monkey-patching the staging code instead of including the code in a release.

aeatencio commented 2 years ago

@matiasgarciaisaia, thanks for your help!

> Additionally, if we think these logs will only be useful for this specific bug, we can try monkey-patching the staging code instead of including the code in a release.

Regarding 72b35bd, I don't see it as something only useful for finding this bug. I think it makes the codebase more consistent.

Regarding f292f30, I don't know; maybe we're missing a place where we store the call_log history (or maybe I just didn't find it)? I think knowing that history could be useful in the future, not only for finding this bug, but also for understanding what happened with a specific call.

Also, have we ever reproduced this bug in staging? I thought one of the problems we have is that we don't yet know how to reproduce it, so improving our logs in production could help us in the future.

Do we know about more examples of this bug? Do we have these logs?

matiasgarciaisaia commented 2 years ago

Let's merge, then! šŸ‘

ysbaddaden commented 2 years ago

BTW @matiasgarciaisaia: maybe the backtrace you found is expected and misleading; maybe it fails later.

For example, when we get a failure, we try to schedule a retry (unless the retries are exhausted), and maybe we fail when trying to recontact? Maybe something isn't reset properly, so that when we retry, nothing happens?

matiasgarciaisaia commented 2 years ago

I think there might be two things happening here together: the call fails for some reason (an overloaded server? I don't know), and then we fail to mark the call as finished (i.e., we miss a callback from Twilio, or end up with an exception that prevents the broker/web from updating the call state, or hit a race condition between the broker and the web both trying to update it at the same time?).

So, yeah - I'm not pointing fingers, because I really don't have a suspect.
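
If it is a race like that, a guarded state update would at least make it visible. A hypothetical sketch, not the actual broker/web code (and a real fix would want an atomic compare-and-set in the database, since an ets lookup-then-insert is itself racy):

```erlang
-module(call_state).
-export([init/0, update/2]).

%% Hypothetical sketch: refuse to overwrite a terminal call state,
%% and log both sides of every transition, so a late writer (broker
%% or web) shows up in the logs instead of silently losing the state.
init() ->
    ets:new(calls, [named_table, public, set]).

update(CallId, NewState) ->
    case ets:lookup(calls, CallId) of
        [{CallId, Old}] when Old =:= failed; Old =:= completed ->
            error_logger:warning_msg("refusing call ~p: ~p -> ~p~n",
                                     [CallId, Old, NewState]),
            {error, {already_terminal, Old}};
        [{CallId, Old}] ->
            ets:insert(calls, {CallId, NewState}),
            error_logger:info_msg("call ~p: ~p -> ~p~n",
                                  [CallId, Old, NewState]),
            ok;
        [] ->
            ets:insert(calls, {CallId, NewState}),
            ok
    end.
```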