Closed: aeatencio closed this 2 years ago
That looks good to me.
@matiasgarciaisaia what do you think?
Sure, we can give it a try!
Have you thought about logging the previous call state, too? I.e., "Call failed with reason ~p and new state ~p (was ~p)". It tends to call out inconsistencies better for me: I may be OK with it saying "new state in-progress", but I will definitely notice something off if it says "new state in-progress (was: finished)" or whatever.
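A minimal sketch of what that log line could look like, assuming a hypothetical `log_call_failure/3` helper and that the previous state is still in scope as `OldState` (the function and variable names are illustrative, not from the actual codebase):

```erlang
%% Hypothetical helper; names are illustrative, not from the codebase.
%% Logs the failure reason plus both the new and the previous call state,
%% so impossible-looking transitions (e.g. finished -> in-progress) stand
%% out in the logs.
log_call_failure(Reason, NewState, OldState) ->
    error_logger:error_msg(
        "Call failed with reason ~p and new state ~p (was ~p)~n",
        [Reason, NewState, OldState]).
```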
Additionally, if we think these logs will only be useful for this specific bug, we can try monkey-patching the staging code instead of including the change in a release.
@matiasgarciaisaia, thanks for your help!
> Additionally, if we think these logs will only be useful for this specific bug, we can try monkey-patching the staging code instead of including the change in a release.
Regarding 72b35bd, I don't see it as something only useful for finding this bug. I think it makes the codebase more consistent.
Regarding f292f30, I don't know. Maybe we're missing (or maybe I just didn't find it) a place where we store the call_log history? I think knowing that history could be useful in the future, not only while trying to find this bug but also when trying to understand what happened with a specific call.
Also, have we ever reproduced this bug in staging? I thought one of the problems we have is that we don't know yet how to reproduce it, so improving our logs in production could help us in the future.
Do we know about more examples of this bug? Do we have these logs?
Let's merge, then!
BTW @matiasgarciaisaia: maybe the backtrace you found is expected and misleading; maybe it fails later.
For example, when we get a failure we try to schedule a retry (unless the retries are exhausted). Maybe we fail when trying to recontact? Maybe something isn't reset properly, so that when we retry nothing happens?
I think there might be two things happening here together: the call fails for some reason (an overloaded server? I don't know), and then we fail to mark the call as finished (i.e., we miss a callback from Twilio, or end up with an exception that prevents the broker/web from updating the call state, or hit a race condition between the broker and the web both trying to update it at the same time).
So, yeah - I'm not pointing fingers, because I really don't have a suspect.
Sadly, this PR doesn't solve #900, but I expect it will help us to find a solution in the future.
Everything seems to point to the call state being properly updated to a state other than `active` when the call fails. I was able to reproduce a very similar error, obtaining very similar logs, but without reproducing the described effect.

With these changes, we'll be even more confident that the error is somewhere else. Where? I don't know, but at least in the future we'll have more certainty and fewer places to look for the bug.
So, I'm sorry I didn't find the bug yet. We'll continue looking for you, little obscure bug!
Below are the mentioned logs, obtained while debugging by adding a `sleep()` of 10 secs here and reducing the timeout to 5 secs there:

Thanks, @ysbaddaden, for your brilliant analysis. It helped me a lot. Indeed, this PR tries to help answer your question:
I don't think so. But I expect these changes will help us to answer your question better in the future.