Closed agclark27 closed 4 years ago
Logs are not enough in this case, because they only tell us a transport instance has been used after it was destroyed somewhere. We'll need a complete log with refcount debugging enabled, as transports there are refcounted.
That said, where were these tests in the past two years the PR has been open, and I've repeatedly asked for feedback (including multiple "last calls" in the past few weeks)? :wink:
We're working to trying to be able to reproduce this in a testable way to be able to provide additional information beyond the AddressSanitizer outputs. We had been testing with the http-single-thread branch over the past couple of years, but we've only been observing these race conditions and issues when running on production servers under heavy load over the past week as a result of increased utilization for virtual events.
We haven't yet been able to reproduce the crash on-demand, but we're seeing a possible issue in janus_http.c where there is an access to transport with "session->longpolls = g_list_remove(session->longpolls, transport);" following the removal of it "janus_refcount_decrease(&transport->ref)" in a timeout condition. In other calls to g_list_remove, the call is happening prior to the call to janus_refcount_decrease. Are we reading this correctly?
https://github.com/meetecho/janus-gateway/blob/master/transports/janus_http.c#L948
Logs seem to mention like 943, though, not 948. Anyway, if that's the case, it would still crash at another point, because it means the count of references is off somewhere, and it would simply invert the order of decreases. That's why we'd need a list of calls, to track how the one that dies gets dereferenced too soon.
@agclark81 you may want to check if @tmatth's fix (which we now merged) helps.
Managed to reproduce on latest master.
Managed to reproduce on latest master.
Ok not a huge surprise (the NULL check would've been lucky).
Would it make sense to increase the transport session's refcount while it's in scope in janus_http_send_message
?
I don't think increasing counters now is useful, until we understand why it's going to 0 in the first place. The count may be fine, we may simply be doing an extra one somewhere that doesn't belong, or doing it too early, and adding an extra increase/decrease may be fixing the issue but introducing leaks. That's why I need a refcount debug.
Closing.
We're running a build from the latest master, including the fixes from #2000 (thank you!), and we're seeing frequent crashes that AddressSanitizer is pegging as a heap-use-after-free in janus_http_send_message transports/janus_http.c:943. It's not generating a coredump for us, but the service logs at the failure point follow a similar pattern:
Example 1: https://pastebin.com/8zHNHJGV Example 2: https://pastebin.com/WXE4GWJd