Frequent service crashes since PR #1173

meetecho / janus-gateway

Janus WebRTC Server

https://janus.conf.meetecho.com

GNU General Public License v3.0

8.22k stars 2.48k forks source link

Frequent service crashes since PR #1173 #2005

Closed agclark27 closed 4 years ago

agclark27 commented 4 years ago

We're running a build from the latest master, including the fixes from #2000 (thank you!), and we're seeing frequent crashes that AddressSanitizer is pegging as a heap-use-after-free in janus_http_send_message transports/janus_http.c:943. It's not generating a coredump for us, but the service logs at the failure point follow a similar pattern:

Example 1: https://pastebin.com/8zHNHJGV Example 2: https://pastebin.com/WXE4GWJd

lminiero commented 4 years ago

Logs are not enough in this case, because they only tell us a transport instance has been used after it was destroyed somewhere. We'll need a complete log with refcount debugging enabled, as transports there are refcounted.

lminiero commented 4 years ago

That said, where were these tests in the past two years the PR has been open, and I've repeatedly asked for feedback (including multiple "last calls" in the past few weeks)? :wink:

agclark27 commented 4 years ago

We're working to trying to be able to reproduce this in a testable way to be able to provide additional information beyond the AddressSanitizer outputs. We had been testing with the http-single-thread branch over the past couple of years, but we've only been observing these race conditions and issues when running on production servers under heavy load over the past week as a result of increased utilization for virtual events.

agclark27 commented 4 years ago

We haven't yet been able to reproduce the crash on-demand, but we're seeing a possible issue in janus_http.c where there is an access to transport with "session->longpolls = g_list_remove(session->longpolls, transport);" following the removal of it "janus_refcount_decrease(&transport->ref)" in a timeout condition. In other calls to g_list_remove, the call is happening prior to the call to janus_refcount_decrease. Are we reading this correctly?

https://github.com/meetecho/janus-gateway/blob/master/transports/janus_http.c#L948

lminiero commented 4 years ago

Logs seem to mention like 943, though, not 948. Anyway, if that's the case, it would still crash at another point, because it means the count of references is off somewhere, and it would simply invert the order of decreases. That's why we'd need a list of calls, to track how the one that dies gets dereferenced too soon.

lminiero commented 4 years ago

@agclark81 you may want to check if @tmatth's fix (which we now merged) helps.

atoppi commented 4 years ago

Managed to reproduce on latest master.

libasan stack trace

tmatth commented 4 years ago

Managed to reproduce on latest master.

libasan stack trace

Ok not a huge surprise (the NULL check would've been lucky).

Would it make sense to increase the transport session's refcount while it's in scope in janus_http_send_message?

lminiero commented 4 years ago

I don't think increasing counters now is useful, until we understand why it's going to 0 in the first place. The count may be fine, we may simply be doing an extra one somewhere that doesn't belong, or doing it too early, and adding an extra increase/decrease may be fixing the issue but introducing leaks. That's why I need a refcount debug.

lminiero commented 4 years ago

Closing.