meetecho / janus-gateway

Janus WebRTC Server
https://janus.conf.meetecho.com
GNU General Public License v3.0
8.17k stars 2.47k forks source link

[1.x] segmentation faults on videoroom hangup #3154

Closed spscream closed 1 year ago

spscream commented 1 year ago

What version of Janus is this happening on? 1.1.0, 1.1.1

Have you tested a more recent version of Janus too? problem occured first on 1.1.0 and we upgraded to 1.1.1 and it steal appears.

Was this working before? this is occuring only on one of our customers servers and not occuring on others with the same version(1.1.0 and 1.1.1).

Is there a gdb or libasan trace of the issue? https://gist.github.com/spscream/6ee8265dbc7d0c8f7fa24c2c9c0f8eef

Additional context We use videoroom plugin mainly and this segmentation fault appers only on one customer. This customer also having some random issues with turn/stun servers and connectivity between janus and webrtc clients, so this could not be directly related to the crash, but maybe this leads to this segmentation fault.

maybe this duplicates https://github.com/meetecho/janus-gateway/issues/3124

lminiero commented 1 year ago

The current version of Janus master is 1.1.3, please try with that one too.

spscream commented 1 year ago

okey, we will install it and i will return with logs from it. But i didn't see any changes which can fix this issue on master.

spscream commented 1 year ago

tha same happened on master

atoppi commented 1 year ago

@spscream thanks for checking, would you please share an ASan stack trace from the current master?

lminiero commented 1 year ago

@spscream if you also have ways that we can use to consistently replicate it ourselves that would help too!

spscream commented 1 year ago

sure, here is https://gist.github.com/spscream/b7ebe454303c3b044bc24f83cd8aea0e

spscream commented 1 year ago

I don't know hot to make repetition of this, it happens only on one environment

spscream commented 1 year ago

maybe this related to some buggy participant which calls unpublish twice in parallel, we are trying to figure it out

spscream commented 1 year ago

We got some more crashes again, they are in the same place as i provided before. In one case janus crashed when videoroom was destroyed while people were in it and in parallel clients were sending hangup requests - in the end of the group meeting organizer called termination and destroy were called as a part of it.

I can't provide any additional logs at moment. I'm trying to repeat crash myself by integration tests, but I have no success.

lminiero commented 1 year ago

Since I suspect it's an issue with reference counting, where something is freed sooner than it should, one thing that might help is if you uncommented this line and recompiled Janus. It will make the logs explode in size, since it will be tracking all refs and unrefs, but should the problem occur again it would give us much more info to track where and why the affected resources was freed before its time. You may only want to do that in your own local deployment, maybe, rather than the one in production: your choice! But that would definitely help, especially if you're able to replicate it eventually.

spscream commented 1 year ago

Hi! we got some crashes and logs with refcount debug here: https://drive.google.com/drive/folders/10ppbofxkUNvnFJ0kxytvi5dNgxIkvojQ?usp=sharing

in janus.log is a stdout of janus, you can find exact moments of crash by janus version banner(Janus version: 1103 (1.1.3)).

atoppi commented 1 year ago
Janus commit: 806ac371e9d829b880e0af62e7b6f24ba698218d

is not a valid commit hash, I guess we should refer to current master since the version is mentioning 1.1.3 ?

spscream commented 1 year ago

https://github.com/spscream/janus-gateway it refers to my fork with refs debug enabled

atoppi commented 1 year ago

@spscream are you running janus on a 32-bit architecture?

spscream commented 1 year ago

no, we are running on 64bit

spscream commented 1 year ago

do you have any glue what is happening or how can I help to debug it?

lminiero commented 1 year ago

Still investigating.

lminiero commented 1 year ago

@spscream please test the PR above.

spscream commented 1 year ago

thank you! I build image with this pr and gave it to our customer

spscream commented 1 year ago

it is crashed again on 2 different sites 2 or 3 times. I can provide log only for one crash now. Waiting for crashes on other sites for more logs. https://drive.google.com/drive/folders/1CrIs9695SzIs21dtONOOUBHez_DIP3cb?usp=sharing

atoppi commented 1 year ago

@spscream thanks for reporting, the patch has been updated with another fix.

atoppi commented 1 year ago

@spscream have you had the chance to test the updated patch ?

zavndw commented 1 year ago

@spscream have you had the chance to test the updated patch ?

yes. We await new crush log from customer. It is crashed again

spscream commented 1 year ago

we got the crash and log: https://drive.google.com/drive/folders/1ATMh-cWhvcVNOJ22El_bFk8DudlsKLdI?usp=sharing

atoppi commented 1 year ago

@spscream thanks for the logs! The patch has been updated again, please test and report.

spscream commented 1 year ago

it crashed again: https://drive.google.com/drive/folders/1dHMNm5hpa-wAnKstSuAEUCIquVyWIjZc?usp=sharing

atoppi commented 1 year ago

The reference that made it crash is not present in the log file, probably because it has been removed before the instant the file starts. Do you have the whole file?

spscream commented 1 year ago

no, they have they logs limited by size and this is all we have. We will wait for another crash

spscream commented 1 year ago

@atoppi i added archive with full log to the same folder

atoppi commented 1 year ago

@spscream patch updated according to latest crash data, please test and report.

spscream commented 1 year ago

two days since running on latest patch and no crashes, we will run it for few more days

spscream commented 1 year ago

crashed again today https://drive.google.com/drive/folders/1VXaoUhOt2eXnl9OtqsWyopt1oSz7WI23?usp=sharing

spscream commented 1 year ago

crashed again twice yesterday, do you need any additional info?

lminiero commented 1 year ago

We're busy on other activities these days, sorry. We'll have a look as soon as possible. That said, I'll probably merge the existing PR as it is in the next few days, since it already solves many instances of crashes you had: the crash log you shared a few days ago seemed to be triggered in a different point, so may be worth investigating in a new PR.

atoppi commented 1 year ago

@spscream please share the core dumps and logs

spscream commented 1 year ago

@spscream please share the core dumps and logs

unfortunately, we didn't get crash logs for yesterday crashes. If we will have any new crashes I will share logs for it.

spscream commented 1 year ago

I got crash from 17.03: https://drive.google.com/drive/folders/1qD3T4n77sz5e8--TGbSrL26WY2Aats-O?usp=sharing

atoppi commented 1 year ago

I got crash from 17.03: https://drive.google.com/drive/folders/1qD3T4n77sz5e8--TGbSrL26WY2Aats-O?usp=sharing

This one happened into the core:

==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8
READ of size 4 at 0x60400315b9e0 thread T44963 (pool-janus)
    #0 0x558859f37dc6 in janus_plugin_session_is_alive /build/janus-gateway/src/ice.c:750
    #1 0x558859fa7b99 in janus_process_incoming_request /build/janus-gateway/src/janus.c:1686
    #2 0x558859fb34c6 in janus_transport_task /build/janus-gateway/src/janus.c:3478

The thread is trying to read the atomic plugin_session->stopped to check the session liveness, whereas the session was already been freed in the VR plugin:

freed by thread T45081 (hloop 497628586) here:
   #0 0x7f00fbafab6f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:123
   #1 0x558859f1d605 in janus_ice_plugin_session_free /build/janus-gateway/src/ice.c:1591
   #2 0x7f00f3f78b9c in janus_videoroom_session_free plugins/janus_videoroom.c:2542
   #3 0x7f00f3fc878b in janus_videoroom_destroy_session plugins/janus_videoroom.c:4360
   #4 0x558859f7fda7 in janus_ice_outgoing_traffic_handle /build/janus-gateway/src/ice.c:4500

This is the trace of the involved reference:

2023-03-17T13:11:50.171327+03:00: [ice.c:janus_ice_handle_attach_plugin:1382:init] 0x60400315b9e8 (1)
2023-03-17T13:11:50.171391+03:00: [ice.c:janus_ice_handle_attach_plugin:1384:increase] 0x60400315b9e8 (2)
2023-03-17T13:11:50.182581+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:50.182842+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:54.288829+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:54.289075+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:55.888737+03:00: [ice.c:janus_plugin_session_dereference:763:decrease] 0x60400315b9e8 (1)
2023-03-17T13:11:55.888797+03:00: Detaching handle from JANUS VideoRoom plugin; 0x612001a9a1c0 0x60400315b9d0 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.889041+03:00: [janus.plugin.videoroom-0x60400315b9d0] No WebRTC media anymore; 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.891391+03:00: [plugins/janus_videoroom.c:janus_videoroom_session_free:2542:decrease] 0x60400315b9e8 (0)
2023-03-17T13:11:55.893659+03:00: ==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8

The session refcount goes to 0 when handling a Detach request, after the plugin is notified through its internal loop. Concurrently, Janus is handling another message request in the core (on another thread) and is failing in the check specifically meant to validate the involved plugin session, just before submitting the message to the plugin.

https://github.com/meetecho/janus-gateway/blob/6ddaaa949c01d49e3ed51166a83ca02b7be5d4fc/src/janus.c#L1685-L1693

We are facing many different race conditions here, but each of them is related to VR session teardown.

lminiero commented 1 year ago

As anticipated, I merged #3167, so when we have a possible fix for these other issues we'll open a new PR.

zavndw commented 1 year ago

crashed again today https://drive.google.com/drive/folders/1oHJoU-KGZAZnFNsEhCPY6lQYfAbb4QJQ?usp=share_link

lminiero commented 1 year ago

Please test the PR above.

lminiero commented 1 year ago

@spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?

zavndw commented 1 year ago

@spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?

Yes. We right away assembled the update and delivered it to the customer. No complaints for two weeks

lminiero commented 1 year ago

I'll merge and close then.