Closed spscream closed 1 year ago
The current version of Janus master is 1.1.3, please try with that one too.
okey, we will install it and i will return with logs from it. But i didn't see any changes which can fix this issue on master.
tha same happened on master
@spscream thanks for checking, would you please share an ASan stack trace from the current master?
@spscream if you also have ways that we can use to consistently replicate it ourselves that would help too!
I don't know hot to make repetition of this, it happens only on one environment
maybe this related to some buggy participant which calls unpublish twice in parallel, we are trying to figure it out
We got some more crashes again, they are in the same place as i provided before. In one case janus crashed when videoroom was destroyed while people were in it and in parallel clients were sending hangup requests - in the end of the group meeting organizer called termination and destroy were called as a part of it.
I can't provide any additional logs at moment. I'm trying to repeat crash myself by integration tests, but I have no success.
Since I suspect it's an issue with reference counting, where something is freed sooner than it should, one thing that might help is if you uncommented this line and recompiled Janus. It will make the logs explode in size, since it will be tracking all refs and unrefs, but should the problem occur again it would give us much more info to track where and why the affected resources was freed before its time. You may only want to do that in your own local deployment, maybe, rather than the one in production: your choice! But that would definitely help, especially if you're able to replicate it eventually.
Hi! we got some crashes and logs with refcount debug here: https://drive.google.com/drive/folders/10ppbofxkUNvnFJ0kxytvi5dNgxIkvojQ?usp=sharing
in janus.log is a stdout of janus, you can find exact moments of crash by janus version banner(Janus version: 1103 (1.1.3)).
Janus commit: 806ac371e9d829b880e0af62e7b6f24ba698218d
is not a valid commit hash, I guess we should refer to current master since the version is mentioning 1.1.3 ?
https://github.com/spscream/janus-gateway it refers to my fork with refs debug enabled
@spscream are you running janus on a 32-bit architecture?
no, we are running on 64bit
do you have any glue what is happening or how can I help to debug it?
Still investigating.
@spscream please test the PR above.
thank you! I build image with this pr and gave it to our customer
it is crashed again on 2 different sites 2 or 3 times. I can provide log only for one crash now. Waiting for crashes on other sites for more logs. https://drive.google.com/drive/folders/1CrIs9695SzIs21dtONOOUBHez_DIP3cb?usp=sharing
@spscream thanks for reporting, the patch has been updated with another fix.
@spscream have you had the chance to test the updated patch ?
@spscream have you had the chance to test the updated patch ?
yes. We await new crush log from customer. It is crashed again
we got the crash and log: https://drive.google.com/drive/folders/1ATMh-cWhvcVNOJ22El_bFk8DudlsKLdI?usp=sharing
@spscream thanks for the logs! The patch has been updated again, please test and report.
The reference that made it crash is not present in the log file, probably because it has been removed before the instant the file starts. Do you have the whole file?
no, they have they logs limited by size and this is all we have. We will wait for another crash
@atoppi i added archive with full log to the same folder
@spscream patch updated according to latest crash data, please test and report.
two days since running on latest patch and no crashes, we will run it for few more days
crashed again twice yesterday, do you need any additional info?
We're busy on other activities these days, sorry. We'll have a look as soon as possible. That said, I'll probably merge the existing PR as it is in the next few days, since it already solves many instances of crashes you had: the crash log you shared a few days ago seemed to be triggered in a different point, so may be worth investigating in a new PR.
@spscream please share the core dumps and logs
@spscream please share the core dumps and logs
unfortunately, we didn't get crash logs for yesterday crashes. If we will have any new crashes I will share logs for it.
I got crash from 17.03: https://drive.google.com/drive/folders/1qD3T4n77sz5e8--TGbSrL26WY2Aats-O?usp=sharing
I got crash from 17.03: https://drive.google.com/drive/folders/1qD3T4n77sz5e8--TGbSrL26WY2Aats-O?usp=sharing
This one happened into the core:
==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8
READ of size 4 at 0x60400315b9e0 thread T44963 (pool-janus)
#0 0x558859f37dc6 in janus_plugin_session_is_alive /build/janus-gateway/src/ice.c:750
#1 0x558859fa7b99 in janus_process_incoming_request /build/janus-gateway/src/janus.c:1686
#2 0x558859fb34c6 in janus_transport_task /build/janus-gateway/src/janus.c:3478
The thread is trying to read the atomic plugin_session->stopped
to check the session liveness, whereas the session was already been freed in the VR plugin:
freed by thread T45081 (hloop 497628586) here:
#0 0x7f00fbafab6f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:123
#1 0x558859f1d605 in janus_ice_plugin_session_free /build/janus-gateway/src/ice.c:1591
#2 0x7f00f3f78b9c in janus_videoroom_session_free plugins/janus_videoroom.c:2542
#3 0x7f00f3fc878b in janus_videoroom_destroy_session plugins/janus_videoroom.c:4360
#4 0x558859f7fda7 in janus_ice_outgoing_traffic_handle /build/janus-gateway/src/ice.c:4500
This is the trace of the involved reference:
2023-03-17T13:11:50.171327+03:00: [ice.c:janus_ice_handle_attach_plugin:1382:init] 0x60400315b9e8 (1)
2023-03-17T13:11:50.171391+03:00: [ice.c:janus_ice_handle_attach_plugin:1384:increase] 0x60400315b9e8 (2)
2023-03-17T13:11:50.182581+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:50.182842+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:54.288829+03:00: [janus.c:janus_plugin_push_event:3592:increase] 0x60400315b9e8 (3)
2023-03-17T13:11:54.289075+03:00: [janus.c:janus_plugin_push_event:3671:decrease] 0x60400315b9e8 (2)
2023-03-17T13:11:55.888737+03:00: [ice.c:janus_plugin_session_dereference:763:decrease] 0x60400315b9e8 (1)
2023-03-17T13:11:55.888797+03:00: Detaching handle from JANUS VideoRoom plugin; 0x612001a9a1c0 0x60400315b9d0 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.889041+03:00: [janus.plugin.videoroom-0x60400315b9d0] No WebRTC media anymore; 0x612001a9a1c0 0x607003756ea0
2023-03-17T13:11:55.891391+03:00: [plugins/janus_videoroom.c:janus_videoroom_session_free:2542:decrease] 0x60400315b9e8 (0)
2023-03-17T13:11:55.893659+03:00: ==1==ERROR: AddressSanitizer: heap-use-after-free on address 0x60400315b9e0 at pc 0x558859f37dc7 bp 0x7effbd418600 sp 0x7effbd4185f8
The session refcount goes to 0 when handling a Detach
request, after the plugin is notified through its internal loop.
Concurrently, Janus is handling another message
request in the core (on another thread) and is failing in the check specifically meant to validate the involved plugin session, just before submitting the message
to the plugin.
We are facing many different race conditions here, but each of them is related to VR session teardown.
As anticipated, I merged #3167, so when we have a possible fix for these other issues we'll open a new PR.
Please test the PR above.
@spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?
@spscream @zavndw any update? Can I assume the fix works and merge, since it's been two weeks?
Yes. We right away assembled the update and delivered it to the customer. No complaints for two weeks
I'll merge and close then.
What version of Janus is this happening on? 1.1.0, 1.1.1
Have you tested a more recent version of Janus too? problem occured first on 1.1.0 and we upgraded to 1.1.1 and it steal appears.
Was this working before? this is occuring only on one of our customers servers and not occuring on others with the same version(1.1.0 and 1.1.1).
Is there a gdb or libasan trace of the issue? https://gist.github.com/spscream/6ee8265dbc7d0c8f7fa24c2c9c0f8eef
Additional context We use videoroom plugin mainly and this segmentation fault appers only on one customer. This customer also having some random issues with turn/stun servers and connectivity between janus and webrtc clients, so this could not be directly related to the crash, but maybe this leads to this segmentation fault.
maybe this duplicates https://github.com/meetecho/janus-gateway/issues/3124