pjsip / pjproject

PJSIP project
http://www.pjsip.org
GNU General Public License v2.0
2k stars 767 forks source link

pjsua2 media player block on OnEof2 callback when call hangup by other client #3604

Closed haozxuan closed 1 year ago

haozxuan commented 1 year ago

Describe the bug

I integrated pjsip as a sip client in my linux server to initiate a call to the sip server (outbound calls only, no receiving call scenarios). The testing and research process is very good, and it can work normally, but deadlocks occasionally appear. No longer accepting outbound tasks, but has been in a deadlock state;

Scenario description: According to the logs, every time the deadlock occurs when the client actively hangs up and the server is cleaning up tasks, I try to print a lot of debug logs to locate the specific behaviors that may cause the deadlock, but even so , the deadlock still exists;

Any ideas or direction would be greatly appreciated

Steps to reproduce

server side client side
1. Initialize the account and wait for the registration to complete
2. Initiate an outbound call and decide stopTranslate based on the number of calls
3. Clean up wav player hang up
4. Take the initiative to hang up

and some code like that

package class

import (
    "git.admin.com/uvms/pjsip/pjsua2"
    "go.uber.org/zap"
)

type Player struct {
    pjsua2.AudioMediaPlayer
    count    uint // Audio playback times (accumulated from 0)
    maxNum   uint // Maximum number of plays (1 or 2 or 3)
    audMed   pjsua2.AudioMedia
    recorder pjsua2.AudioMediaRecorder // Used to StopTransmit after playing the specified number of times
    call     pjsua2.Call               // Used to actively hang up after playing
    isIVR    bool                      // After the IVR voice is played, it will not actively hang up
}

func NewPlayer(num uint, audMed pjsua2.AudioMedia, recorder pjsua2.AudioMediaRecorder, call pjsua2.Call, isIVR bool) *Player {
    player := &Player{
        maxNum:   num,
        audMed:   audMed,
        recorder: recorder,
        call:     call,
        isIVR:    isIVR,
    }
    player.AudioMediaPlayer = pjsua2.NewDirectorAudioMediaPlayer(player)

    return player
}

// OnEof2
/**
 * Register a callback to be called when the playFile player reading has
 * reached the end of playFile, or when the playFile reading has reached the
 * end of playFile of the last playFile for a playlist. If the playFile or playlist
 * is set to play repeatedly, then the callback will be called multiple
 * times.
 *
 * If application wishes to stop the playback, it can stop the media
 * transmission in the callback, and only after all transmissions have
 * been stopped, could the application safely destroy the player.
 */
func (p *Player) OnEof2() {
    // https://trac.pjsip.org/repos/wiki/PJSUA_Locks#SoftDeadlock
    // Calling call in the callback may introduce soft deadlock. To avoid deadlock, pjsua adopts throw error after timeout, which needs to be captured to avoid panic
    defer func() {
        if r := recover(); r != nil {
            log.Error("*** Player OnEof2:", zap.Any("[panic]recover err", r))
        }
    }()

    p.count++
    log.Info("*** Player OnEof2:", zap.Uint("play times", p.count), zap.Uint("play max", p.maxNum),
        zap.Uintptr("call ptr", p.call.Swigcptr()), zap.Uintptr("call sound ptr", p.audMed.Swigcptr()),
        zap.Uintptr("wav sound ptr", p.Swigcptr()), zap.Uintptr("file record ptr", p.recorder.Swigcptr()))

    // The user side actively hangs up and cannot access any attributes on the call. It can only be verified by judging whether the pointer is empty.
    if p.call.Swigcptr() == 0 {
        log.Info("*** Player OnEof2: [user hangup] stop player/recorder and delete")
        p.StopTransmit(p.recorder)
        log.Info("*** Player OnEof2: [user hangup][doing] stop wav->recorder OK")
        pjsua2.DeleteAudioMediaRecorder(p.recorder)
        log.Info("*** Player OnEof2: [user hangup][doing] gc recorder OK")
        pjsua2.DeleteDirectorAudioMediaPlayer(p)
        log.Info("*** Player OnEof2: [user hangup][doing] gc wav player OK")
        return
    }

    if !p.isIVR && p.count == p.maxNum {
        // After playing, stop the transmission, destroy the player, and actively hang up the call
        log.Info("*** Player OnEof2: [server hangup] stop player/recorder and delete")
        p.StopTransmit(p.audMed)
        log.Info("*** Player OnEof2: [server hangup][doing] stop wav->call OK")

        p.StopTransmit(p.recorder)
        log.Info("*** Player OnEof2: [server hangup][doing] stop wav->recorder OK")

        p.audMed.StopTransmit(p.recorder)
        log.Info("*** Player OnEof2: [server hangup][doing] stop call->recorder OK")

        pjsua2.DeleteAudioMediaRecorder(p.recorder)
        log.Info("*** Player OnEof2: [server hangup][doing] gc record OK")

        pjsua2.DeleteDirectorAudioMediaPlayer(p)
        log.Info("*** Player OnEof2: [server hangup][doing] gc wav player OK")

        // MUST after StopTransmit
        p.call.Hangup(pjsua2.NewCallOpParam(true))
        log.Info("*** Player OnEof2: [server hangup][done] call hangup OK")
    }
}

PJSIP version

2.13

Context

Platform: Centos7.4 Config:config_site.h

// for dtmf
#define PJMEDIA_TELEPHONE_EVENT_ALL_CLOCKRATES 0
// for maximum performance
#define PJ_SCANNER_USE_BITWISE       0
#undef PJ_OS_HAS_CHECK_STACK
#define PJ_OS_HAS_CHECK_STACK        0
#define PJ_LOG_MAX_LEVEL             6
#define PJ_IOQUEUE_MAX_HANDLES       1024
#define PJSIP_MAX_TSX_COUNT          ((640*1024)-1)
#define PJSIP_MAX_DIALOG_COUNT       ((640*1024)-1)
#define PJSIP_UDP_SO_SNDBUF_SIZE     (24*1024*1024)
#define PJSIP_UDP_SO_RCVBUF_SIZE     (24*1024*1024)
#define PJ_DEBUG                     1
#define PJSIP_SAFE_MODULE            0
#define PJ_HAS_STRICMP_ALNUM         0
#define PJSIP_UNESCAPE_IN_PLACE      1
#define PJSUA_MAX_CALLS              512
#define PJSUA_MAX_ACC                512
#define PJSUA_MAX_PLAYERS            512
#define PJSUA_MAX_RECORDERS          512
#define PJSUA_MAX_CONF_PORTS         (PJSUA_MAX_CALLS+2*PJSUA_MAX_PLAYERS)

build:./configure --enable-epoll swig:-c++ -go -cgo -intgosize 64 -outcurrentdir -I/usr/local/include /data/pjproject-2.13/pjsip-apps/src/swig/pjsua2.i golang : 1.20

Based on historical issues, I have the following guesses:

  1. Pjsua-level locks and event callback processing logic internal locks, which affect each other
  2. Due to concurrency reasons, due to the existence of pjsua-level locks in the init account and add process, a deadlock occurs with the callback processing internal lock;

Verification process: Due to the very strict requirements for the callback on the server side and the active hangup on the client side, it only appears in online business for the time being, and it is not easy to reproduce locally.

sauwming commented 1 year ago

For deadlock issue, we need the stack traces so we can know which threads are involved in the deadlock and at which functions.

haozxuan commented 1 year ago

How do I cooperate, log level 5 or 6 and PJ_DEBUG 1? are these enough,Any ideas would be greatly appreciated

haozxuan commented 1 year ago

for deadlock by golang pprof debug, i found some like this

10 @ 0x4bf2fc 0x11a5725 0x1232236 0x123233c 0x12a6717 0x12ba576 0x12ba173 0xed730e 0xed70c8 0x52b001
#   0x11a5724   git.admin.com/uvms/pjsip/pjsua2._Cfunc__wrap_Account_create__SWIG_1_pjsua2_32805695c8aaaed9+0x44    _cgo_gotypes.go:4472
#   0x1232235   git.admin.com/uvms/pjsip/pjsua2.SwigcptrAccount.Create__SWIG_1+0x55             /data/runner_builds/go/pkg/mod/git.admin.com/uvms/pjsip@v0.0.0-20230605065620-c834a241644f/pjsua2/pjsua2.go:25093
#   0x123233b   git.admin.com/uvms/pjsip/pjsua2.SwigcptrAccount.Create+0xbb                 /data/runner_builds/go/pkg/mod/git.admin.com/uvms/pjsip@v0.0.0-20230605065620-c834a241644f/pjsua2/pjsua2.go:25099
#   0x12a6716   git.admin.com/uvms/pjsip/class.InitAccount+0x436                            /data/runner_builds/go/pkg/mod/git.admin.com/uvms/pjsip@v0.0.0-20230605065620-c834a241644f/class/account.go:71
#   0x12ba575   git.admin.com/uvms/consumer-vms/service/sip.(*Sip).call+0x2f5                   /data/runner_builds/eQMxWkS7/0/uvms/consumer-vms/service/sip/sip.go:107
#   0x12ba172   git.admin.com/uvms/consumer-vms/service/sip.(*Sip).handle+0x32                  /data/runner_builds/eQMxWkS7/0/uvms/consumer-vms/service/sip/sip.go:95
#   0xed730d    git.admin.com/uvms/utils/queue.(*Queue).start+0x1cd                     /data/runner_builds/go/pkg/mod/git.admin.com/uvms/utils@v1.0.2/queue/queue.go:102
#   0xed70c7    git.admin.com/uvms/utils/queue.(*Queue).Start.func1+0x107                       /data/runner_builds/go/pkg/mod/git.admin.com/uvms/utils@v1.0.2/queue/queue.go:70
sauwming commented 1 year ago

Yes, something like that. But deadlocks typically involve two or more threads. Example: https://github.com/pjsip/pjproject/pull/3492

haozxuan commented 1 year ago

Ok, I am going to modify the log level configuration, and publish it online, waiting for the next deadlock to occur

haozxuan commented 1 year ago

Thanks a lot @sauwming, marsking00.log marsking04.log marsking03.log marsking02.log marsking01.log Although there was no blocking phenomenon today, there was a crash situation. I desensitized the online log and submitted it;

BTW: Based on my personal understanding of the business, I always encounter problems when the client actively hangs up. I hope there are any suggestions or ways to properly handle this scenario

sauwming commented 1 year ago

There are 100+ MBs in the logs. Where exactly did the crash occur?

haozxuan commented 1 year ago

file marsking04.log and log [2023-06-22T16:25:36.251+0800 ERROR class/logger.go:56 {"": "16:25:36.251 media.cpp ev_thread pjsua_conf_disconnect(id, si nk.id) error: Invalid value or argument (PJ_EINVAL) (status=70004) [../src/pjsua2/media.cpp:235]"}]

I found him because of playermedia's SWIG not catch thrown error(in pjsua2_wrap.cxx file) image image

haozxuan commented 1 year ago

@sauwming Thank you very much for your attention,I have an idea, whether it is a deadlock or a crash, it is because I have not handled the user-side hang-up behavior well, because I cannot handle both the audio playback completion scene and the user-side hang-up scene in OnEof2, so I Made some transformations; through the golang channel, the logic in the callback event is extracted and processed in other coroutines, so that both can be taken into account. I have released this transformation online, and it looks very good now. I will Keep tracking for a while to make sure it's working

sauwming commented 1 year ago

Glad to hear that you manage to resolve the problem. Let us know if there's any issue that still needs to be addressed.