pjsip / pjproject

PJSIP project
http://www.pjsip.org
GNU General Public License v2.0
2.07k stars 786 forks source link

iocp: fix crash, GetQueuedCompletionStatus() write freed WSAOVERLAPPED memory #4136

Open jimying opened 2 weeks ago

jimying commented 2 weeks ago

try to fix issue #985

nanangizz commented 2 weeks ago

Great! I assume you've tested this and it does fix the issue :)

I think we also need to run under somekind of stress test, e.g: using ioqueue test in pjlib-test, to make sure all memory pools (of pending ops) are properly released. Note that the after an ioqueue key is unregistered, the key will be put into the closing-key-list and soon into the free-key-list to be reused by another socket. We need to make sure that all pending op has been freed before the key is freed & reused.

Next, perhaps we can apply a little bit optimization, e.g: instead of mem-pool for each pending-op, perhaps mem-pool per ioqueue-key to avoid multiple alloc+free for multiple pending-op, using same mechanism as ioqueue key (employing additional list for keeping unused pending-op instances to be reused later).

jimying commented 2 weeks ago

Note:
this only fix the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1;
PJ_IOQUEUE_HAS_SAFE_UNREG=0 still has memory error

nanangizz commented 2 weeks ago

Note: this only fix the issue when PJ_IOQUEUE_HAS_SAFE_UNREG=1; PJ_IOQUEUE_HAS_SAFE_UNREG=0 still has memory error

When PJ_IOQUEUE_HAS_SAFE_UNREG==1, the key won't be reused for 500ms (configurable via PJ_IOQUEUE_KEY_FREE_DELAY), this is a bit risky actually, e.g: on high CPU load, the cancellation may take longer?

nanangizz commented 1 week ago

Tried to run pjlib-test with this PR patch on VS2005 and pool debugging enabled (so pool uses normal malloc/free(), by setting PJ_POOL_DEBUG to 1 in config_site.h), I got an assertion:

    _wassert(const wchar_t * expr=0x004d1e90, const wchar_t * filename=0x004d1cf0, unsigned int lineno=588) Line 212    C
    pj_ioqueue_register_sock2(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, pj_grp_lock_t * grp_lock=0x00000000, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 588 + 0x2c bytes    C
    pj_ioqueue_register_sock(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 665 + 0x1f bytes  C
    unregister_test(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 554 + 0x23 bytes    C
    udp_ioqueue_test_imp(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 1190 + 0x9 bytes   C
    udp_ioqueue_test() Line 1255 + 0x9 bytes    C
    test_inner() Line 171 + 0x2a bytes  C
    test_main() Line 245 + 0x5 bytes    C

Not sure if this is the same issue, but this assertion does not happen when using ioqueue select.

jimying commented 1 week ago

@nanangizz no this patch, Is there this assert?

nanangizz commented 1 week ago

@nanangizz no this patch, Is there this assert?

Yes, same assert without this patch.

jimying commented 1 week ago

Tried to run pjlib-test with this PR patch on VS2005 and pool debugging enabled (so pool uses normal malloc/free(), by setting PJ_POOL_DEBUG to 1 in config_site.h), I got an assertion:

  _wassert(const wchar_t * expr=0x004d1e90, const wchar_t * filename=0x004d1cf0, unsigned int lineno=588) Line 212    C
  pj_ioqueue_register_sock2(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, pj_grp_lock_t * grp_lock=0x00000000, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 588 + 0x2c bytes    C
  pj_ioqueue_register_sock(pj_pool_t * pool=0x023a29b8, pj_ioqueue_t * ioqueue=0x023a2a2c, long sock=388, void * user_data=0x00000002, const pj_ioqueue_callback * cb=0x0019fb54, pj_ioqueue_key_t * * key=0x0019fbec) Line 665 + 0x1f bytes  C
  unregister_test(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 554 + 0x23 bytes    C
  udp_ioqueue_test_imp(const pj_ioqueue_cfg * cfg=0x0019fd34) Line 1190 + 0x9 bytes   C
  udp_ioqueue_test() Line 1255 + 0x9 bytes    C
  test_inner() Line 171 + 0x2a bytes  C
  test_main() Line 245 + 0x5 bytes    C

Not sure if this is the same issue, but this assertion does not happen when using ioqueue select.

I found the reason: key double unregister.
Fixed: when key->closing = 1, not unregister. now test passed.

nanangizz commented 3 days ago

Thanks @jimying .

Honestly I haven't got a chance to reproduce the original issue and test the proposed solution. I believe you are using this ioqueue in real world, experienced the issue, and find this solution does work, is that correct?

Next, here are few notes about the proposed solution:

Also, this ioqueue has been disabled for quite sometime and some improvement in the ioqueue area may not be integrated into this ioqueue, e.g: group lock for key. So please understand that there may still be some steps required to enable this ioqueue again :)

jimying commented 1 day ago

@nanangizz i write a simple demo to reproduce the crash issue in msys2, #4172

I have tested it, in old code, it can 100% reproduce the crash.

To test new code we can git cherry-pick the demo patch to this branch.