skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0

thread leak from qdrouterd on shutdown, reported from system_tests_one_router #116

Closed: jiridanek closed this issue 2 years ago

jiridanek commented 2 years ago

https://github.com/skupperproject/skupper-router/runs/5406854344?check_suite_focus=true#step:27:807

18: E           WARNING: ThreadSanitizer: thread leak (pid=1676)
18: E             Thread T3 (tid=1687, finished) created by main thread at:
18: E               #0 pthread_create <null> (libtsan.so.0+0x5bef5)
18: E               #1 sys_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:181 (qdrouterd+0x47a7bc)
18: E               #2 qd_server_run /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1489 (qdrouterd+0x4c2eac)
18: E               #3 main_process /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:105 (qdrouterd+0x423ccc)
18: E               #4 main /home/runner/work/skupper-router/skupper-router/skupper-router/router/src/main.c:359 (qdrouterd+0x42318c)
18: E           
18: E             And 1 more similar thread leaks.
18: E           
18: E           SUMMARY: ThreadSanitizer: thread leak (/lib64/libtsan.so.0+0x5bef5) in pthread_create
18: E           ==================
18: E           ThreadSanitizer: reported 1 warnings

I don't rightly see how that could possibly happen; every thread created in qd_server_run is joined and freed:

    for (i = 0; i < n; i++) {
        threads[i] = sys_thread(thread_run, qd_server);
    }
    thread_run(qd_server);      /* Use the current thread */
    for (i = 0; i < n; i++) {
        sys_thread_join(threads[i]);
        sys_thread_free(threads[i]);
    }
    free(threads);
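
For context on what the warning means (not an explanation of this particular failure): ThreadSanitizer reports a "thread leak" for any joinable thread that was never pthread_join()ed or detached by the time the process exits, even if the thread function has already returned, which is what the "(tid=1687, finished)" annotation above indicates. A minimal standalone illustration, not router code:

/* leak.c: build with  gcc -g -fsanitize=thread -pthread leak.c -o leak
 * Running ./leak prints a "ThreadSanitizer: thread leak" warning. */
#include <pthread.h>
#include <unistd.h>

static void *worker(void *arg)
{
    return arg;                 /* the thread finishes almost immediately */
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    sleep(1);                   /* give the worker time to finish... */
    return 0;                   /* ...but exit without pthread_join()ing it */
}

So the report above would be consistent with the router exiting through some path that skips the join loop, rather than with sys_thread_join() itself misbehaving.
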
jiridanek commented 2 years ago

The tail of the router log says:

2022-03-03 12:32:07.666918 +0000 AGENT (info) Activating management agent on $_management_internal (/home/runner/work/skupper-router/skupper-router/skupper-router/python/qpid_dispatch_internal/management/agent.py:829)
2022-03-03 12:32:07.667154 +0000 ROUTER_CORE (trace) Core action 'subscribe' (/home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core_thread.c:235)
2022-03-03 12:32:07.667229 +0000 ROUTER_CORE (info) In-process subscription L/$_management_internal (/home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/route_tables.c:646)
2022-03-03 12:32:07.669081 +0000 POLICY (info) Policy configured maxConnections: 65535, policyDir: '',access rules enabled: 'false', use hostname patterns: 'false' (/home/runner/work/skupper-router/skupper-router/skupper-router/src/policy.c:166)
2022-03-03 12:32:07.669316 +0000 AGENT (debug) Add entity: PolicyEntity(defaultVhost=$default, enableVhostNamePatterns=False, enableVhostPolicy=False, identity=policy/None, maxConnections=65535, maxMessageSize=0, name=policy/None, policyDir=, type=org.apache.qpid.dispatch.policy) (/home/runner/work/skupper-router/skupper-router/skupper-router/python/qpid_dispatch_internal/management/agent.py:602)
2022-03-03 12:32:07.676338 +0000 POLICY (info) Policy fallback defaultVhost is defined: '$default' (/home/runner/work/skupper-router/skupper-router/skupper-router/python/qpid_dispatch_internal/policy/policy_local.py:743)
2022-03-03 12:32:07.677479 +0000 CONN_MGR (info) Configured Listener: 0.0.0.0:24331 proto=any, role=normal (/home/runner/work/skupper-router/skupper-router/skupper-router/src/connection_manager.c:554)
2022-03-03 12:32:07.678271 +0000 AGENT (debug) Add entity: ListenerEntity(authenticatePeer=False, cost=1, healthz=True, host=0.0.0.0, http=False, identity=listener/0.0.0.0:24331, idleTimeoutSeconds=120, initialHandshakeTimeoutSeconds=0, maxFrameSize=16384, maxSessions=32768, messageLoggingComponents=none, metrics=True, name=listener/0.0.0.0:24331, port=24331, requireEncryption=False, requireSsl=False, role=normal, saslMechanisms=ANONYMOUS, stripAnnotations=both, type=org.apache.qpid.dispatch.listener, websockets=True) (/home/runner/work/skupper-router/skupper-router/skupper-router/python/qpid_dispatch_internal/management/agent.py:602)
2022-03-03 12:32:07.686990 +0000 SERVER (notice) Operational, 4 Threads Running (process ID 1676) (/home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1467)
2022-03-03 12:32:07.687307 +0000 SERVER (notice) Process VmSize 124.50 TiB (6.78 GiB available memory) (/home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:1478)
2022-03-03 12:32:07.695353 +0000 SERVER (error) Listener error on 0.0.0.0:24331: Address already in use - listen on 0.0.0.0:24331 (proton:io) (/home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:838)
2022-03-03 12:32:07.695458 +0000 SERVER (critical) Shutting down, required listener failed 0.0.0.0:24331 (/home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:842)
jiridanek commented 2 years ago

I'd call this two minor issues: the thread leak itself, and the port clash that forced the early shutdown.

Since the thread leak happens only during shutdown, it has no impact on production usage.

jiridanek commented 2 years ago

Regarding the port clash: notice that test 16 (system_tests_drain) was running at the same time as test 18 (system_tests_one_router). Both tests randomly picked port 24331 and tried to bind to it, with test 16 winning the race.

Possible solutions:

  1. Use some sort of isolation, probably Linux network namespaces, e.g. via bwrap, which the Ubuntu GHA job already uses (this would be harder to do inside a container)
  2. Widen the range that ports are picked from; it is currently port_range = (20000, 30000)
  3. Establish some sort of machine-wide locking for port allocations that does not suffer from races. Some ideas:
    1. https://github.com/google/python_portpicker (with its portserver)
    2. some custom cross-process synchronization: the multiprocessing module (shared memory, locks), fcntl file locking (a rough sketch of this follows below), or ctypes to call shm_open, ...

There are probably other suggestions like this in the old DISPATCH Jira issues about this problem.
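
As a rough illustration of the file-locking idea from point 3.2 (a sketch only; the lock-file path and the function name are made up, and the real test harness is Python, where the same flock(2) pattern is available through the fcntl module):

/* Hypothetical sketch of a machine-wide, race-free port claim: pair a
 * per-port lock file with a bind() check.  Nothing here is part of the
 * router or its test suite. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a lock-file fd (>= 0) if 'port' was claimed, -1 otherwise.
 * The caller keeps the fd open for as long as it owns the port; the
 * flock is released automatically when the process exits. */
static int claim_port(int port)
{
    char path[64];
    snprintf(path, sizeof path, "/tmp/skrouter-test-port-%d.lock", port);

    int lock_fd = open(path, O_CREAT | O_RDWR, 0666);
    if (lock_fd < 0)
        return -1;

    /* Exclusive and non-blocking: if another cooperating test already
     * claimed this port, back off and let the caller pick another one. */
    if (flock(lock_fd, LOCK_EX | LOCK_NB) != 0) {
        close(lock_fd);
        return -1;
    }

    /* The bind() check guards against unrelated processes that do not
     * take part in the locking scheme; the flock is what removes the
     * race between concurrently running tests. */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(port);

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0 || bind(sock, (struct sockaddr *) &addr, sizeof addr) != 0) {
        if (sock >= 0)
            close(sock);
        close(lock_fd);
        return -1;
    }
    close(sock);        /* the router under test will bind the port for real */

    return lock_fd;     /* hold this fd until the test is done with the port */
}

A caller would loop over candidate ports from port_range until claim_port() succeeds; the Python harness could do the equivalent with fcntl.flock() on the same kind of per-port lock file.
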

jiridanek commented 2 years ago

This looks related: https://github.com/skupperproject/skupper-router/issues/128. It looks like HTTP and AMQP(?) traffic somehow got confused; possibly the problem is caused by wrongly allocated port numbers somewhere.

jiridanek commented 2 years ago

There should no longer be port conflicts, thanks to #123. There is probably nothing that can easily be added to ensure a test is talking to the correct router, beyond making sure to avoid those port conflicts. The leak itself is filed as https://github.com/skupperproject/skupper-router/issues/156. Closing the issue.