skupperproject / skupper-router

An application-layer router for Skupper networks
https://skupper.io
Apache License 2.0
14 stars 18 forks source link

#1563 - Delete connections optionally on deletion of tcpListener/tcpConnector #1570

Closed gabordozsa closed 2 weeks ago

gabordozsa commented 1 month ago

Fixes #1563: Delete corresponding connections optionally on deletion of tcpListener and tcpConnector

New config parameters added to tcpListener and tcpConnector entities in order to make deletion of connections optional.

ganeshmurthy commented 1 month ago

Also, this PR needs to include a system test using vanflow - The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

gabordozsa commented 1 month ago

Also, this PR needs to include a system test using vanflow - The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

I'm working on this system test now (so I converted the PR to draft for now). Could you point me to an existing system test which could be a good starting point for me for this new test?

ganeshmurthy commented 1 month ago

Also, this PR needs to include a system test using vanflow - The test must start two routers, create a bunch of connections from a tcpListener on one router and then delete the tcpListener and validate that the associated connections are gone, same with tcpConnector (on the other router). The checking for the presence and deletion of a tcpConnection can be done using the vanflow API.

I'm working on this system test now (so I converted the PR to draft for now). Could you point me to an existing system test which could be a good starting point for me for this new test?

Please take a look at https://github.com/skupperproject/skupper-router/pull/1566 and see if you understand how the vanflow_snooper.py works and let me know if you have any questions

ganeshmurthy commented 1 month ago

@gabordozsa Just wanted to mention that if you are working on a github issue, please make sure you assign the issue to yourself before you start working on it. That will let other developers know who is responsible for this issue, so they will not work on that issue. Also if the issue is already assigned to someone, and you want to work on it, please contact that person to see if they are not already working on it, if not, you can that it be assigned to you. Thanks.

gabordozsa commented 1 month ago

@gabordozsa Just wanted to mention that if you are working on a github issue, please make sure you assign the issue to yourself before you start working on it. That will let other developers know who is responsible for this issue, so they will not work on that issue. Also if the issue is already assigned to someone, and you want to work on it, please contact that person to see if they are not already working on it, if not, you can that it be assigned to you. Thanks.

Sure, I will do! Thanks for the reminder and for assigning 1563 to me!

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 95.58824% with 9 lines in your changes missing coverage. Please review.

Project coverage is 62.5%. Comparing base (1b1d4b9) to head (35d4d8b). Report is 34 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1570 +/- ## ======================================= - Coverage 63.0% 62.5% -0.5% ======================================= Files 214 214 Lines 51857 52308 +451 Branches 5986 6146 +160 ======================================= + Hits 32681 32727 +46 - Misses 16966 17281 +315 - Partials 2210 2300 +90 ``` | [Flag](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | Coverage Δ | | |---|---|---| | [pysystemtests](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | `78.8% <98.1%> (+0.6%)` | :arrow_up: | | [systemtests](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | `55.5% <84.6%> (-0.7%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject#carryforward-flags-in-the-pull-request-comment) to find out more. | [Components](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/components?src=pr&el=components&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | `62.5% <68.0%> (-0.5%)` | :arrow_down: | | [systemtests](https://app.codecov.io/gh/skupperproject/skupper-router/pull/1570/components?src=pr&el=component&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=skupperproject) | `62.5% <68.0%> (-0.5%)` | :arrow_down: |
gabordozsa commented 1 month ago

@ganeshmurthy I have added a system test. I think the PR is ready for more reviews.

ganeshmurthy commented 1 month ago

This test must also include another scenario where a TLS enabled tcpListener and TLS enabled tcpConnector are deleted. You can attach an sslProfile to a tcpListener or a tcpConnector to make them do TLS. Examples of this can be found in system_tests_tcp_adaptor_tls

ganeshmurthy commented 1 month ago

To test the TLS case, you can use the EchoClientRunner to start the echo client which can be found in system_tests_tcp_adaptor. It has the test_ssl flag which can be set to True

ganeshmurthy commented 1 month ago

@gabordozsa I introduced a commit - https://github.com/skupperproject/skupper-router/commit/f1f61ee2f5cbdc41e16d45f755eec26f5125000c - where I renamed the QdManager class to SkManager so it is more in line with the actual name of the tool skmanage Please rebase with the latest main branch and use the SkManager class in your PR

ganeshmurthy commented 1 month ago

@gabordozsa This looks good to me, great work ! I want @kgiusti to look at this and once he approves it, we can push this PR to main branch. He is back on Monday !

gabordozsa commented 1 month ago

@gabordozsa This looks good to me, great work ! I want @kgiusti to look at this and once he approves it, we can push this PR to main branch. He is back on Monday !

@ganeshmurthy Thanks! I pushed another commit to take the listener/connector lock when traversing the connections list. I think this is necessary because the list might be modified by another thread if a new connection is created or an existing one is closed. Does that make sense?

ganeshmurthy commented 1 month ago

Does that make sense?

Yes. Excellent catch.

grs commented 1 month ago

Re-asking a question here from https://github.com/skupperproject/skupper/pull/1558#issuecomment-2260242047:

Is there a reason for having this defined per connector/listener rather than either being simply the default behaviour or globally configured?

gabordozsa commented 1 month ago

Re-asking a question here from skupperproject/skupper#1558 (comment):

Is there a reason for having this defined per connector/listener rather than either being simply the default behaviour or globally configured?

My understanding is that we should keep the current default behaviour for v1. As for using a global config flag, the reasoning was that it would not simplify the implementation.

ganeshmurthy commented 1 month ago

Is there a reason for having this defined per connector/listener rather than either being simply the default behaviour or globally configured?

@grs I agree. I think we can move the closeConnectionsOnDelete flag to the global router entity instead of having it in each tcpConnector/tcpListener. Is that what you had in mind ?

grs commented 1 month ago

@grs I agree. I think we can move the closeConnectionsOnDelete flag to the global router entity instead of having it in each tcpConnector/tcpListener. Is that what you had in mind ?

That seems simpler for users who want to alter the default behaviour.

ganeshmurthy commented 1 month ago

@grs I agree. I think we can move the closeConnectionsOnDelete flag to the global router entity instead of having it in each tcpConnector/tcpListener. Is that what you had in mind ?

That seems simpler for users who want to alter the default behaviour.

Agree ! I think we should modify this PR to move the flag to the router entity. @gabordozsa @kgiusti What do you think ?

gabordozsa commented 1 month ago

@grs I agree. I think we can move the closeConnectionsOnDelete flag to the global router entity instead of having it in each tcpConnector/tcpListener. Is that what you had in mind ?

That seems simpler for users who want to alter the default behaviour.

Agree ! I think we should modify this PR to move the flag to the router entity. @gabordozsa @kgiusti What do you think ?

I'm fine with that. The flag may look at bit out of place in the 'router' entity because it is only relevant to tcpListeners and tcpConnectors. But I agree that would make easier to change the default behaviour.

ganeshmurthy commented 1 month ago

@grs I agree. I think we can move the closeConnectionsOnDelete flag to the global router entity instead of having it in each tcpConnector/tcpListener. Is that what you had in mind ?

That seems simpler for users who want to alter the default behaviour.

Agree ! I think we should modify this PR to move the flag to the router entity. @gabordozsa @kgiusti What do you think ?

I'm fine with that. The flag may look at bit out of place in the 'router' entity because it is only relevant to tcpListeners and tcpConnectors. But I agree that would make easier to change the default behaviour.

@gabordozsa There are already some fields in the router entity namely defaultDistribution, dataConnectionCount etc that are not directly pertinent to the router itself. We have used the router entity in the past to add attributes that apply to the router globally.

gabordozsa commented 1 month ago

@ganeshmurthy I moved the new config flag into the 'router' entity and adjusted the system tests.

gabordozsa commented 1 month ago

I renamed the new config flag to dropTcpConnections to be aligned with https://github.com/skupperproject/skupper/pull/1564

ganeshmurthy commented 1 month ago

I see this CI error -

 /home/runner/work/skupper-router/skupper-router/skupper-router/tests/system_test.py:1288: RuntimeError
---------------------- generated xml file: /__w/skupper-router/skupper-router/skupper-router/build/tests/junitxmls/system_tests_grpc.xml -----------------------
=================================================================== short test summary info ====================================================================
ERROR ::GrpcServiceMethodsTestOverTcpTls::test_grpc_04_client_stream - RuntimeError: Errors during teardown: 

Process 3933 (name=grpc-test-router) error: returned error code -11
skrouterd -c grpc-test-router.conf -I /home/runner/work/skupper-router/skupper-router/skupper-router/python
/__w/skupper-router/skupper-router/skupper-router/build/tests/system_test.dir/tests/system_tests_grpc/GrpcServiceMethodsTestOverTcpTls/setUpClass/grpc-test-router-5.cmd
>>>>
/home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:338:5: runtime error: member access within null pointer of type 'struct qdr_connection_t'
    #0 0x66813f in qdr_core_close_connection /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/connections.c:338
    #1 0x5146f6 in qd_dispatch_delete_tcp_listener /home/runner/work/skupper-router/skupper-router/skupper-router/src/adaptors/tcp/tcp_adaptor.c:2397
    #2 0x7f54d3082055 in ffi_call_unix64 (/lib64/libffi.so.8+0x9055) (BuildId: a190bf03e644181cadab122962ab83ae96271696)
    #3 0x7f54d307e69f in ffi_call_int.lto_priv.0 (/lib64/libffi.so.8+0x569f) (BuildId: a190bf03e644181cadab122962ab83ae96271696)
    #4 0x7f54d30814ed in ffi_call (/lib64/libffi.so.8+0x84ed) (BuildId: a190bf03e644181cadab122962ab83ae96271696)
    #5 0x7f54d30992a7 in _ctypes_callproc (/usr/lib64/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so+0x102a7) (BuildId: e105d1b113f81c5a55ea7e4b1def4168d56667da)
    #6 0x7f54d3098c80 in PyCFuncPtr_call (/usr/lib64/python3.11/lib-dynload/_ctypes.cpython-311-x86_64-linux-gnu.so+0xfc80) (BuildId: e105d1b113f81c5a55ea7e4b1def4168d56667da)
    #7 0x7f54d555a972 in _PyObject_MakeTpCall (/lib64/libpython3.11.so.1.0+0x131972) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #8 0x7f54d55630fe in _PyEval_EvalFrameDefault (/lib64/libpython3.11.so.1.0+0x13a0fe) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #9 0x7f54d555f44e in _PyEval_Vector (/lib64/libpython3.11.so.1.0+0x13644e) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #10 0x7f54d5597a8a in method_vectorcall (/lib64/libpython3.11.so.1.0+0x16ea8a) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #11 0x7f54d5557ba6 in _PyObject_CallFunctionVa (/lib64/libpython3.11.so.1.0+0x12eba6) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #12 0x7f54d5557adf in PyObject_CallFunction (/lib64/libpython3.11.so.1.0+0x12eadf) (BuildId: dc5b8f1564a7d29af984bfb632f884ae4c52988d)
    #13 0x6165c5 in qd_io_rx_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/python_embedded.c:658
    #14 0x6bfbd2 in qdr_forward_on_message /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/forwarder.c:375
    #15 0x6eeac0 in qdr_general_handler /home/runner/work/skupper-router/skupper-router/skupper-router/src/router_core/router_core.c:1031
    #16 0x7572e2 in qd_timer_visit /home/runner/work/skupper-router/skupper-router/skupper-router/src/timer.c:317
    #17 0x752288 in handle_proactor_other_event /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:142
    #18 0x7524bf in proactor_thread /home/runner/work/skupper-router/skupper-router/skupper-router/src/server.c:194
    #19 0x6124b5 in _thread_init /home/runner/work/skupper-router/skupper-router/skupper-router/src/posix/threading.c:207
    #20 0x7f54d595af95 in asan_thread_start(void*) (/lib64/libasan.so.8+0x5df95) (BuildId: 79824421bd82bb3ef4addf048e1265e2a93cfc64)
    #21 0x7f54d4494506 in start_thread (/lib64/libc.so.6+0x97506) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)
    #22 0x7f54d451840b in clone3 (/lib64/libc.so.6+0x11b40b) (BuildId: 8f53abaad945a669f2bdcd25f471d80e077568ef)

https://github.com/skupperproject/skupper-router/actions/runs/10254149016/job/28368308564?pr=1570#step:34:5924

ganeshmurthy commented 1 month ago

@gabordozsa there seem to be some CI failures in this PR like this one - https://github.com/skupperproject/skupper-router/actions/runs/10354198908/job/28658907680?pr=1570

I have kicked off the main branch CI just to make sure that it is clean, let's see what happens on main branch CI.

ganeshmurthy commented 1 month ago

@gabordozsa The main CI seems fine

I see this in the PR CI - https://github.com/skupperproject/skupper-router/actions/runs/10354198908/job/28658907680?pr=1570#step:39:600

ganeshmurthy commented 1 month ago

Leak of qd_tls_domain_t object https://github.com/skupperproject/skupper-router/actions/runs/10355435503/job/28663084819?pr=1570#step:34:5092

gabordozsa commented 1 month ago

Leak of qd_tls_domain_t object https://github.com/skupperproject/skupper-router/actions/runs/10355435503/job/28663084819?pr=1570#step:34:5092

@ganeshmurthy In the same error log I can see (https://github.com/skupperproject/skupper-router/actions/runs/10355435503/job/28663084819?pr=1570#step:34:4777):

alloc.c: Items of type 'qd_tcp_connection_t' remain allocated at shutdown: 1
Leak: 2024-08-12 16:14:00.677874 +0000 type: qd_tcp_connection_t address: 0x5160001f9610

I'm wondering if the problem could be in ADAPTOR_final() https://github.com/gabordozsa/skupper-router/blob/2c3629b9e17b5a9ce0bb7591d6d9e2dd9b715a52/src/adaptors/tcp/tcp_adaptor.c#L2648. ADAPTOR_final() iterates through the connections list and trigger closing them. In the main branch every accepted connection is added to the connections list. However, we delay adding it to the list now (until the link is setup). So if a new connection is accepted but ADAPTOR_final() is called before the link is setup for it then this conn can leak. Does that make sense?

ganeshmurthy commented 1 month ago

However, we delay adding it to the list now (until the link is setup). So if a new connection is accepted but ADAPTOR_final() is called before the link is setup for it then this conn can leak. Does that make sense?

Good catch @gabordozsa . It looks like the on_accept() was called and the connection was created there but before we could get a PN_RAW_CONNECTION_CONNECTED event from proton, the router seems to be shutting down which leaves the newly created connection (which is not yet in the list) to leak.

If you ask me, this should not really be a problem. The router is anyway shutting down, who cares about the leak ? But unfortunately, this is how we do things in the router.

gabordozsa commented 2 weeks ago

Closing this as code got merged via https://github.com/skupperproject/skupper-router/pull/1601