noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io

Enhance TCP Socket handling #6472

Closed - motorman-ibm closed this issue 6 days ago

motorman-ibm commented 3 years ago

More information - Screenshots / Logs / Other output

Follow-up on a Slack discussion with Guy (pasting the relevant parts here)

Basic issue: TCP stack handling in the current NooBaa implementation with Node.js limits the processing rate of a single NooBaa endpoint to roughly 4 GB/sec. It would be better if a single endpoint could run faster and saturate the physical node; that would reduce the redundant, independent cache pools (like the "ls" cache) that each of the multiple endpoints on a node currently keeps.

  1. worker threads are inside the GPFS stack all the time, so that's good

  2. But the main event loop thread is busy with the TCP stack all the time:

    $ node ../perf-report-noobaa.js perf-script.traces -t=37894
    Options: { traces_file: 'perf-script.traces', verbose: false, tid: '37894' }
    [93.9%] TCP
    - [53.9%] copy_user_enhanced_fast_string | copyin | _copy_from_iter_full | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | new_sync_write | vfs_write
    - [ 1.6%] skb_release_data | __kfree_skb | tcp_clean_rtx_queue | tcp_ack | tcp_rcv_established | tcp_v4_do_rcv | __release_sock | __sk_flush_backlog | tcp_sendmsg_locked
    - [ 1.6%] nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_receive_skb_internal |
    - [ 1.5%] tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | new_sync_write | vfs_write | ksys_write | do_syscall_64 | entry_SYSCALL_64_after_hwframe | _
    - [ 1.2%] pskb_expand_head | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_receive_skb_internal | br_pass_frame_up | br_handle_fr
    - [ 1.1%] _raw_spin_lock | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_
    - [ 0.8%] __list_del_entry_valid | get_page_from_freelist | __alloc_pages_nodemask | skb_page_frag_refill | sk_page_frag_refill | tcp_sendmsg_locked | tcp_sendmsg | sock_
    - [ 0.8%] __nf_conntrack_find_get | nf_conntrack_in | nf_hook_slow | br_nf_pre_routing | nf_hook_slow | br_handle_frame | __netif_receive_skb_core | process_backlog | net
    - [ 0.7%] get_page_from_freelist | __alloc_pages_nodemask | skb_page_frag_refill | sk_page_frag_refill | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter
    - [ 0.7%] fib_table_lookup | __fib_validate_source | fib_validate_source | ip_route_input_slow | ip_route_input_rcu | ip_route_input_noref | ip_rcv_finish | ip_sabotage_i
    - [ 0.7%] __free_pages_ok | skb_release_data | __kfree_skb | tcp_clean_rtx_queue | tcp_ack | tcp_rcv_established | tcp_v4_do_rcv | __release_sock | __sk_flush_backlog | t
    - [ 0.7%] nft_immediate_eval | nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_re
    - [ 0.6%] fib_table_lookup | ip_route_input_slow | ip_route_input_rcu | ip_route_input_noref | ip_rcv_finish | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_sk
    - [ 0.5%] nft_counter_eval | nft_do_chain | nft_do_chain_ipv4 | nf_hook_slow | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __netif_receive_skb_core | netif_rece
    - [ 0.5%] copy_user_enhanced_fast_string | copyin | _copy_from_iter_full | tcp_sendmsg_locked | tcp_sendmsg | sock_sendmsg | sock_write_iter | do_iter_readv_writev | do_i
    [ 2.6%] INFINIBAND
    - [ 1.3%] mlx5e_sq_xmit | mlx5e_xmit | dev_hard_start_xmit | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_sl
    - [ 0.3%] mlx5e_xmit | dev_hard_start_xmit | sch_direct_xmit | __dev_queue_xmit | ip_finish_output2 | ip_output | ip_forward | ip_sabotage_in | nf_hook_slow | ip_rcv | __
    [ 2.4%] NODEJS-V8
    - [ 0.1%] [unknown] | Builtins_AsyncFunctionAwaitResolveClosure | Builtins_PromiseFulfillReactionJob | Builtins_RunMicrotasks | Builtins_JSRunMicrotasksEntry | v8::intern
    [ 1.0%] OTHER
    - [ 0.3%] __x86_indirect_thunk_rax | __pthread_disable_asynccancel | [unknown]
    [ 0.0%] VFS
    - [ 0.0%] __fsnotify_parent | vfs_write | ksys_write | do_syscall_64 | entry_SYSCALL_64_after_hwframe | __pthread_disable_asynccancel | [unknown]
    [ 0.0%] PERF-EVENTS
    - [ 0.0%] native_write_msr | __intel_pmu_enable_all.constprop.25 | event_function | remote_function | flush_smp_call_function_queue | smp_call_function_single_interrupt |

    Per our discussion, the Node.js cluster module is not a viable solution: since it forks, it has the same issue as multiple endpoints.
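
    For context, a minimal sketch of why forking duplicates the per-process caches instead of sharing them (the in-memory "ls"-style cache below is a made-up stand-in, not actual NooBaa code):

        'use strict';
        // Sketch only: each cluster worker is a separate OS process, so an
        // in-memory cache like the "ls" cache is duplicated per worker
        // rather than shared across them.
        const cluster = require('cluster');
        const os = require('os');

        const ls_cache = new Map(); // every process gets its own independent copy

        if (cluster.isPrimary) { // cluster.isMaster on older Node versions
            for (let i = 0; i < os.cpus().length; i++) {
                cluster.fork();
            }
        } else {
            // Each worker fills only its own copy of the cache; nothing is shared.
            ls_cache.set('bucket1/', ['obj1', 'obj2']);
            console.log(`worker ${process.pid} cache entries: ${ls_cache.size}`);
            process.exit(0);
        }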

    From Guy -> What I'm thinking is that instead of using the cluster module, which hands over the entire request to the worker and therefore cannot reuse the caches, we can selectively do the hand-over on read/write flows once we start streaming the data. Passing an HTTP/TCP socket to a child process is simple (see the Node.js docs, child_process_example_sending_a_socket_object), so we can just execute the NSFS read-loop and write-loop in a worker that will offload the TCP work from the main thread.
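
    A minimal sketch of that hand-over, following the child_process "sending a socket object" pattern from the Node.js docs; the self-fork, port 6001 and the echo loop are illustrative stand-ins for the NSFS read/write loops, not NooBaa code:

        'use strict';
        const net = require('net');
        const { fork } = require('child_process');

        if (process.send) {
            // Child process: receives the socket handle and owns the TCP work from here on.
            process.on('message', (msg, socket) => {
                if (msg === 'socket' && socket) {
                    socket.pipe(socket); // placeholder for the data read/write loop
                }
            });
        } else {
            // Parent (main event loop): accepts the connection, then hands the
            // handle off immediately so the TCP copies do not run on this thread.
            const child = fork(__filename);
            const server = net.createServer({ pauseOnConnect: true }, (socket) => {
                child.send('socket', socket);
            });
            server.listen(6001);
        }

    pauseOnConnect keeps the parent from reading any bytes before the child picks up the handle, so the data path moves over cleanly.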

    This will probably require us to boost the number of worker threads, but that will need empirical testing after this change is implemented.

motorman-ibm commented 2 years ago

This should move to enhancement. You (as a team) can also choose to close it. My understanding is that you run fast enough per pod on a performance system. If you ever end up having to look at a per-pod performance bottleneck, I think it is better to have left this somewhere, just so we document a possible solution.

github-actions[bot] commented 1 month ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.

github-actions[bot] commented 6 days ago

This issue is stale and had no activity for too long - it will now be closed.