networkservicemesh / cmd-forwarder-vpp

Apache License 2.0
2 stars 22 forks source link

Segfault in forwarder on endpoint uninstall #504

Open richardstone opened 2 years ago

richardstone commented 2 years ago

Forwarder unexpectedly restarted with a segfault when then endpoint got removed.

The issue is rarely reproducible..

Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: received signal SIGSEGV, PC 0x7f274a25b520, faulting address 0x0
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #0  0x00007f2749cef640 unix_signal_handler + 0x100
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #1  0x00007f2749602050 0x7f2749602050
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #2  0x00007f274a25b520 vnet_hw_if_update_runtime_data + 0x1a30
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #3  0x00007f274a226347 vnet_delete_hw_interface + 0x87
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #4  0x00007f274a26d238 ethernet_delete_interface + 0x548
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #5  0x00007f274a7be8ed tap_delete_if + 0x13d
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #6  0x00007f274a7c3a30 vl_api_tap_delete_v2_t_handler + 0x70
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #7  0x00007f274af611a1 vl_msg_api_handler_no_free + 0xd1
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #8  0x00007f274af399a5 vl_socket_process_api_msg + 0x15
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #9  0x00007f274af437cb vl_api_clnt_process + 0xdeb
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #10 0x00007f2749c96966 vlib_process_bootstrap + 0x16
Feb 10 20:09:19.978 [INFO] [cmd:vpp] vpp[14]: #11 0x00007f27491d51c8 0x7f27491d51c8
Feb 10 20:17:13.298 [ERRO] [cmd:/bin/forwarder] signal: aborted
Feb 10 20:17:13.299 [ERRO] [cmd:/bin/forwarder] %!s(<nil>)

NSM component versions are v1.2.0-rc.1

forwarder-vpp-5r59k-forwarder-vpp-prev.txt

denis-tingaikin commented 2 years ago

@richardstone Could you enable trace logs in the forwarder https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/forwarder-vpp/forwarder.yaml#L29-L30? This can super help us to figure out what is going on.

Could also you point out what applications are used? (Is it just https://github.com/networkservicemesh/cmd-nse-icmp-responder and https://github.com/networkservicemesh/cmd-nsc or something else?)

richardstone commented 2 years ago

@denis-tingaikin I've used https://github.com/networkservicemesh/cmd-nse-remote-vlan and https://github.com/networkservicemesh/cmd-nsc

I'll try to reproduce with trace logs, but as this is an intermittent issue, I'm not sure when will that be..

denis-tingaikin commented 2 years ago

@edwarnicke , @ljkiraly , @richardstone I think the best solution here is to try to add testing for VLAN remote case on CI. CI is running every day for a few times, so we'll catch any rare issue in some time :)

ljkiraly commented 2 years ago

Hello @denis-tingaikin, @edwarnicke,

I added more remote vlan mechanism related TCs in pull request https://github.com/networkservicemesh/deployment_k8s#4465 Also contains some fix for the failing test issue https://github.com/networkservicemesh/integration-k8s-kind/pull/584#issuecomment-1036579012 So the automation can be added after the deployment_k8s#4465 was merged.

Laszlo

richardstone commented 2 years ago

This issues is still popping up from time to time, it happened again last night.

I suspect VPP does not update the timestamp as logs from VPP looks to be having the very same timestamp for all lines it prints. In case of the forwarder where the segfault appeared, the pod was alive from Nov 2 20:45:00.588 until Nov 2 21:13:02.651, but this is all of the logs from VPP:

Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-core' - intel_core_init: not a IA-32 CPU
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:609: Could not initialize TLS CA certificates
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_mbedtls_init:644: failed to initialize TLS CA chain
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:976: Could not initialize TLS CA certificates
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_openssl_init:1050: failed to initialize TLS CA chain
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vl_api_want_interface_events_t_handler:1635: pid 1: already enabled...
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:103: unconfiguring policies for if 5 deleted
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:110: error deleting caiop (output): -2
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:116: error deleting caiop (input): -2
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: received signal SIGSEGV, PC 0x7f7bc9687129, faulting address 0x4
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #0  0x00007f7bc905bf20 unix_signal_handler + 0x150
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #1  0x00007f7bc898a8c0 0x7f7bc898a8c0
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #2  0x00007f7bc9687129 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #3  0x00007f7bca357ae2 virtio_input_node_fn_hsw + 0x38352
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #4  0x00007f7bc8ffa35f vlib_main_loop + 0x108f
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #5  0x00007f7bc8ff9272 vlib_main + 0x582
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #6  0x00007f7bc905b8e6 thread0 + 0x26
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #7  0x00007f7bc851c294 0x7f7bc851c294

@edwarnicke Should I open a ticket for the VPP timestamp issue?