Open richardstone opened 2 years ago
@richardstone Could you enable trace logs in the forwarder https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/forwarder-vpp/forwarder.yaml#L29-L30? This can super help us to figure out what is going on.
Could also you point out what applications are used? (Is it just https://github.com/networkservicemesh/cmd-nse-icmp-responder and https://github.com/networkservicemesh/cmd-nsc or something else?)
@denis-tingaikin I've used https://github.com/networkservicemesh/cmd-nse-remote-vlan and https://github.com/networkservicemesh/cmd-nsc
I'll try to reproduce with trace logs, but as this is an intermittent issue, I'm not sure when will that be..
@edwarnicke , @ljkiraly , @richardstone I think the best solution here is to try to add testing for VLAN remote case on CI. CI is running every day for a few times, so we'll catch any rare issue in some time :)
Hello @denis-tingaikin, @edwarnicke,
I added more remote vlan mechanism related TCs in pull request https://github.com/networkservicemesh/deployment_k8s#4465 Also contains some fix for the failing test issue https://github.com/networkservicemesh/integration-k8s-kind/pull/584#issuecomment-1036579012 So the automation can be added after the deployment_k8s#4465 was merged.
Laszlo
This issues is still popping up from time to time, it happened again last night.
I suspect VPP does not update the timestamp as logs from VPP looks to be having the very same timestamp for all lines it prints. In case of the forwarder where the segfault appeared, the pod was alive from Nov 2 20:45:00.588
until Nov 2 21:13:02.651
, but this is all of the logs from VPP:
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-core' - intel_core_init: not a IA-32 CPU
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:609: Could not initialize TLS CA certificates
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_mbedtls_init:644: failed to initialize TLS CA chain
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:976: Could not initialize TLS CA certificates
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_openssl_init:1050: failed to initialize TLS CA chain
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vl_api_want_interface_events_t_handler:1635: pid 1: already enabled...
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:103: unconfiguring policies for if 5 deleted
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:110: error deleting caiop (output): -2
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:116: error deleting caiop (input): -2
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: received signal SIGSEGV, PC 0x7f7bc9687129, faulting address 0x4
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #0 0x00007f7bc905bf20 unix_signal_handler + 0x150
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #1 0x00007f7bc898a8c0 0x7f7bc898a8c0
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #2 0x00007f7bc9687129 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #3 0x00007f7bca357ae2 virtio_input_node_fn_hsw + 0x38352
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #4 0x00007f7bc8ffa35f vlib_main_loop + 0x108f
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #5 0x00007f7bc8ff9272 vlib_main + 0x582
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #6 0x00007f7bc905b8e6 thread0 + 0x26
Nov 2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #7 0x00007f7bc851c294 0x7f7bc851c294
@edwarnicke Should I open a ticket for the VPP timestamp issue?
Forwarder unexpectedly restarted with a segfault when then endpoint got removed.
The issue is rarely reproducible..
NSM component versions are v1.2.0-rc.1
forwarder-vpp-5r59k-forwarder-vpp-prev.txt