ironcore-dev / dpservice

DPDK based fast Dataplane / L3 router / SDN enabler, installable on compute nodes / SmartNICs
Apache License 2.0

dpservice local test fails on one of the dell machines #583

Closed byteocean closed 2 months ago

byteocean commented 2 months ago

The dpservice local test fails on one of the Dell machines (dell-3). GDB shows the following stack trace:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x000055555597edbc in dp_ref_dec (ref=0x11fffd52c8) at ../include/dp_refcount.h:35
#2  0x000055555597f37f in dp_age_out_flow (flow_val=0x11fffd5200) at ../src/dp_flow.c:411
#3  dp_process_aged_flows_non_offload () at ../src/dp_flow.c:437
#4  0x0000555555702023 in dp_process_event_flow_aging_msg (m=0x17936a9c0) at ../src/monitoring/dp_event.c:120
#5  0x0000555555715eac in dp_process_event_msg (m=0x17936a9c0) at ../src/monitoring/dp_monitoring.c:22
#6  0x00005555558ed005 in handle_nongraph_queues () at ../src/nodes/rx_periodic_node.c:48
#7  rx_periodic_node_process (graph=0x11ffffd9c0, node=0x11ffffe380, objs=0x11fffea140, nb_objs=0) at ../src/nodes/rx_periodic_node.c:74
#8  0x00005555559d3850 in __rte_node_process (graph=0x11ffffd9c0, node=0x11ffffe380) at /usr/local/include/rte_graph_worker_common.h:186
#9  rte_graph_walk_rtc (graph=0x11ffffd9c0) at /usr/local/include/rte_graph_model_rtc.h:42
#10 0x00005555559d351f in rte_graph_walk (graph=0x11ffffd9c0) at /usr/local/include/rte_graph_worker.h:38
#11 0x00005555559d30df in graph_main_loop (arg=0x0) at ../src/dpdk_layer.c:100
#12 0x00007ffff7ae542f in eal_thread_loop () from /usr/local/lib/x86_64-linux-gnu/librte_eal.so.24
#13 0x00007ffff7afbe26 in eal_worker_thread_loop () from /usr/local/lib/x86_64-linux-gnu/librte_eal.so.24
#14 0x00007ffff70a645c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007ffff7126bbc in ?? () from /lib/x86_64-linux-gnu/libc.so.6

After a debugging session with @PlagueCZ, we found that either swapping the order of union dp_ipv6 underlay_dst and uint8_t l4_type, or making struct flow_nf_info packed, eliminates this error. The root cause still needs to be investigated.
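As a rough illustration of why reordering or packing could change the symptom, the sketch below shows how field order and packing alter the size of a struct and therefore the offset of whatever follows it. All `demo_*` names and the field set are hypothetical; the real definitions of `struct flow_nf_info` and `union dp_ipv6` are not shown in this issue, so this only demonstrates the general mechanism, not the actual dpservice layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-byte address type; the uint64_t member gives it
 * stricter alignment than a single byte on common ABIs. */
union demo_ipv6 {
    uint8_t  bytes[16];
    uint64_t words[2];
};

/* Variant A: the small field comes after the union, so padding is
 * needed both before the union and at the end of the struct. */
struct demo_info_a {
    uint8_t         nat_type;
    union demo_ipv6 underlay_dst;
    uint8_t         l4_type;
};

/* Variant B: the two small fields are adjacent and share the padding. */
struct demo_info_b {
    uint8_t         nat_type;
    uint8_t         l4_type;
    union demo_ipv6 underlay_dst;
};

/* Variant C: packed (GCC/Clang attribute), no padding at all. */
struct demo_info_packed {
    uint8_t         nat_type;
    union demo_ipv6 underlay_dst;
    uint8_t         l4_type;
} __attribute__((packed));

/* Hypothetical outer entries: a callback pointer stored after the info
 * struct moves when the info struct changes size. */
struct demo_entry_a { struct demo_info_a info; void (*release)(void *); };
struct demo_entry_b { struct demo_info_b info; void (*release)(void *); };

/* Checks the layout properties described above. */
void demo_check_layout(void)
{
    /* 1 + 16 + 1 bytes, since packing removes all padding. */
    assert(sizeof(struct demo_info_packed) == 18);
    /* On common ABIs variant A carries more padding than variant B,
     * so the trailing callback pointer sits at a larger offset. */
    assert(offsetof(struct demo_entry_a, release) >
           offsetof(struct demo_entry_b, release));
}
```

A layout change like this moves fields to different addresses, so a stray write or a stale read can hit different bytes. That can hide a memory bug without fixing it, which fits the note that the root cause was still open.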

PlagueCZ commented 2 months ago

We found out that flow_value->ref_count->release is NULL.

I suspect the problem is a double free of flows (once on LB destroy, and again on aging). If you comment out dp_grpc_impl.c:150:dp_remove_lbtarget_flows(&ipv6); the problem stops.

With dp_ref references, the release handler for flows is dp_free_flow(), which calls rte_free(). That call actually memsets the memory to zero, which I think is the source of the NULL in ref->release.
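The suspected sequence can be sketched in plain C without DPDK. Everything named `demo_*` is hypothetical and only imitates the pattern described above: a refcount with a release callback embedded in a flow entry, and a release handler whose free operation zeroes the memory. The object lives in static storage and is only memset, never really freed, so the second access stays well-defined for demonstration purposes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a refcount with a release callback. */
struct demo_ref {
    int32_t refcount;
    void (*release)(struct demo_ref *ref);
};

/* Hypothetical stand-in for a flow entry that embeds the refcount. */
struct demo_flow {
    uint8_t         payload[16];
    struct demo_ref ref;
};

static int demo_release_calls;

/* Release handler: zero the whole entry to imitate an allocator that
 * clears memory on free. Nothing is actually deallocated here. */
static void demo_free_flow(struct demo_ref *ref)
{
    struct demo_flow *flow =
        (struct demo_flow *)((char *)ref - offsetof(struct demo_flow, ref));
    demo_release_calls++;
    memset(flow, 0, sizeof(*flow));
}

static void demo_ref_init(struct demo_ref *ref,
                          void (*release)(struct demo_ref *))
{
    assert(release != NULL);
    ref->refcount = 1;
    ref->release = release;
}

/* Returns 0 normally, or -1 when the callback is already NULL, which is
 * the point where an unguarded call would jump to address 0. */
static int demo_ref_dec(struct demo_ref *ref)
{
    if (--ref->refcount > 0)
        return 0;
    if (ref->release == NULL)
        return -1;
    ref->release(ref);
    return 0;
}

/* Drops the same reference twice, as two independent cleanup paths
 * might. Returns 1 if the second drop found a NULL callback. */
int demo_double_release(void)
{
    static struct demo_flow flow;
    demo_release_calls = 0;
    demo_ref_init(&flow.ref, demo_free_flow);
    int first  = demo_ref_dec(&flow.ref);  /* first cleanup path */
    int second = demo_ref_dec(&flow.ref);  /* second cleanup path */
    return first == 0 && second == -1 && demo_release_calls == 1;
}
```

Without the NULL check in `demo_ref_dec`, the second drop would call through a zeroed function pointer, which would match frame #0 at address 0x0 in the backtrace. The guard is there only to keep the sketch safe to run; it is not a proposed fix.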

I have created a validation commit in https://github.com/ironcore-dev/dpservice/tree/fix/refcount that fires randomly on my machine.