ironcore-dev / dpservice

DPDK based fast Dataplane / L3 router / SDN enabler, installable on compute nodes / SmartNICs
Apache License 2.0

Multicore support #123

Open PlagueCZ opened 1 year ago

PlagueCZ commented 1 year ago

Summary

This is a future-improvement feature. dp-service does not work properly with multiple worker cores.

For now, this issue serves as a place to collect findings for later use. No bugfixing or investigation is needed at this time.

PlagueCZ commented 1 year ago

Once there is traffic, the graph gets corrupted. See rte_graph_walk() in DPDK's rte_graph_worker.h: in its main loop, the graph ends up in a state where the node is invalid (the fence check fails, etc.)

If I understand it correctly, the entry-level (source) graph nodes are stored at negative indices relative to cir_start. Then, when those nodes send a packet to another node, that node gets added at a positive index.
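To make that indexing concrete, here is a toy single-threaded model of the walk as I understand it. It is not DPDK code: toy_graph, toy_enqueue(), toy_process(), toy_graph_walk() and toy_demo() are made-up names, and the ring size, the two rx nodes and the rx -> cls -> drop path are assumptions for illustration only. Note that nothing in it is protected by locks or atomics.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NB_SOURCES 2               /* assumption: two source (rx) nodes */
#define CIR_SIZE   8               /* assumption: ring size, power of two */
#define CIR_MASK   (CIR_SIZE - 1)

enum { RX0, RX1, CLS, DROP, NB_NODES };

/* Toy stand-in for the graph: source nodes live in front of cir_start
 * (negative indices), pending nodes are appended at [0, CIR_SIZE). */
struct toy_graph {
    int storage[NB_SOURCES + CIR_SIZE];
    int *cir_start;                /* points at storage[NB_SOURCES] */
    int32_t head;                  /* starts at -NB_SOURCES */
    uint32_t tail;                 /* next free positive slot */
    int pending[NB_NODES];         /* packets waiting per node */
};

void toy_graph_init(struct toy_graph *g)
{
    memset(g, 0, sizeof(*g));
    g->cir_start = &g->storage[NB_SOURCES];
    g->cir_start[-2] = RX0;
    g->cir_start[-1] = RX1;
    g->head = -NB_SOURCES;
    g->tail = 0;
}

/* A node is appended to the ring only when it had nothing pending. */
void toy_enqueue(struct toy_graph *g, int node, int pkts)
{
    assert(node >= 0 && node < NB_NODES);
    if (g->pending[node] == 0) {
        g->cir_start[g->tail] = node;
        g->tail = (g->tail + 1) & CIR_MASK;
    }
    g->pending[node] += pkts;
}

void toy_process(struct toy_graph *g, int node)
{
    int pkts = g->pending[node];

    g->pending[node] = 0;
    if (node == RX0 || node == RX1)
        toy_enqueue(g, CLS, 1);    /* each rx node receives a burst of 1 */
    else if (node == CLS)
        toy_enqueue(g, DROP, pkts);
    /* DROP: packets are freed, nothing is enqueued */
}

/* Walk from the negative (source) indices up to tail, then reset tail. */
int toy_graph_walk(struct toy_graph *g, int *order, int max)
{
    int32_t head = g->head;
    int n = 0;

    while (head != (int32_t)g->tail) {
        int node = g->cir_start[head++];

        if (n < max)
            order[n++] = node;
        toy_process(g, node);
        /* only positive indices wrap around the ring */
        head = head > 0 ? (int32_t)((uint32_t)head & CIR_MASK) : head;
    }
    g->tail = 0;
    return n;
}

/* Runs one walk and prints the order in which nodes were visited. */
int toy_demo(void)
{
    static const char *names[NB_NODES] = { "rx0", "rx1", "cls", "drop" };
    struct toy_graph g;
    int order[16];
    int n;

    toy_graph_init(&g);
    n = toy_graph_walk(&g, order, 16);
    for (int i = 0; i < n; i++)
        printf("%s\n", names[order[i]]);
    return n;
}
```

Calling toy_demo() from a main() runs one walk. In this model head, tail and the ring slots are plain fields, so two workers walking the same toy_graph at once could overwrite each other's tail updates. Whether that is what actually happens inside dp-service is a guess on my part.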

Forgive the crude debug output, but I think it can explain a little (plaintext lines are from the node loops; address lines are from the rte_graph_walk() loop, with the node name at the end).

rx node burst 1
0x100c30dc0, head: fffffff7, tail: 1, start: 0x100c30140, mask: 3f, head: 1, cirstart: d80; 'cls' <- ''
cls node packet
rx node burst 1
0x100c32d00, head: fffffff7, tail: 2, start: 0x100c30140, mask: 3f, head: 2, cirstart: 2cc0; 'drop' <- ''
drop node packet
0x100c30dc0, head: fffffff7, tail: 3, start: 0x100c30140, mask: 3f, head: 3, cirstart: d80; 'cls' <- ''
cls node packet
0x100c32d00, head: fffffff7, tail: 4, start: 0x100c30140, mask: 3f, head: 4, cirstart: 2cc0; 'drop' <- ''
drop node packet
0x100c30dc0, head: fffffff7, tail: 3, start: 0x100c30140, mask: 3f, head: 1, cirstart: d80; 'cls' <- ''
0x100c32d00, head: fffffff7, tail: 0, start: 0x100c30140, mask: 3f, head: 2, cirstart: 2cc0; 'drop' <- ''
0x100c30dc0, head: fffffff7, tail: 1, start: 0x100c30140, mask: 3f, head: 1, cirstart: d80; 'cls' <- ''
0x100c32d00, head: fffffff7, tail: 0, start: 0x100c30140, mask: 3f, head: 2, cirstart: 2cc0; 'drop' <- ''
0x100c30dc0, head: fffffff7, tail: 0, start: 0x100c30140, mask: 3f, head: 3, cirstart: d80; 'cls' <- ''
0x100c32d00, head: fffffff7, tail: 0, start: 0x100c30140, mask: 3f, head: 4, cirstart: 2cc0; 'drop' <- ''
0x100c30040, head: fffffff7, tail: 0, start: 0x100c30140, mask: 3f, head: 5, cirstart: 0

The last line will cause a node with an invalid fence to be processed, leading to a SIGSEGV.

It seems that there are two packets (I had a VM connected that started DHCP). They go through the classify node and the drop node. But somehow they are processed in a non-thread-safe way, so an invalid record remains and then gets processed too.

guvenc commented 2 months ago

As dpservice has introduced DPDK 23.11 support, we could evaluate the graph library's multicore support (mcore dispatch), which was introduced in the preceding release, DPDK 23.07.

https://doc.dpdk.org/guides/rel_notes/release_23_07.html