ironcore-dev / dpservice

DPDK based fast Dataplane / L3 router / SDN enabler, installable on compute nodes / SmartNICs
Apache License 2.0

Introduce multiport-eswitch mode to support continuous communication #589

Closed byteocean closed 1 month ago

byteocean commented 2 months ago

This PR introduces multiport-eswitch mode to support continuous communication when one of the uplinks is down, for example during maintenance. It includes:

1) a modified prepare.sh script that enables multiport-eswitch mode on the host;
2) proper port initialisation using the async template approach, which supports the multiport-eswitch configuration;
3) a workaround that mitigates failed pf1 communication once this mode is enabled;
4) a virtualservice implementation that uses the async template method, plus code enhancements for the above features (@PlagueCZ).

To configure the host and generate a new dp_service.conf file with the prepare.sh script, run ./prepare.sh --multiport-eswitch --pf1-proxy.

To run dpservice in multiport-eswitch mode with the workaround, run sudo ./build/src/dpservice-bin -l 0,1 -n 2 -r 2 -- --no-stats --no-offload --multiport-eswitch

A branch that also contains tap-based tests is available, but it needs further discussion if the tap-based workaround has to stay longer.

PlagueCZ commented 1 month ago

Thanks Tao and Jay. I tested it functionally and it looks good, but I have some comments, and the following happened to me:

If I run the standard prepare.sh with proxy_pf1 and multiport-eswitch enabled, and then run dpservice-bin with the generated config, I get this (my firmware setting allows a maximum of 126 VFs):

E SERVICE: Port id too high for Rx nodes, value: 128, max: 128 [../src/nodes/rx_node.c:37:rx_node_create()]

and dpservice exits.

Oh, that would have bitten me in our lab testing. I forgot to test with a high number of VFs.

This is because the TAP port is not counted the same way as the others. That was a deliberate choice by Tao, but it also made sense to me, as we handle it differently. As a quick fix for your test, simply lowering the number of VFs (I think even by one) will make it work, but this needs further checking of course.
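A minimal sketch of why lowering the VF count by one could help. It assumes, without confirmation from this thread, that port ids are handed out sequentially (two PFs, then the VFs, then the TAP port) and that the Rx node check rejects any id that is not strictly below the limit. All names here are hypothetical and are not dpservice's actual code; only the limit of 128 comes from the log line above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit, taken from "max: 128" in the error message. */
#define SKETCH_MAX_PORTS 128

/* A port id can index the Rx node table only if it is strictly below the limit. */
static bool sketch_port_id_fits(uint16_t port_id)
{
	return port_id < SKETCH_MAX_PORTS;
}

/* Assumption: ids are sequential (PFs first, then VFs), so the TAP port,
 * added last, receives the id equal to the number of PFs plus VFs. */
static uint16_t sketch_tap_port_id(uint16_t pf_count, uint16_t vf_count)
{
	return (uint16_t)(pf_count + vf_count);
}
```

Under these assumptions, 2 PFs and 126 VFs put the TAP port at id 128, which fails the check, while 125 VFs put it at id 127, which passes.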

PlagueCZ commented 1 month ago

When compiled with enable_pf1_proxy, the service will never run in "normal mode" anymore, as the code does not actually check for the presence of the command-line switch. Such a check is needed so that our tests can switch between modes without changing the image.
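One possible shape for such a check, as a sketch only: the compile-time option makes the feature available, and the command-line switch decides at run time whether it is used. The helper names are hypothetical, and the plain argv scan stands in for whatever option parsing dpservice really uses.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: returns true if the given switch appears in argv. */
static bool sketch_has_switch(int argc, char *const argv[], const char *name)
{
	for (int i = 1; i < argc; i++) {
		if (strcmp(argv[i], name) == 0)
			return true;
	}
	return false;
}

/* The pf1 proxy is used only when it was compiled in AND the switch was
 * given, so the same image can still run in "normal mode". */
static bool sketch_use_pf1_proxy(bool compiled_with_pf1_proxy,
				 int argc, char *const argv[])
{
	return compiled_with_pf1_proxy
		&& sketch_has_switch(argc, argv, "--multiport-eswitch");
}
```

With this split, leaving out --multiport-eswitch on the command line falls back to normal mode even in a build that has the proxy compiled in.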

byteocean commented 1 month ago

Thanks for the final testing in the lab cluster and for merging it.