snabbco / snabb

Snabb: Simple and fast packet networking

Plan for Mellanox ConnectX-4 support #706

Open lukego opened 8 years ago

lukego commented 8 years ago

This issue spells out a plan for adding support for Mellanox ConnectX-4 NICs to Snabb Switch.

The idea is to develop "native" support for Mellanox cards. This requires understanding the hardware in depth and writing our own drivers. The necessary information is now public (as of June 2016): Mellanox Adapter Programmer's Reference Manual.

Logistics:

Driver:

Integration:

Related work:

Risks and unknowns:

Those risks should be taken care of when the PRM is released and we have a card to experiment with.

ghost commented 8 years ago

Regarding packetblaster is there a possibility that it is driver agnostic? What if I wanted to use it with a tap device?

lukego commented 8 years ago

@nnikolaev-virtualopensystems The normal operation of packetblaster is to create transmit descriptors and then reuse them in a loop. This works on Intel NICs and will hopefully work on Mellanox NICs too. I suspect it would also work on Virtio-net devices by changing the vring used and avail indexes. If it works for Virtio-net vrings then it should also work for a tap device, provided you use /dev/vhost-net to access it through a vring.

Failing all of that, it would also be possible for packetblaster to have a mode where it doesn't use the "DMA reset" trick and simply transmits packets. That should work for any I/O device but would likely become CPU bound if you use many NICs. (packetblaster does 100G, i.e. 10x10G, without breaking a sweat, and that is because of the trick.)
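The difference between the two modes can be illustrated with a toy model. This is only a sketch: plain Lua tables stand in for a transmit ring, nothing here touches hardware, and the names `new_ring`, `fill_once`, `replay` and `transmit_each` are hypothetical and not part of packetblaster.

```lua
-- Toy transmit ring: plain tables stand in for DMA descriptors.
-- All names are hypothetical and for illustration only.
local function new_ring (size)
   return { size = size, descriptors = {}, writes = 0, sent = 0 }
end

-- Writing a descriptor is the per-packet CPU work we want to avoid.
local function write_descriptor (ring, index, packet)
   ring.descriptors[index] = packet
   ring.writes = ring.writes + 1
end

-- Reuse mode, step 1: fill the whole ring once.
local function fill_once (ring, packets)
   for i = 1, ring.size do
      write_descriptor(ring, i, packets[(i - 1) % #packets + 1])
   end
end

-- Reuse mode, step 2: hand the same descriptors back to the device
-- for each lap, without writing any of them again.
local function replay (ring, laps)
   for _ = 1, laps do
      ring.sent = ring.sent + ring.size
   end
end

-- Plain mode: write a fresh descriptor for every packet sent.
local function transmit_each (ring, packets, count)
   for n = 1, count do
      local index = (n - 1) % ring.size + 1
      write_descriptor(ring, index, packets[(n - 1) % #packets + 1])
      ring.sent = ring.sent + 1
   end
end

local packets = { "pkt-a", "pkt-b" }

local reuse = new_ring(4)
fill_once(reuse, packets)
replay(reuse, 1000)

local plain = new_ring(4)
transmit_each(plain, packets, 4000)

-- Both rings sent the same number of packets, but the reuse ring
-- only ever wrote its 4 descriptors once.
print("reuse", reuse.sent, reuse.writes)
print("plain", plain.sent, plain.writes)
```

In this model the reuse ring does a fixed amount of descriptor work regardless of how many packets go out, which is the property that would keep the CPU from becoming the bottleneck.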

lukego commented 8 years ago

Added a link to Mellanox firmware release notes. Interesting reading. Gives some insight into the line between hardware/firmware on the cards i.e. which bugs are fixed and features added by firmware upgrades.

lukego commented 8 years ago

I have pushed a major update of the ConnectX-4 driver in commit 7659eb61fcf01cc9db342dfdc040644c23b45186.

I am now able to initialize the card and to transmit and receive packets, and I am reworking the code into a cleaner form. The initialization side of this is pushed; next I am doing clean transmit and receive. I also need to provide a suitable API for multiprocess operation and for setting up the Flow Tables and hashing in useful ways.

I squashed the history on that branch before I pushed to GitHub. I will be fetching some draft code from the more complete history on the lukego/mellanox-refactor branch in the short term. Part of the reason to squash is that I had included a relatively large log file from the Linux mlx5 driver in the repo and I'd like to avoid bloating the snabbco/snabb repo with that.

I also started filing issues for things that I am following up with Mellanox support. See issues with tag 'mellanox'.

lukego commented 8 years ago

I pushed a big update to the Mellanox driver with commit 21d0dc36d1f3dd1c022d19c3777339f52c3609b2. There is still work to do but most things are in place now.

The driver is now designed for multiprocess operation for use with #1021. The design is to have one ConnectX4 app for each NIC that performs initialization of all the queues, then to have any number of IO apps that attach to queue-pairs and can run in other Snabb processes.

Example:

-- App to set up the NIC
config.app(c, 'nic', ConnectX4, {pciaddress = '01:00.0',
                                 queues = {'a', 'b', 'c'}})
-- Apps to perform I/O (can be in other processes)
config.app(c, 'io-a', IO, {pciaddress = '01:00.0', queue = 'a'})
config.app(c, 'io-b', IO, {pciaddress = '01:00.0', queue = 'b'})
config.app(c, 'io-c', IO, {pciaddress = '01:00.0', queue = 'c'})

Currently all queues are set up for hashing (RSS).
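As a rough illustration of what hashing across queues means for the three queues in the example above: each flow is hashed and the hash selects a queue, so packets of one flow always land on the same queue. The sketch below is an assumption-laden model, not the driver's code. The hash is a simple polynomial string hash chosen for readability (real RSS hardware typically uses a Toeplitz hash), and `flow_hash` and `pick_queue` are hypothetical names.

```lua
-- Illustrative only: a simple hash standing in for the NIC's RSS hash.
-- flow_hash and pick_queue are hypothetical names, not driver API.
local queues = { 'a', 'b', 'c' }

-- Hash the flow identifiers into a number in [0, 2^32).
local function flow_hash (src, dst, sport, dport)
   local key = table.concat({ src, dst, sport, dport }, '|')
   local h = 0
   for i = 1, #key do
      h = (h * 31 + key:byte(i)) % 4294967296
   end
   return h
end

-- Map the hash onto one of the configured queue names.
local function pick_queue (src, dst, sport, dport)
   return queues[flow_hash(src, dst, sport, dport) % #queues + 1]
end

-- The same flow is always steered to the same queue.
local q1 = pick_queue('10.0.0.1', '10.0.0.2', 1234, 80)
local q2 = pick_queue('10.0.0.1', '10.0.0.2', 1234, 80)
print(q1, q2)
```

Because the mapping depends only on the flow identifiers, each IO app attached to a queue would see a stable subset of flows.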

There is more to do:

Current basic selftest output with sending/receiving packets between two NICs. (Here we see the apparent issue with the NIC duplicating broadcast packets i.e. sending onto the wire and also back onto local RX.)

selftest: waiting for both links up
Links up. Sending 10,000,000 packets.

NIC0
2,000,000,000 rx_bcast_octets
  20,000,000 rx_bcast_packets
           0 rx_error_octets
           0 rx_error_packets
           0 rx_mcast_octets
           0 rx_mcast_packets
           0 rx_ucast_octets
           0 rx_ucast_packets
1,000,000,000 tx_bcast_octets
  10,000,000 tx_bcast_packets
           0 tx_error_octets
           0 tx_error_packets
           0 tx_mcast_octets
           0 tx_mcast_packets
           0 tx_ucast_octets
           0 tx_ucast_packets

NIC1
2,000,000,000 rx_bcast_octets
  20,000,000 rx_bcast_packets
           0 rx_error_octets
           0 rx_error_packets
           0 rx_mcast_octets
           0 rx_mcast_packets
           0 rx_ucast_octets
           0 rx_ucast_packets
1,000,000,000 tx_bcast_octets
  10,000,000 tx_bcast_packets
           0 tx_error_octets
           0 tx_error_packets
           0 tx_mcast_octets
           0 tx_mcast_packets
           0 tx_ucast_octets
           0 tx_ucast_packets
selftest: complete
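The counters above can be cross-checked with a little arithmetic. The sketch below copies the values for one NIC from the output and assumes the interpretation given earlier, namely that each NIC receives its peer's broadcasts plus a looped-back copy of its own. The variable names are hypothetical.

```lua
-- Counter values copied from the selftest output above (same for both NICs).
local nic = {
   tx_bcast_packets = 10000000,
   tx_bcast_octets  = 1000000000,
   rx_bcast_packets = 20000000,
   rx_bcast_octets  = 2000000000,
}

-- Average size of each transmitted packet, in octets.
local octets_per_packet = nic.tx_bcast_octets / nic.tx_bcast_packets

-- Assumption: the peer sent as many packets as this NIC did,
-- so that is how many we expect to receive from the wire.
local expected_from_peer = nic.tx_bcast_packets

-- Anything received beyond that would be the NIC's own traffic
-- coming back on local RX.
local looped_back = nic.rx_bcast_packets - expected_from_peer

print("octets per packet", octets_per_packet)
print("surplus rx packets", looped_back)
print("surplus equals own tx", looped_back == nic.tx_bcast_packets)
```

Under that assumption the surplus RX count matches the NIC's own TX count exactly, which is consistent with every transmitted broadcast being duplicated onto local RX.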

eugeneia commented 8 years ago

Great progress! I will try to meet you halfway and get the IO(Control) abstractions in order and think about nefarious selftests.

tsuraan commented 7 years ago

Is this still progressing? It looks like things were close to done, and then the ticket stalled or something. It looks like Snabb is seeing a lot of development; is there just a ton of stuff going on around the I/O 2.0 that is holding up these other tickets?

lukego commented 7 years ago

Hi @tsuraan. Yes, I am actually planning to loop back this month and try to get the driver branch ready for upstream.

There is a fairly complete driver here: connectx_4.lua. It basically works but often seems to exercise some bad cases in the firmware (sometimes the NIC gets wedged and requires a cold boot, i.e. a full server power cycle, to recover). The more recent firmware releases are also missing some important information from their release notes (definitions of new error codes that sometimes appear). It's a slow and frustrating process to resolve these issues, but I have some new leads that I plan to follow up on shortly.