scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0

DPDK + i40e does not work for outbound connections #50

Open gleb-cloudius opened 9 years ago

gleb-cloudius commented 9 years ago

We have multiple problems with DPDK integration and i40e RSS; some of them are on our side, and some are DPDK shortcomings, but we should work around them anyway. The DPDK problems are described here: http://patchwork.dpdk.org/ml/archives/dev/2015-February/013300.html and as far as I can tell are still not addressed. The problem on our side is that we need to configure which RSS algorithm to use (i40e supports two, and the default is not the one the seastar code assumes).

dorlaor commented 9 years ago

@vladzcloudius please prioritize this, as it should help to show the gains of seastar using i40e with httpd and seawreck. To trigger it, just run seawreck in DPDK mode with --smp > 1 (on the other side either nginx or httpd will do the job).

vladzcloudius commented 9 years ago


Ok. I'll start working on this.


vladzcloudius commented 9 years ago

There was an interesting issue with the i40e PMD with the seastar head 3484074a300cad64e1bc780029d89731da12c10b and dpdk head 8a2d53b81c98db356ed02ce6f2a2f573e5790019:

The setup: intel1/intel2 servers with XL710 ports connected back to back.

The HTTP server command on intel2: nginx

The seawreck command on intel1:

sudo ./build/release/apps/seawreck/seawreck --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr 192.168.10.118 --gw-ipv4-addr 192.168.10.199 --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp 1 -m 20G --server 192.168.10.199:80 --conn 1 --reqs 1

The result: seawreck gets stuck.

The tcpdump from the server side:

07:02:06.639001 IP 192.168.10.118.59853 > 192.168.10.199.http: Flags [S], seq 2414480517, win 29200, options [mss 1460,wscale 7,eol], length 0
07:02:06.639083 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [S.], seq 210059785, ack 2414480518, win 29200, options [mss 1460,nop,wscale 7], length 0
07:02:06.639188 IP 192.168.10.118.59853 > 192.168.10.199.http: Flags [.], seq 1:42, ack 1, win 29200, length 41: HTTP: GET / HTTP/1.1
07:02:06.639249 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], ack 42, win 229, length 0
07:02:06.639514 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [P.], seq 1:239, ack 42, win 229, length 238: HTTP: HTTP/1.1 200 OK
07:02:06.639601 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], seq 239:3159, ack 42, win 229, length 2920: HTTP
07:02:06.639626 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [P.], seq 3159:3939, ack 42, win 229, length 780: HTTP: put your content in a location of
07:02:06.639667 IP 192.168.10.118.59853 > 192.168.10.199.http: Flags [.], ack 239, win 29200, length 0
07:02:06.840506 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], seq 239:1699, ack 42, win 229, length 1460: HTTP
07:02:07.243510 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], seq 239:1699, ack 42, win 229, length 1460: HTTP
07:02:08.049515 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], seq 239:1699, ack 42, win 229, length 1460: HTTP
07:02:09.659504 IP 192.168.10.199.http > 192.168.10.118.59853: Flags [.], seq 239:1699, ack 42, win 229, length 1460: HTTP
07:02:12.358651 IP 192.168.10.199.http > 192.168.10.118.56359: Flags [F.], seq 3700, ack 1, win 229, length 0
07:02:12.358754 IP 192.168.10.118.56359 > 192.168.10.199.http: Flags [R], seq 2659224170, win 0, length 0

On the seawreck side we see the following with this patch:

diff --git a/net/dpdk.cc b/net/dpdk.cc
index 0dcbd0a..b823dfd 100644
--- a/net/dpdk.cc
+++ b/net/dpdk.cc
@@ -700,6 +700,9 @@ build_mbuf_cluster:
                 return nullptr;
             }

+            printf("# from_packet_copy: about to send: len %d nr_frags %d\n",
+                   p.len(), p.nr_frags());
+
             /*
              * Here we are going to use the fact that the inline data size is a
              * power of two.
@@ -1900,6 +1903,8 @@ template<>
 inline std::experimental::optional<packet>
 dpdk_qp<false>::from_mbuf(rte_mbuf* m)
 {
+    printf("# from_mbuf: pkt_len %d data_len %d nr_segs %d\n",
+           m->pkt_len, m->data_len, m->nb_segs);
     if (!_dev->hw_features_ref().rx_lro || rte_pktmbuf_is_contiguous(m)) {
         //
         // Try to allocate a buffer for packet's data. If we fail - give the
@@ -1913,12 +1918,12 @@ dpdk_qp<false>::from_mbuf(rte_mbuf* m)
         if (!buf) {
             // Drop if allocation failed
             rte_pktmbuf_free(m);
-
+            printf("## dropping\n");
             return std::experimental::nullopt;
         } else {
             rte_memcpy(buf, rte_pktmbuf_mtod(m, char*), len);
             rte_pktmbuf_free(m);
-
+            printf("## packet is ok\n");
             return packet(fragment{buf, len}, make_free_deleter(buf));
         }
     } else {

And the corresponding debug output:

# from_packet_copy: about to send: len 42 nr_frags 1
# from_mbuf: pkt_len 60 data_len 60 nr_segs 1
## packet is ok
# from_packet_copy: about to send: len 62 nr_frags 1
# from_mbuf: pkt_len 62 data_len 62 nr_segs 1
## packet is ok
# from_packet_copy: about to send: len 95 nr_frags 2
# from_mbuf: pkt_len 60 data_len 60 nr_segs 1
## packet is ok
# from_mbuf: pkt_len 292 data_len 292 nr_segs 1
## packet is ok
# from_mbuf: pkt_len 834 data_len 834 nr_segs 1
## packet is ok
# from_packet_copy: about to send: len 54 nr_frags 1
# from_mbuf: pkt_len 60 data_len 60 nr_segs 1
## packet is ok
# from_packet_copy: about to send: len 42 nr_frags 1
# from_mbuf: pkt_len 60 data_len 60 nr_segs 1
## packet is ok
# from_packet_copy: about to send: len 54 nr_frags 1

From the tcpdump above you may notice that both sides advertised the same MSS (1460B, implying a 1500B MTU); however, it seems that packets above 1000 bytes don't make it to the receiver on the seawreck side.

vladzcloudius commented 9 years ago

After playing a bit with the MTU size on the server side, we noticed that things begin to work once the MTU is configured at or below 1496B. Therefore it seems there is an issue in the way the i40e PMD configures the MTU in the HW.

I have a strong feeling that somebody is ignoring the CRC size... (1500B - 1496B = 4B, which is exactly the length of the Ethernet FCS.)

edevil commented 8 years ago

So is this fixed according to commit 2df7c9d?

vladzcloudius commented 8 years ago

Yes, this is fixed.

edevil commented 8 years ago

But the issue is still open; that's confusing.

avikivity commented 8 years ago

I don't think all the issues are fixed; we had a problem with the RSS hash calculation as well, no?

vladzcloudius commented 8 years ago

@avikivity The RSS issues are supposed to be addressed in commits accc07bb273a649889dd504430cdb3f0bc6828f7 and eec89a57a43c070df9d3e29fa649df7237be9f39