iqiyi / dpvs

DPVS is a high performance Layer-4 load balancer based on DPDK.

Report DPVS fdir bug #76

Closed 316953425 closed 6 years ago

316953425 commented 6 years ago

My configuration is as follows.

I am using only one NIC (an X710).

My DPVS configuration file is dpvs.conf.single-nic.sample. I noticed something strange: curl only succeeds when the NIC queue number is set to 1.

Debugging the code, I found that when the queue number is 2 and I curl the service (client IP 10.112.95.3),

Topology: client 10.112.95.3, VIP 10.114.249.201, LIP 10.114.249.202, server 10.112.95.3

the log prints the following:

    lcore 2 port0 ipv4 hl 5 tos 0 tot 60 id 35328 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.201
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 2 port0 ipv4 hl 5 tos 0 tot 60 id 35329 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.201
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 2 port0 ipv4 hl 5 tos 0 tot 60 id 35330 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.201
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 2 port0 ipv4 hl 5 tos 0 tot 60 id 35331 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.201
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202
    lcore 2 port0 ipv4 hl 5 tos 0 tot 60 id 35332 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.201
    lcore 1 port0 ipv4 hl 5 tos 0 tot 52 id 0 ttl 60 prot 6 src 10.112.95.3 dst 10.114.249.202

316953425 commented 6 years ago

Looks like nobody can solve this.

beacer commented 6 years ago

FDIR for multi-core is widely used in our production environment, for example in FullNAT and SNAT modes, and it seems stable. Most likely some misconfiguration is causing this issue. It seems you use the same IP for the RS and the client; can you try a different one? We'll check it if we have time. However, we really have limited development resources and higher-priority features and bugs to work on first, so it's hard to debug and support every reported issue in time. Any contribution is welcome.

316953425 commented 6 years ago

If there really is a restriction that the client and the server cannot be the same machine, that would surely be unreasonable.

If convenient, could you tell me which NIC model you use in your production environment?

lvsgate commented 6 years ago

Try adding more local addresses. If the implementation is the same as Alibaba's FullNAT, local addresses have a many-to-one relationship with queues, and a local address is assigned to a queue by taking its index modulo the number of queues.
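A minimal sketch of that modulo assignment, assuming a hypothetical helper name (this is not DPVS or LVS source, only an illustration of the scheme described above):

    /* Illustration only: in the Alibaba-style FullNAT scheme, many local
     * addresses (LIPs) share one RX queue/worker, and each LIP is pinned to
     * a queue by a simple modulo over the queue count. */
    static inline uint16_t lip_to_queue(uint32_t lip_index, uint16_t nr_queues)
    {
        return (uint16_t)(lip_index % nr_queues);
    }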

316953425 commented 6 years ago

Looking at the code, it first assigns each CPU a fixed number of LIP ports (and CPUs correspond directly to queues), and then configures FDIR through a struct like this:

struct rte_eth_fdir_filter filt[MAX_FDIR_PROTO] = {
    {
        .input.flow_type = RTE_ETH_FLOW_NONFRAG_IPV4_TCP,
        .input.flow.tcp4_flow.ip.dst_ip = dip,
        .input.flow.tcp4_flow.dst_port = dport,

        .action.behavior = RTE_ETH_FDIR_ACCEPT,
        .action.report_status = RTE_ETH_FDIR_REPORT_ID,
        .soft_id = filter_id[0],
    },
    {
        .input.flow_type = RTE_ETH_FLOW_NONFRAG_IPV4_UDP,
        .input.flow.udp4_flow.ip.dst_ip = dip,
        .input.flow.udp4_flow.dst_port = dport,

        .action.behavior = RTE_ETH_FDIR_ACCEPT,
        .action.report_status = RTE_ETH_FDIR_REPORT_ID,
        .soft_id = filter_id[1],
    },
};

That is, it establishes the mapping between queues and LIP ports via this struct.

But what puzzles me is that in that case only a single port value is actually set, while the mapping from LIP ports to CPUs is many-to-one.

Take one LIP, two queues, and two CPUs as an example:

cpu1 --- queue 1 ---- port(1026 1028 1030......)
cpu2 --- queue 2 ---- port(1025 1027 1029......)

Yet when the filters are set, the values of input.flow.udp4_flow.dst_port in the struct above are 0 and 1 respectively, and no mask is set. That's the part I find confusing. @beacer

beacer commented 6 years ago

Because of how our internal subnets are divided, internal IP resources are limited, so instead of Alibaba's approach of configuring FDIR with many LIPs, we use <lip, lport/mask> as the FDIR filter @lvsgate. The final number of LIPs is still related to concurrency, though, so you can't use too few either. @316953425, the current logic picks N bits of the lport for FDIR, where 2^N > the number of lcores, i.e., each core is assigned the lport range with a distinct masked value. The mask is set in netif_port_fdir_dstport_mask_set; how many mask bits are needed is decided by the number of CPU cores.
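A small self-contained sketch of that idea (not DPVS source; the exact rounding and byte order in DPVS may differ): grow a low-bit mask until it can distinguish the workers, then an lport's low bits select the queue. That is why the dst_port values in the filter struct quoted above can simply be 0 and 1 on a two-worker setup.

    #include <stdint.h>
    #include <stdio.h>

    /* Grow a low-bit mask until 2^N covers the number of worker lcores.
     * DPVS sets the real mask in netif_port_fdir_dstport_mask_set; this is
     * only an illustration of the principle. */
    static uint16_t fdir_dstport_mask(unsigned nr_workers)
    {
        uint16_t mask = 0;
        unsigned covered = 1;                    /* 2^0 */
        while (covered < nr_workers) {
            mask = (uint16_t)((mask << 1) | 1);  /* add one more low bit */
            covered <<= 1;
        }
        return mask;
    }

    int main(void)
    {
        unsigned nr_workers = 2;                 /* two slave lcores, as in this issue */
        uint16_t mask = fdir_dstport_mask(nr_workers);

        /* Each lport lands on the queue whose FDIR dst_port equals lport & mask. */
        for (unsigned lport = 1025; lport <= 1030; lport++)
            printf("lport %u -> queue %u (mask 0x%04x)\n",
                   lport, lport & mask, (unsigned)mask);
        return 0;
    }

With two workers this prints mask 0x0001, odd lports on one queue and even lports on the other, matching the cpu/queue/port split in the example above.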

lvsgate commented 6 years ago

Our local addresses are not on layer 2; the switch routes a whole /24 (class C) directly to the interface IP. A few dozen local addresses are not enough; we've been bitten by that, with lots of conflicts.

beacer commented 6 years ago

@lvsgate The problem is that the team allocating internal IPs didn't assign a whole /24 to one machine for LIPs; many machines share the range, and it often runs short. That isn't something we can control, only something to watch in future planning. As for LIPs versus lports: the former keeps the software logic much simpler, but besides the IP shortage, more IPs also mean somewhat higher operational cost for allocation, validation, and conflict avoidance, which is fine if the automation is good and a lot of manual work if it isn't. The latter, using lports, adds software complexity, but when concurrency is not very high it avoids configuring and managing that many LIPs.

316953425 commented 6 years ago

@beacer That's odd then, there should be no problem. Why are packets of my session being dispatched to different CPUs?

lvsgate commented 6 years ago

@beacer Understood. Layer-2 networks do have this allocation problem; layer-3 networks are much better.

beacer commented 6 years ago

@316953425 I've verified your config: two cores/queues and the same IP for client/RS, but failed to reproduce your issue. Our NIC is: Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

  1. My config: no default route as in yours, because the Client/RS is on the same network as DPVS.
VIP=192.168.100.100
LIP=192.168.100.200
RS=192.168.100.2

./dpip addr add ${VIP}/24 dev dpdk0
./ipvsadm -A -t ${VIP}:80 -s rr
./ipvsadm -a -t ${VIP}:80 -r ${RS} -b

./ipvsadm --add-laddr -z ${LIP} -t 192.168.100.100:80 -F dpdk0
  2. Client output

    root # curl 192.168.100.100
    Your ip:port : 192.168.100.2:58723
    root # curl 192.168.100.100
    Your ip:port : 192.168.100.2:58725
    root # curl 192.168.100.100
    Your ip:port : 192.168.100.2:58727
    root # curl 192.168.100.100
    Your ip:port : 192.168.100.2:58729
  3. Debug output shows that packets of the same connection reach the same lcore.

    IPVS: new conn:  [2] TCP 192.168.100.2:58751 192.168.100.100:80 192.168.100.200:1063 192.168.100.2:80 refs 2
    IPVS: conn lookup: [2] TCP 192.168.100.2:80 -> 192.168.100.200:1063 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:58751 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:58751 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:80 -> 192.168.100.200:1063 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:80 -> 192.168.100.200:1063 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:58751 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:58751 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:80 -> 192.168.100.200:1063 hit 
    IPVS: conn lookup: [2] TCP 192.168.100.2:58751 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 miss 
    IPVS: new conn:  [1] TCP 192.168.100.2:58753 192.168.100.100:80 192.168.100.200:1082 192.168.100.2:80 refs 2
    IPVS: conn lookup: [1] TCP 192.168.100.2:80 -> 192.168.100.200:1082 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:80 -> 192.168.100.200:1082 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:80 -> 192.168.100.200:1082 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:80 -> 192.168.100.200:1082 hit 
    IPVS: conn lookup: [1] TCP 192.168.100.2:58753 -> 192.168.100.100:80 hit 
    IPVS: del conn:  [2] TCP 192.168.100.2:58739 192.168.100.100:80 192.168.100.200:1053 192.168.100.2:80 refs 0
    IPVS: del conn:  [2] TCP 192.168.100.2:58741 192.168.100.100:80 192.168.100.200:1055 192.168.100.2:80 refs 0
  4. My cpu/queue config for your reference:

netif_defs {
    !<init> pktpool_size     524287
    <init> pktpool_size     250000
    <init> pktpool_cache    256

    <init> device dpdk0 {
        rx {
            queue_number        2
            descriptor_number   1024
            rss                 tcp
        }
        tx {
            queue_number        2
            descriptor_number   1024
        }
    !    promisc_mode
        kni_name                dpdk0.kni
    }
}

worker_defs {
    <init> worker cpu0 {
        type    master
        cpu_id  0
    }

    <init> worker cpu1 {
        type    slave
        cpu_id  1
        port    dpdk0 {
            rx_queue_ids     0
            tx_queue_ids     0
            ! isol_rx_cpu_ids  9
            ! isol_rxq_ring_sz 1048576
        }
    }

    <init> worker cpu2 {
        type    slave
        cpu_id  2
        port    dpdk0 {
            rx_queue_ids     1
            tx_queue_ids     1
            ! isol_rx_cpu_ids  10
            ! isol_rxq_ring_sz 1048576
        }
    }

}

316953425 commented 6 years ago

@beacer Thanks for your help. My setup is indeed the same, except that my configuration file has the following content in addition to yours:

! timer config
timer_defs {
    ! cpu job loops to schedule dpdk timer management
    schedule_interval    500
}

! dpvs neighbor config
neigh_defs {
    unres_queue_length  128
    pktpool_size        1023
    pktpool_cache       32
    timeout             60
}

! dpvs ipv4 config
ipv4_defs {
    default_ttl 64
    fragment {
        bucket_number   4096
        bucket_entries  16
        max_entries     4096
        ttl             1
    }
}

! control plane config
ctrl_defs {
    lcore_msg {
        ring_size               4096
        multicast_queue_length  256
        sync_msg_timeout_us     2000
    }
    ipc_msg {
        unix_domain /var/run/dpvs_ctrl
    }
}

! ipvs config
ipvs_defs {
    conn {
        conn_pool_size      2097152
        conn_pool_cache     256
        conn_init_timeout   3
        ! expire_quiescent_template
        ! fast_xmit_close
    }

    udp {
        defence_udp_drop
        timeout {
            normal  300
            last    3
        }
    }

    tcp {
        defence_tcp_drop
        timeout {
            none        2
            established 90
            syn_sent    3
            syn_recv    30
            fin_wait    7
            time_wait   7
            close       3
            close_wait  7
            last_ack    7
            listen      120
            synack      30
            last        2
        }
        synproxy {
            synack_options {
                mss     1452
                ttl     63
                sack
                ! wscale
                ! timestamp
            }
            ! defer_rs_syn
            rs_syn_max_retry    3
            ack_storm_thresh    10
            max_ack_saved       3
            conn_reuse_state {
                close
                time_wait
                ! fin_wait
                ! close_wait
                ! last_ack
            }
        }
    }
}

! sa_pool config
sa_pool {
    pool_hash_size  16
}

But this shouldn't affect anything. I'm using an X710 right now; I'll try a different NIC later. Other than that there's really no difference, so let's see whether it's just that the X710's support for DPDK FDIR is poor.

lvsgate commented 6 years ago

@316953425 We use the kernel i40e driver, and there the X710 does not support the FDIR mask; I don't know how it behaves with DPDK.

beacer commented 6 years ago

@316953425 I didn't paste all lines of dpvs.conf; the remaining part is irrelevant to FDIR and should not affect the result. As @lvsgate mentioned, if the X710 supports FDIR but not the FDIR mask, FNAT/SNAT won't work. If possible, please try another NIC such as the X540 we use, or verify whether both the FDIR mask and rules work on the X710.
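For that verification, a rough sketch against the legacy DPDK filter API of that era; it only reads back what the PMD reports, so whether the i40e/X710 driver actually honours the mask still has to be confirmed with real traffic. Call it after rte_eal_init() and device configuration; port_id is whatever port the X710 is bound to.

    #include <stdio.h>
    #include <string.h>
    #include <rte_ethdev.h>
    #include <rte_eth_ctrl.h>

    /* Query whether the PMD accepts flow-director control at all, and dump
     * the FDIR mode and dst_port mask it currently reports. */
    static void dump_fdir_info(uint16_t port_id)
    {
        struct rte_eth_fdir_info info;

        if (rte_eth_dev_filter_supported(port_id, RTE_ETH_FILTER_FDIR) < 0) {
            printf("port %u: FDIR filtering not supported by this PMD\n",
                   (unsigned)port_id);
            return;
        }

        memset(&info, 0, sizeof(info));
        if (rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_FDIR,
                                    RTE_ETH_FILTER_INFO, &info) < 0) {
            printf("port %u: failed to query FDIR info\n", (unsigned)port_id);
            return;
        }

        /* dst_port_mask is printed in the byte order the PMD reports it in */
        printf("port %u: fdir mode %d, dst_port_mask 0x%04x\n",
               (unsigned)port_id, (int)info.mode,
               (unsigned)info.mask.dst_port_mask);
    }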

316953425 commented 6 years ago

@lvsgate @beacer Thanks! I'll report back here as soon as I have results after swapping the NIC. Thanks to you both.

316953425 commented 6 years ago

@beacer @lvsgate Thanks, it was indeed a NIC problem; it works now.