jpirko / libteam

team netdevice library
GNU Lesser General Public License v2.1

teamd gets stuck in a state and stays in it until we restart it. #48

Open lguohan opened 4 years ago

lguohan commented 4 years ago

We recently found a critical issue in libteam. Occasionally (very rarely), teamd gets stuck in a state and stays in it until we restart it. In this state:

  1. Teamd doesn’t produce any log output.
  2. Teamd sends out LACP packets as usual. The packets have State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing].
  3. The teamd member ports are disabled:

     root@host:/# teamnl PortChannel1013 options
      lb_port_stats (port:Ethernet48) \00\00\00\00\00\00\00\00
      queue_id (port:Ethernet48) 0
      priority (port:Ethernet48) 0
      user_linkup_enabled (port:Ethernet48) false
      user_linkup (port:Ethernet48) true
      enabled (port:Ethernet48) false
      lb_stats_refresh_interval 0
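
The check in item 3 can be automated. Below is a minimal sketch, assuming the `teamnl <dev> options` output has already been captured as text with one option per line, in the form shown in this report. The function names `parse_port_options` and `disabled_ports` are hypothetical helpers written for this illustration, not part of libteam.

```python
import re

# Matches per-port option lines such as: "enabled (port:Ethernet48) false".
# Team-wide options (no "(port:...)" part) are intentionally skipped.
PORT_OPTION = re.compile(r"^\s*(\S+)\s+\(port:([^)]+)\)\s+(.*?)\s*$")


def parse_port_options(text):
    """Return {port: {option: value}} parsed from `teamnl <dev> options` text."""
    ports = {}
    for line in text.splitlines():
        match = PORT_OPTION.match(line)
        if match:
            name, port, value = match.groups()
            ports.setdefault(port, {})[name] = value
    return ports


def disabled_ports(text):
    """Return the sorted list of ports whose `enabled` option is false."""
    options = parse_port_options(text)
    return sorted(p for p, opts in options.items() if opts.get("enabled") == "false")


# Sample taken from the output quoted in this report.
SAMPLE = r"""
 lb_port_stats (port:Ethernet48) \00\00\00\00\00\00\00\00
 queue_id (port:Ethernet48) 0
 priority (port:Ethernet48) 0
 user_linkup_enabled (port:Ethernet48) false
 user_linkup (port:Ethernet48) true
 enabled (port:Ethernet48) false
 lb_stats_refresh_interval 0
"""

if __name__ == "__main__":
    print(disabled_ports(SAMPLE))  # prints ['Ethernet48']
```

A port that is reported as disabled while the link is up and LACP packets are still being exchanged would be a candidate for the stuck state described here. A disabled port on its own is not proof of the bug.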

I analyzed logs for our teamd and found that:

  1. When teamd first enables the member ports and then receives carrier up, everything works as expected. This is the order in 99% of cases.
  2. When teamd receives the “carrier up” message first, it never logs anything after that.

For example, a working session:

Feb  7 23:16:59.385812 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet16: Adding port (found ifindex "28").
Feb  7 23:16:59.464625 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Adding port (found ifindex "29").
Feb  7 23:17:10.455175 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Enabling port
Feb  7 23:17:10.477280 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet16: Enabling port
Feb  7 23:17:10.477280 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Enable carrier. Number of enabled ports 2 >= configured min_ports 2
Feb  7 23:17:10.477280 str-s6100-acs-1 INFO teamd#teamd_PortChannel16[37]: carrier changed to UP

A session that is stuck:

Feb 19 22:57:29.826177 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Ethernet72: Adding port (found ifindex "59").
Feb 19 22:57:30.246222 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1
Feb 19 22:57:30.252943 str-s6100-acs-1 INFO teamd#teamd_PortChannel1019[170]: carrier changed to UP
Feb 19 22:57:30.263326 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1
Feb 19 22:57:30.263707 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1

After that there are no further messages from teamd, but it still sends LACP updates, and traffic is blackholed.
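
To look for other occurrences of this ordering in collected logs, a heuristic scan can help. The sketch below assumes syslog lines in the format shown above and flags any teamd instance whose log shows "carrier changed to UP" without an earlier "Enabling port" message. The function name `find_suspect_instances` is hypothetical, and the result is only a hint: a log window that starts after the ports were enabled would produce a false positive.

```python
import re

# Extracts the team device name and message from lines such as:
# "... DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Enabling port"
LOG_LINE = re.compile(r"teamd#teamd_(?P<dev>[^\[\s]+)\[\d+\]: (?P<msg>.*)$")


def find_suspect_instances(lines):
    """Return team devices whose carrier went UP before any port was enabled."""
    enabled_seen = set()  # devices that logged at least one "Enabling port"
    suspects = []         # devices matching the stuck ordering, in log order
    for line in lines:
        match = LOG_LINE.search(line.rstrip())
        if not match:
            continue
        dev, msg = match.group("dev"), match.group("msg")
        if "Enabling port" in msg:
            enabled_seen.add(dev)
        elif msg.startswith("carrier changed to UP"):
            if dev not in enabled_seen and dev not in suspects:
                suspects.append(dev)
    return suspects


# Lines copied from the two sessions quoted in this report.
SAMPLE_LOG = [
    'Feb  7 23:17:10.455175 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Enabling port',
    'Feb  7 23:17:10.477280 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet16: Enabling port',
    'Feb  7 23:17:10.477280 str-s6100-acs-1 INFO teamd#teamd_PortChannel16[37]: carrier changed to UP',
    'Feb 19 22:57:29.826177 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Ethernet72: Adding port (found ifindex "59").',
    'Feb 19 22:57:30.246222 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1',
    'Feb 19 22:57:30.252943 str-s6100-acs-1 INFO teamd#teamd_PortChannel1019[170]: carrier changed to UP',
]

if __name__ == "__main__":
    print(find_suspect_instances(SAMPLE_LOG))  # prints ['PortChannel1019']
```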