We recently found a critical issue in libteam.
Very rarely, teamd gets stuck in a state and stays in it until we restart it.
In this state:
Teamd doesn’t send any log output
Teamd sends out LACP packets as usual. The packets have State Flags [Activity, Aggregation, Synchronization, Collecting, Distributing]
I analyzed the logs of our teamd and found that:
When teamd first enables the member ports and then receives carrier up, everything works as expected. This is the order in 99% of cases.
When teamd receives the "carrier up" message first, it never logs anything after that.
For example:
Working session:
Feb 7 23:16:59.385812 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet16: Adding port (found ifindex "28").
Feb 7 23:16:59.464625 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Adding port (found ifindex "29").
Feb 7 23:17:10.455175 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Enabling port
Feb 7 23:17:10.477280 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Ethernet16: Enabling port
Feb 7 23:17:10.477280 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel16[37]: Enable carrier. Number of enabled ports 2 >= configured min_ports 2
Feb 7 23:17:10.477280 str-s6100-acs-1 INFO teamd#teamd_PortChannel16[37]: carrier changed to UP
Stuck session:
Feb 19 22:57:29.826177 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Ethernet72: Adding port (found ifindex "59").
Feb 19 22:57:30.246222 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1
Feb 19 22:57:30.252943 str-s6100-acs-1 INFO teamd#teamd_PortChannel1019[170]: carrier changed to UP
Feb 19 22:57:30.263326 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1
Feb 19 22:57:30.263707 str-s6100-acs-1 DEBUG teamd#teamd_PortChannel1019[170]: Enable carrier. Number of enabled ports 1 >= configured min_ports 1
After that there are no more messages from teamd, but it still sends LACP updates, and traffic is blackholed.
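To find other affected port channels in a syslog capture, the ordering difference between the two sessions above can be checked mechanically. The following is a minimal sketch, not part of libteam or teamd: the function name find_suspect_sessions is hypothetical, and the heuristic (an "Enable carrier" message before any "Enabling port" message from the same teamd instance) is only an assumption drawn from the two excerpts above.

```python
import re

# Matches the teamd syslog lines shown above, e.g.
# "... DEBUG teamd#teamd_PortChannel16[37]: Ethernet17: Enabling port"
# The tag keeps the pid so that a restarted teamd counts as a new session.
LINE_RE = re.compile(r"teamd#teamd_(?P<tag>\S+\[\d+\]): (?P<msg>.*)")


def find_suspect_sessions(lines):
    """Return teamd instances that logged "Enable carrier" before any "Enabling port".

    Assumption: this ordering is the signature of the stuck state, based only
    on the two log excerpts in this report.
    """
    enabled_seen = set()  # instances that already logged "Enabling port"
    suspects = []         # instances where carrier came up first, in log order
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a teamd line
        tag = match.group("tag")
        msg = match.group("msg").strip()
        if msg.endswith("Enabling port"):
            enabled_seen.add(tag)
        elif msg.startswith("Enable carrier") and tag not in enabled_seen:
            if tag not in suspects:
                suspects.append(tag)
    return suspects
```

Feeding it the lines of a syslog file (for example via open(path) as an iterable) returns the instance tags, such as "PortChannel1019[170]", that match the stuck pattern. A match is only a hint that the instance should be inspected, since the heuristic has not been validated beyond these two sessions.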