sonic-net / sonic-buildimage

Scripts which perform an installable binary image build for SONiC
Other
723 stars 1.38k forks source link

[DPB] [lldp] LLDP does not have logic to gracefully handle port deletion #9802

Open alexrallen opened 2 years ago

alexrallen commented 2 years ago

Description

When we delete a port using dynamic port breakout lldp does not gracefully handle the port deletion.

This is likely due to the fact that in the lldpmgrd logic it attempts to lookup the alias and port name of a deleted port in Config DB, a request that we know will fail.

https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L288

Steps to reproduce the issue:

  1. Use dynamic port breakout to remove a port running on the switch (go from a 2x to 1x breakout mode or similar)

Describe the results you received:

Jan 19 15:50:40.908722 r-lionfish-14 ERR lldp#lldpmgrd[33]: Port 'Ethernet236' not found in PORT table in Config DB. Using port name instead of port alias.

This error is seen for all removed ports.

Describe the results you expected:

No Error.

dgsudharsan commented 2 years ago

@zhenggen-xu @praveen-li Can you please investigate?

praveen-li commented 2 years ago

Kindly share initial config and command used for breakout. Seems like config_Db did not have the entry initially for a port and that port is added later by breakout. But lldrmgrd complains that port is not present in config db which is a local copy taken at start.

@alexrallen

alexrallen commented 2 years ago

@praveen-li I have attached a techsupport which should give you access to the state of config_db.json as well as the state of CONFIG_DB.

What we did was breakout 4 ports from 100G to 4x25G and then clean up those breakouts. You should be able to follow the syslog and see that pattern.

sonic_dump_r-lionfish-14_20220119_161453.tar.gz

liat-grozovik commented 2 years ago

@praveen-li this issue is for 202111. can you please help to provide ETA for fix to be available?

praveen-li commented 2 years ago

As per Code, LLDP listens to DEL event in config DB. So we need more data on:

-- Do you observe LLDP listens to only specific DEL event in config DB, but not all. -- If listens to all, then Is it a regression introduced later in LLDP or do we surely know, it is day 1 issue which is exposed by DPB?

Thanks,

Regards Praveen


From: liat-grozovik @.> Sent: Tuesday, March 8, 2022 1:32 AM To: Azure/sonic-buildimage @.> Cc: Praveen Chaudhary @.>; Mention @.> Subject: Re: [Azure/sonic-buildimage] [DPB] [lldp] LLDP does not have logic to gracefully handle port deletion (Issue #9802)

@praveen-lihttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fpraveen-li&data=04%7C01%7Cpchaudhary%40linkedin.com%7C8d78c918e5de4c32459808da00e6918e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823287530014854%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=L46logXykj01a99SRh3lph%2FWcNp3%2F4CqF5hy8M7kLlo%3D&reserved=0 this issue is for 202111. can you please help to provide ETA for fix to be available?

— Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fsonic-buildimage%2Fissues%2F9802%23issuecomment-1061578926&data=04%7C01%7Cpchaudhary%40linkedin.com%7C8d78c918e5de4c32459808da00e6918e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823287530014854%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=7lwQ7OlC7wFenYI4iM%2FPGa5ik7W032IKQI3Hj46oN%2B8%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJHZHPIKKXMQG3CVMSPNXWLU64NCZANCNFSM5MKYFL3Q&data=04%7C01%7Cpchaudhary%40linkedin.com%7C8d78c918e5de4c32459808da00e6918e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823287530014854%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=pqwbo5OETlv7YsPfqQNNeFzIcVDJWt1XaMZg6NZsqwo%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cpchaudhary%40linkedin.com%7C8d78c918e5de4c32459808da00e6918e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823287530014854%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=xWBDDexiX5IyO5MeMW4qcMnf%2FNimRSYUWvr2gDcX3ZI%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cpchaudhary%40linkedin.com%7C8d78c918e5de4c32459808da00e6918e%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823287530014854%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=4TDZRvLyVBYYXW0YsxoZtbC50hm%2FeXHIRS2YEkSNz9k%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

alexrallen commented 2 years ago

Wouldn't this line of code fail when a port is deleted? https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L299

It attempts to pull a port that no longer exists from CONFIG_DB here https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L140

praveen-li commented 2 years ago

When DPB deletes the Port, Config Changes is

{
'PORT': {
'Ethernet8': None
}
}

Here we seems to be checking >>>> "if ("alias" in fvp_dict or "description" in fvp_dict)" in https://github.com/Azure/sonic-buildimage/blob/d112e7cca9087d02892ada9d70bddcc5d5616c24/dockers/docker-lldp/lldpmgrd#L297

BTW, we do not have a local repro for this in LinkedIn, so if possible please log all FVs in case of config_db changes during DPB. FVs data may give us some idea as why LLDP is failing here, or is it a timing issue.

Thanks,

Regards Praveen


From: Alexander Allen @.> Sent: Tuesday, March 8, 2022 1:53 PM To: Azure/sonic-buildimage @.> Cc: Praveen Chaudhary @.>; Mention @.> Subject: Re: [Azure/sonic-buildimage] [DPB] [lldp] LLDP does not have logic to gracefully handle port deletion (Issue #9802)

Wouldn't this line of code fail when a port is deleted? https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L299https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fsonic-buildimage%2Fblob%2Fmaster%2Fdockers%2Fdocker-lldp%2Flldpmgrd%23L299&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=RdR%2FHCcAUYZVuezMd6APUledZiypdwMn9aQCm3iZU8Q%3D&reserved=0

It attempts to pull a port that no longer exists from CONFIG_DB here https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L140https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fsonic-buildimage%2Fblob%2Fmaster%2Fdockers%2Fdocker-lldp%2Flldpmgrd%23L140&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=HwsNlFA7AH8lU6TMUnIgWF%2BIWXQC84mY%2FFR0xnr0xZA%3D&reserved=0

— Reply to this email directly, view it on GitHubhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FAzure%2Fsonic-buildimage%2Fissues%2F9802%23issuecomment-1062249951&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=hLkL36j09DNPm8ugv3OEyyPVUF1YC11OshwmfjDP7zw%3D&reserved=0, or unsubscribehttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FAJHZHPJZNWGIP6ZU324QHNLU67D5PANCNFSM5MKYFL3Q&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=qC7T7RK7yG6dpAtq2T7nmpATDOYE0bqIoUY%2BiakG8U8%3D&reserved=0. Triage notifications on the go with GitHub Mobile for iOShttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fapps.apple.com%2Fapp%2Fapple-store%2Fid1477376905%3Fct%3Dnotification-email%26mt%3D8%26pt%3D524675&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=hIQgDph2SP8eXq72ecGVU2hG3xlSzl8rpMC4rL%2FvD4Y%3D&reserved=0 or Androidhttps://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fplay.google.com%2Fstore%2Fapps%2Fdetails%3Fid%3Dcom.github.android%26referrer%3Dutm_campaign%253Dnotification-email%2526utm_medium%253Demail%2526utm_source%253Dgithub&data=04%7C01%7Cpchaudhary%40linkedin.com%7Cefa1bddf65d849ea858608da014e1630%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637823732150023196%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=MIHeX5WhowFiwfxLHD%2BhNj82ckGMARSas00d%2BXqvcpA%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

alexrallen commented 2 years ago

Hrm okay looking at this a bit more I think you are correct @praveen-li that this is a timing issue.

Specifically I believe this is a race condition between APP_DB and LLDP.

In the log we can see directly before portmgrd does a deletion of the port it sets it to admin down.

Jan 19 15:50:40.907150 r-lionfish-14 NOTICE swss#portmgrd: :- doTask: Configure Ethernet236 admin status to down
Jan 19 15:50:40.907150 r-lionfish-14 NOTICE swss#portmgrd: :- doTask: Delete Port: Ethernet236

I believe that before LLDP handles the admin_down operation the port is also deleted so that when it gets to the following line... https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-lldp/lldpmgrd#L319

It will have passed the if "oper_status" in fvp_dict check on line 317 and then results in the error message on line 319.

Granted, I am not seeing reproductions any more manually, I am checking internally to see if we are seeing reproductions of this error message, due to the apparent racing nature of it, they might be infrequent.

zhenggen-xu commented 2 years ago

Is this a temp issue and would be recovered by it own later?

By looking at the code, it does have timing issue where the code use different snapshot of the app PORT table make decision vs execution. If this is the case, we may need refactor the code to take a snapshot of the PORT table once then make decision and execution based on that.

alexrallen commented 2 years ago

Looks like it does recover yes, the error log is seen only during the race condition.

Even still, I feel a refactor may be wise to prevent false positive error logs and to remove undefined behavior.

zhenggen-xu commented 2 years ago

I agree, we should fix it. The appDB PORT table should be read once to local memory in the while loop, and then it should be used in the functions in the loop. This can avoid inconsistency in different functions having potential different views of the table.