openlogic / AzureBuildCentOS

Kickstart scripts and other components to build CentOS images for Azure

Inconsistencies in some openlogic centos mirrors #125

Closed pbertin closed 1 year ago

pbertin commented 1 year ago

For the last couple of days, we have been observing update failures on OpenLogic:CentOS:7_9-gen2 instances, which apparently come from inconsistencies between repository replicas.

"yum update" or "yum makecache" fail most of the time with error messages similar to:

updates-openlogic/7/x86_64/fil FAILED                                          
http://olcentgbl.trafficmanager.net/centos/7/updates/x86_64/repodata/filelists.sqlite.bz2: [Errno -1] Metadata file does not match checksum           ]  0.0 B/s |    0 B  --:--:-- ETA 
Trying other mirror.
updates-openlogic/7/x86_64/oth FAILED                                          
http://olcentgbl.trafficmanager.net/centos/7/updates/x86_64/repodata/other.sqlite.bz2: [Errno -1] Metadata file does not match checksum               ]  0.0 B/s |    0 B  --:--:-- ETA 
Trying other mirror.
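
For context, yum raises this error when a downloaded metadata file does not hash to the checksum recorded for it in the repository's repomd.xml, which is what one would expect if repomd.xml and the file come from replicas at different sync states. A minimal sketch of that comparison, assuming a SHA-256 checksum and GNU `sha256sum`; the function name and its arguments are illustrative and not part of yum:

```shell
# Succeeds when the SHA-256 digest of FILE equals EXPECTED (a hex string,
# e.g. the value copied by hand from the <checksum> element in repomd.xml).
checksum_matches() {
  file=$1
  expected=$2
  actual=$(sha256sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}
```

Typical use would be `checksum_matches filelists.sqlite.bz2 <hex-from-repomd> || echo mismatch`, where both the file and the expected value are fetched through the same endpoint.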

Indeed, fetching those files in a loop shows different versions being served alternately from the load balancer endpoint:

# while curl -sI http://olcentgbl.trafficmanager.net/centos/7/updates/x86_64/repodata/filelists.sqlite.bz2 | grep -e Last-Modified -e Content-Length; do sleep 1; done
Last-Modified: Sat, 21 Jan 2023 09:33:22 GMT
Content-Length: 11113614
Last-Modified: Sun, 22 Jan 2023 21:44:17 GMT
Content-Length: 11115537
[...]
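
To quantify the flapping rather than eyeball it, the responses can be tallied. The following is a sketch only: `distinct_versions` is a hypothetical helper name, and it assumes it is fed the unfiltered output of repeated `curl -sI` calls (header blocks separated by blank lines).

```shell
# Read concatenated `curl -sI` output on stdin and print each distinct
# (Content-Length, Last-Modified) pair, prefixed by how often it was seen.
distinct_versions() {
  tr -d '\r' | awk '
    tolower($1) == "last-modified:"  { sub(/^[^:]*:[ ]*/, ""); lm = $0; next }
    tolower($1) == "content-length:" { sub(/^[^:]*:[ ]*/, ""); len = $0; next }
    # A blank line ends one response: emit what was collected and reset.
    /^$/ { if (lm != "" || len != "") print len "\t" lm; lm = ""; len = "" }
    END  { if (lm != "" || len != "") print len "\t" lm }
  ' | sort | uniq -c
}
```

For example: `for i in $(seq 1 20); do curl -sI http://olcentgbl.trafficmanager.net/centos/7/updates/x86_64/repodata/filelists.sqlite.bz2; sleep 1; done | distinct_versions`. More than one output line means more than one version of the file is being served.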

This has been observed at least in the "northeurope" and "francecentral" Azure regions.

N3WWN commented 1 year ago

I did find one of our load balancers was not set for persistent sessions, but that was the one in Southeast Asia. I just fixed that setting. All other load balancers are set for persistent sessions keyed on the client IP.

If you can share the public IP of your system, I can check which actual nodes it was hitting.

pbertin commented 1 year ago

I have an instance currently experiencing the issue in the "northeurope" Azure region. Its public IP is 13.94.94.43. The replica inconsistency can be seen, for example, by looping over:

curl -Iv http://olcentgbl.trafficmanager.net/centos/7/updates/x86_64/repodata/filelists.sqlite.bz2

This hostname resolves alternately to two different IPs, which return different sets of results:
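
One way to compare what each resolved address serves is to pin the request to one address at a time with curl's `--resolve` option. This is a rough sketch, not the exact commands used in this thread: `headers_per_ip` and the overridable `CURL` variable are hypothetical, and the addresses have to be supplied by the caller (for instance from repeated `getent hosts` or `dig +short` lookups).

```shell
CURL=${CURL:-curl}   # overridable so the function can be exercised offline

# Print the Last-Modified and Content-Length headers served by each address.
# Usage: headers_per_ip HOST PATH IP [IP...]
headers_per_ip() {
  host=$1
  path=$2
  shift 2
  for ip in "$@"; do
    printf '== %s ==\n' "$ip"
    # --resolve forces HOST:80 to connect to this specific address.
    "$CURL" -sI --resolve "$host:80:$ip" "http://$host$path" |
      tr -d '\r' | grep -i -e '^Last-Modified:' -e '^Content-Length:' || true
  done
}
```

Called as `headers_per_ip olcentgbl.trafficmanager.net /centos/7/updates/x86_64/repodata/filelists.sqlite.bz2 <ip-1> <ip-2>`, differing headers between the two sections would point at replicas that are out of sync.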

Let me know if I can be of further help, of course.

N3WWN commented 1 year ago

Thank you, @pbertin!

Using this info, I was able to determine that your system was hitting one of our older nodes and one of our newer nodes.

As a little background, our load balancers are nested into two tiers: global and regional.

We've been adding new nodes to our repos, and the first (global) tier was not maintaining client IP persistence between the regional load balancer for the old nodes and the regional load balancer for the new nodes. Since both old and new nodes were in the same region, the global tier would return each of them in a round-robin fashion.
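
The difference between the two behaviours can be illustrated with a toy model. This is only a sketch under assumptions: the backend names and both selection functions are hypothetical, and a CRC of the client IP stands in for whatever persistence mechanism the real load balancers use.

```shell
# Hypothetical pool: one regional balancer per node generation.
BACKENDS="old-node new-node"

# Round-robin: the backend depends only on the request counter, so the
# same client alternates between backends on consecutive requests.
pick_round_robin() {
  request_no=$1
  set -- $BACKENDS
  shift $(( request_no % $# ))
  printf '%s\n' "$1"
}

# Client IP persistence (modelled as a hash): the backend depends only on
# the client address, so repeated requests from one client land together.
pick_by_client_ip() {
  client_ip=$1
  set -- $BACKENDS
  hash=$(printf '%s' "$client_ip" | cksum | cut -d' ' -f1)
  shift $(( hash % $# ))
  printf '%s\n' "$1"
}
```

With round-robin, requests 0, 1, 2 go to `old-node`, `new-node`, `old-node`, so a client can fetch repomd.xml from one replica and the files it references from another. With the keyed variant, every call for the same address returns the same backend.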

This should be fixed now and the global tier load balancer should no longer return differing repo nodes upon subsequent connections.