Investigate connection failures of Cygwin buildbot

embray commented 5 years ago

This is more of an operational ticket and not something that needs to be fixed in Sage itself.

But I wanted to create a specific place to track my investigation of this problem.

It's an important problem, because it has rendered the Cygwin buildbot (which previously worked, albeit briefly) effectively inoperable for a couple of months, though the issue seemed a little beyond my purview to do much about.

The Cygwin buildbot runs on the OpenStack service provided by UPSud at LAL (the linear accelerator laboratory). This is where I host several other builds as well, on both Linux and Windows machines. I am also hosting a Cygwin buildbot for CPython, which has been plagued by the same problem.

The problem is, effectively, that long-held TCP connections to the Windows VM instances seem to get randomly dropped. I am not (to my knowledge, yet) noticing this problem on Linux instances. This includes all TCP connections, so for example if I have an SSH connection to the machine, the connection gets closed. The problem seems to occur on the VM's side. It happens even after significant tweaking of keep-alive options to ensure that both client and server are sending regular pings to each other.
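For reference, a minimal sketch of the kind of keep-alive tuning mentioned above, applied to a single TCP socket from Python. The function name `enable_keepalive` and the default timings are assumptions for illustration only; they are not the settings actually used on these machines.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keep-alive probes for an existing socket.

    idle and interval are in seconds; count is the number of unanswered
    probes tolerated. All three defaults are illustrative guesses.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms).
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle * 1000, interval * 1000))
    else:
        # Linux and others: set whichever per-socket options exist.
        for name, value in (("TCP_KEEPIDLE", idle),
                            ("TCP_KEEPINTVL", interval),
                            ("TCP_KEEPCNT", count)):
            opt = getattr(socket, name, None)
            if opt is not None:
                sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Keep-alive only helps when an idle connection is being silently discarded somewhere along the path; it cannot prevent a drop caused by the VM's own network stack resetting the connection.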

The way this affects builds is that buildbot holds open a TCP connection between the build-worker and the build-master (on port 8010 I believe, but it doesn't matter). If that connection is broken, it seems, the build is lost (even if worker and master are later able to reconnect). I wish buildbot could be a little more robust here, and I might even look into what I can do about that. But in the meantime that is the problem.

So for long-running builds (as most builds for Sage are), at some point into the build the connection is lost and so is the build. Buildbot will sometimes try to retry the build, but even the retries will fail for the same reason.

I have a suspicion that whatever causes this disconnection would probably cause the SSH disconnect at the same time, but have yet to confirm this.
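One way to check this would be a small probe that holds an otherwise idle TCP connection open and records when it ends, so the timestamp can be compared against the SSH and buildbot disconnects. The sketch below is hypothetical: `wait_for_drop` is an illustrative name, and the host and port to probe are left to the caller.

```python
import socket
from datetime import datetime, timezone


def wait_for_drop(host, port, connect_timeout=10.0):
    """Hold a TCP connection open; return the UTC time at which it ends."""
    with socket.create_connection((host, port),
                                  timeout=connect_timeout) as sock:
        sock.settimeout(None)  # block indefinitely once connected
        try:
            # recv() returns b"" when the peer closes the connection.
            while sock.recv(4096):
                pass  # discard anything the peer sends (e.g. a banner)
        except OSError:
            pass  # a reset or other network error also counts as a drop
    return datetime.now(timezone.utc)
```

Note that a connection that is dropped silently (no FIN or RST reaching this end) would leave `recv` blocked, so keep-alive would need to be enabled on the socket for that case to be noticed.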

This problem did not use to occur at all, so I began to suspect a problem in the network infrastructure somewhere close to the OpenStack service, though again I note that I've only seen this problem on my Windows instances.

I am now investigating a possible source of the problem related to DHCP leases: One concrete clue I have found in the Windows Event logs on one of the instances is that the disconnects coincide with a message in the DHCP-Client logs (with minor censorship):

The IP address lease aaa.aaa.aa.aa for the Network Card with network address 0xXXXXXXXXXXXX has been denied by the DHCP server bbb.bbb.bb.bb (The DHCP Server sent a DHCPNACK message).

Just a little before its current DHCP lease is set to expire, Windows requests a new one and appears to succeed. It obtains a new lease beginning right at that time. But at the same time the new lease is obtained, one of these error messages shows up in the log, and my connection drops at about the same time.
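To make the "at about the same time" observation checkable, the log message and the disconnect times could be matched up mechanically. The following is a sketch only: the regular expression is built from the single message quoted above, the helper names `parse_nack` and `correlated` are hypothetical, and the 30-second tolerance is an arbitrary assumption. How the event text and timestamps get exported from the Windows Event Log is left out.

```python
import re
from datetime import timedelta

# Pattern derived from the one DHCP-Client message quoted above.
NACK_RE = re.compile(
    r"The IP address lease (?P<lease>\S+) for the Network Card with "
    r"network address (?P<mac>\S+) has been denied by the DHCP server "
    r"(?P<server>\S+) \(The DHCP Server sent a DHCPNACK message\)"
)


def parse_nack(message):
    """Return lease, mac and server from a DHCPNACK message, or None."""
    match = NACK_RE.search(message)
    return match.groupdict() if match else None


def correlated(nack_times, drop_times, tolerance=timedelta(seconds=30)):
    """Return the drops that fall within `tolerance` of any NACK event."""
    return [drop for drop in drop_times
            if any(abs(drop - nack) <= tolerance for nack in nack_times)]
```

If most recorded drops show up in the output of `correlated`, that would support the DHCP renewal as the trigger; drops that do not match would point to a second cause.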

That's about all I know at this point, but the DHCP issue is at least a concrete thing I can investigate.

CC: @slel @vbraun

Component: misc

Reviewer: Dima Pasechnik

Issue created by migration from https://trac.sagemath.org/ticket/28092

embray commented 5 years ago

Description changed:

--- 
+++ 
@@ -13,7 +13,8 @@
 So for long-running builds (as most builds for Sage are), at some point into the build the connection is lost and so is the build.  Buildbot *will* sometimes try to retry the build, but even the retries will fail for the same reason.

 I have a suspicion that whatever causes this disconnection would probably cause the SSH disconnect at the same time, but have yet to confirm this. 
- This problem did not used to occur at all, so I began to suspect a problem in the network infrastructure somewhere close to the [OpenStack](../wiki/OpenStack) service, though again I note that I've only seen this problem on my Windows instances.
+
+This problem did not used to occur at all, so I began to suspect a problem in the network infrastructure somewhere close to the [OpenStack](../wiki/OpenStack) service, though again I note that I've only seen this problem on my Windows instances.

 I am now investigating a possible source of the problem related to DHCP leases: One concrete clue I have found in the Windows Event logs on one of the instances is that the disconnects coincide with a message in the DHCP-Client logs (with minor censorship):

embray commented 5 years ago

comment:2

For lack of better understanding what's going on with DHCP, I've tried disabling DHCP on the server and assigning a static IP. I don't know if that means OpenStack will then try to give that IP to another instance or what, but we'll see what happens...

embray commented 5 years ago

comment:3

So far as I can tell, the default network settings for our OpenStack project are to give each compute node a fixed IP anyways, so setting the IP to static on the VM should work for now. Indeed, more than 20 minutes in and no disconnects (usually it was roughly every 15 minutes, when it tried to renew the DHCP lease...)

It would still be nice to know what's going on with this, but if this works as a workaround I can live with it, and maybe update my deployment scripts to set each instance to a static IP after first obtaining its IP from DHCP.
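A rough sketch of what that deployment step might look like, assuming the address, mask and gateway handed out by DHCP have already been read from the instance. It only builds a `netsh interface ipv4 set address` command; the function name `static_ip_command` and the interface name are hypothetical, and this has not been tested against the instances discussed here.

```python
import ipaddress


def static_ip_command(interface, address, mask, gateway):
    """Build a netsh command that pins `interface` to a static IPv4 setup."""
    # Fail early on typos; ip_address() raises ValueError for bad input.
    for value in (address, mask, gateway):
        ipaddress.ip_address(value)
    return [
        "netsh", "interface", "ipv4", "set", "address",
        f"name={interface}", "source=static",
        f"address={address}", f"mask={mask}", f"gateway={gateway}",
    ]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from an elevated session on the VM. Whether OpenStack might later hand the same address to another instance is the open question raised in comment:2 and is not addressed by this.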

dimpase commented 3 years ago

comment:5

Can this be closed, as we can now build on GitHub Actions?

dimpase commented 2 years ago

Reviewer: Dima Pasechnik
