golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net/http: TestTransportDialTLSContext flakes in local testing #59035

Open rsc opened 1 year ago

rsc commented 1 year ago
% go test -c net/http
% stress ./http.test -test.count=10 -test.run=TestTransportDialTLSContext
254 runs so far, 0 failures
572 runs so far, 0 failures
860 runs so far, 0 failures
1163 runs so far, 0 failures
1460 runs so far, 0 failures
1737 runs so far, 0 failures

/var/folders/pm/qx0jkz117r1cvc05kf31fhh8000n9d/T/go-stress173291610
--- FAIL: TestTransportDialTLSContext (25.91s)
    --- FAIL: TestTransportDialTLSContext/h2 (25.91s)
        transport_test.go:3821: Get "https://127.0.0.1:59866": dial tcp 127.0.0.1:59866: connect: operation timed out
FAIL

ERROR: exit status 1

2026 runs so far, 1 failures
2320 runs so far, 1 failures
2615 runs so far, 1 failures
2915 runs so far, 1 failures
3212 runs so far, 1 failures

/var/folders/pm/qx0jkz117r1cvc05kf31fhh8000n9d/T/go-stress896441329
--- FAIL: TestTransportDialTLSContext (25.92s)
    --- FAIL: TestTransportDialTLSContext/https1 (25.91s)
        transport_test.go:3821: Get "https://127.0.0.1:59866": dial tcp 127.0.0.1:59866: connect: operation timed out
FAIL

ERROR: exit status 1

3512 runs so far, 2 failures
3804 runs so far, 2 failures

I see a similar failure during all.bash from time to time, but stress is an easier way to reproduce it.
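
(For anyone reproducing this: I believe the stress command here is golang.org/x/tools/cmd/stress; if so, it can be installed with

% go install golang.org/x/tools/cmd/stress@latest

and then invoked as above.)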

/cc @neild

bcmills commented 1 year ago

#57475 may be related.

bcmills commented 1 year ago

And #57475.

ianlancetaylor commented 1 year ago

A connection to #57475 doesn't seem likely to me. That test uses test hooks to force a timeout, and those failures were about a forced timeout not happening as fast as expected. This test is not using test hooks and appears to be making real network connections.

neild commented 1 year ago

I can replicate this on my laptop (darwin-arm64), but not on linux-amd64. I'm not sure about the root cause, but I suspect something platform-specific.

neild commented 1 year ago

This made me think of #37795, although the setup is a bit different and I don't think that precise bug should apply here.

I ran the test case from https://github.com/golang/go/issues/37795#issuecomment-1149201171 to see if the underlying macOS bug causing #37795 still exists. This test consists of a C client and server. The client loops, connecting to the server and immediately closing each connection. The server accepts connections, attempts to read one byte (returning EOF), and then closes the connection. Previously, this test would hang in the read.

The read hang seems to be fixed, but the client in this test now hangs for me after a few iterations, eventually returning ETIMEDOUT from connect. That seems like the same behavior seen in the flake here.

Not sure yet if this is a kernel bug or something else (port exhaustion? antivirus getting out of hand?).
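
(For what it's worth, one quick way to sanity-check the port-exhaustion theory on macOS, assuming the default ephemeral port range sysctls, is to compare the size of that range against the number of sockets sitting in TIME_WAIT when connect starts failing:

% sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last
% netstat -an | grep -c TIME_WAIT

If the TIME_WAIT count stays well below the size of the range, port exhaustion looks unlikely.)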

neild commented 1 year ago

I thought this might be port exhaustion, but that doesn't appear to be the case.

I can reproduce what appears to be the same issue using a pair of C programs (a small variant of the ones in #37795).

Client:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int i = 0;
        while (1) {
                int sock = socket(PF_INET, SOCK_STREAM, 0);
                if (sock < 0) {
                        perror("socket");
                        return 1;
                }

                struct sockaddr_in raddr;
                raddr.sin_family = AF_INET;
                raddr.sin_port = htons(2020);
                if (inet_aton("127.0.0.1", &raddr.sin_addr) <= 0) {
                        perror("inet_aton");
                        return 1;
                }

                if (connect(sock, (struct sockaddr *)&raddr, sizeof(raddr)) < 0) {
                        perror("connect");
                        return 1;
                }
                close(sock);

                printf("%d\n", i);
                i++;

                // Sleep for a bit to avoid ephemeral port exhaustion.
                if (i % 100 == 0) {
                        usleep(1000000);
                }
        }
}

Server:

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

int main() {
        int sock = socket(PF_INET, SOCK_STREAM, 0);
        if (sock < 0) {
                perror("socket");
                return 1;
        }

        struct sockaddr_in laddr;
        laddr.sin_family = AF_INET;
        laddr.sin_port = htons(2020);
        laddr.sin_addr.s_addr = 0;
        if (bind(sock, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
                perror("bind");
                return 1;
        }

        if (listen(sock, 10) < 0) {
                perror("listen");
                return 1;
        }

        while (1) {
                struct sockaddr_in sin;
                socklen_t slen = sizeof(sin);
                int c = accept(sock, (struct sockaddr *)&sin, &slen);
                if (c < 0) {
                        perror("accept");
                        return 1;
                }
                close(c);
        }
}

Run the server, then run the client. The client makes connections to the server and closes them, sleeping briefly every 100 connections to avoid ephemeral port exhaustion. After a while (about 16,000 connection attempts for me, so 2-3 minutes), the client's connect call fails with ETIMEDOUT.
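
To build and run them, assuming the two programs above are saved as server.c and client.c (the file names are my own choice):

% cc -o server server.c
% cc -o client client.c
% ./server &
% ./client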

rsc commented 1 year ago

I ran this on my Google-issued laptop and got the failure, but I've been unable to reproduce it on my personal MacBook Pro. It's possible this one is due to some non-standard antivirus or firewall on the Google laptop.