
TcpStream intermittently resets the connection when performing several simultaneous connections against localhost #18847

Closed: jdm closed this issue 9 years ago

jdm commented 9 years ago

Using rust-http master and rustc master, the following program intermittently returns a number of "Server returned malformed HTTP response" errors, which under the hood are "connection reset" errors from libstd. I've run this against both Python's SimpleHTTPServer and node's http-server module with no change in behaviour; when run against a remote server I never see any connection errors.

Interesting notes:

extern crate http;
extern crate url;

use http::client::{RequestWriter, NetworkStream};
use http::method::Get;
use std::task;
use url::Url;

static target: &'static str = "http://localhost:8000/index.html";

fn main() {
    for _ in range(0u32, 10u32) {
        task::spawn(proc() {
            let request = RequestWriter::<NetworkStream>::new(Get, Url::parse(target).unwrap());
            let writer = box request.unwrap();
            match writer.read_response() {
                Ok(_) => {},
                Err((_, e)) => println!("{}", e.desc),
            }
        });
    }
}
jdm commented 9 years ago

Note: I've only seen this behaviour on Mac, never on Linux.

jdm commented 9 years ago

Here's an even better testcase that doesn't involve rust-http at all:

use std::io::{IoResult};
use std::io::net::get_host_addresses;
use std::io::net::ip::{SocketAddr, Ipv4Addr};
use std::io::net::tcp::TcpStream;
use std::task;

static TARGET: &'static str = "localhost";

fn url_to_socket_addr(host: &str) -> IoResult<SocketAddr> {
    // Just grab the first IPv4 address
    let addrs = try!(get_host_addresses(host));
    let addr = addrs.into_iter().find(|&a| {
        match a {
            Ipv4Addr(..) => true,
            _ => false
        }
    });

    // TODO: Error handling
    let addr = addr.unwrap();

    let port = 8000;

    Ok(SocketAddr {
        ip: addr,
        port: port
    })
}

fn main() {
    for _ in range(0u32, 10u32) {
        task::spawn(proc() {
            let addr = url_to_socket_addr(TARGET).unwrap();
            let mut stream = TcpStream::connect(addr).unwrap();
            (write!(stream, "GET / HTTP/1.0\r\n")).unwrap();
            (write!(stream, "\r\n")).unwrap();
            stream.flush().unwrap();

            match stream.read_byte() {
                Ok(_) => {
                    stream.read_to_end().unwrap();
                    println!("success!");
                }
                Err(e) => println!("{}", e.desc),
            }
        });
    }
}
zonyitoo commented 9 years ago

I have the same problem when I open 300 simultaneous TcpStream connections to Memcached (address 127.0.0.1:11211).

It can be reproduced every time I run my benchmark program.

This is my test program.

I tested it on my laptop (Mac OS X 10.10.1, MacBook Pro with Retina display, Late 2013).

reem commented 9 years ago

I have also encountered this error when working with TCP, also on OS X 10.10.

alexcrichton commented 9 years ago

@jdm I cannot reproduce the error with this server program; do you have a standalone server I can play around with? Also, was this a recent regression, or has this been happening for some time now?

use std::io::{TcpListener, Listener, Acceptor};

fn main() {
    let mut l = TcpListener::bind("127.0.0.1:8000").unwrap().listen().unwrap();
    for mut s in l.incoming() {
        let _ = s.read_exact(18);
        let _ = s.write([1]);
    }
}
jdm commented 9 years ago

I've only tested this against the servers I specified in my original comment. I don't know if this is a recent regression; we only started getting enough information out of our test harness to diagnose this recently. FWIW, we see it using a 9/23 nightly.

jdm commented 9 years ago

I used python -m SimpleHTTPServer and node node_modules/http-server/bin/http-server. Nothing special.

alexcrichton commented 9 years ago

Thanks @jdm! I've reproduced locally and I hope to have time to investigate tonight.

alexcrichton commented 9 years ago

I'd like to see a reproduction of this against some known server that can be debugged easily. The setup below reproduces the problem seen here, and it basically mirrors what we're doing in Rust:

#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <errno.h>
#include <pthread.h>
#include <assert.h>

#define N 10

#define CHECK(e) if (!(e)) {              \
    printf("%s failed: %d\n", #e, errno); \
    perror("failure");                    \
    assert(0);                            \
  }

void *child(void *foo) {
  int s = socket(AF_INET, SOCK_STREAM, 0);
  CHECK(s != -1);

  struct sockaddr_in ip4addr;
  ip4addr.sin_family = AF_INET;
  ip4addr.sin_port = htons(8000);
  inet_pton(AF_INET, "127.0.0.1", &ip4addr.sin_addr);
  CHECK(connect(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) == 0);

  CHECK(write(s, "GET / HTTP/1.0\r\n", 16) == 16);
  CHECK(write(s, "\r\n", 2) == 2);
  char buf[1];
  CHECK(read(s, buf, 1) == 1);
  close(s);

  return foo;
}

int main() {
  pthread_t children[N];

  int i;
  for (i = 0; i < N; i++) {
    CHECK(pthread_create(&children[i], NULL, child, NULL) == 0);
  }
  for (i = 0; i < N; i++) {
    CHECK(pthread_join(children[i], NULL) == 0);
  }
}
$ python -m SimpleHTTPServer
// move to another shell
$ gcc foo.c && ./a.out
read(s, buf, 1) == 1 failed: 54
failure: Connection reset by peer
Assertion failed: (0), function child, file foo.c, line 34.
zsh: abort      ./a.out

For all I know, this could be just as much a bug on Python's side as on ours. Without being able to look closely at what's going on inside Python, though, I can't tell.

Manishearth commented 9 years ago

It's an error when serving via node as well, so it seems more likely that we're messing up somewhere.

jdm commented 9 years ago

As an additional data point, the same server and tests are being run in Firefox without this issue making an appearance.

alexcrichton commented 9 years ago

@jdm @Manishearth I'm sorry, but I don't have time to dig very far into the internals of Python's or Node's server implementations.

The C program I pasted above is quite small and should be easy to debug, and it's basically a close translation of what we do in the standard library (with lots of error handling removed). The fact that I could write a small Rust server that doesn't reproduce the error makes me doubt there's much we can do on our end to remedy this. I'd of course love to find a fix we could apply, though!

jdm commented 9 years ago

@alexcrichton: Interestingly enough, when I run the Rust server you pasted previously and bump N up to 150 in your C program, I also see the same output you get against the Python server.

jdm commented 9 years ago

I also see connection reset errors when running my Rust test against your server with range(0, 300).

alexcrichton commented 9 years ago

Ok, thanks for that info @jdm! I've managed to create a greatly reduced server:

#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <errno.h>
#include <pthread.h>
#include <assert.h>

#define N 200

#define CHECK(e) if (!(e)) {              \
    printf("%s failed: %d\n", #e, errno); \
    perror("failure");                    \
    assert(0);                            \
  }

int main() {
  int s = socket(AF_INET, SOCK_STREAM, 0);
  CHECK(s != -1);
  int opt = 1;
  CHECK(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) == 0);

  struct sockaddr_in ip4addr;
  ip4addr.sin_family = AF_INET;
  ip4addr.sin_port = htons(8000);
  inet_pton(AF_INET, "127.0.0.1", &ip4addr.sin_addr);
  CHECK(bind(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) == 0);
  CHECK(listen(s, 1) == 0);

  while (1) {
    int c = accept(s, NULL, NULL);
    CHECK(c != -1);
    char buf[1];
    switch (read(c, buf, 1)) {
      case 0: printf("eof\n"); break;
      case 1: break;
      default: printf("read error\n"); break;
    }
    CHECK(write(c, "a", 1) == 1);
    close(c);
  }
}

The key part of this server is the backlog parameter to listen, which in this case is 1. I can get the ECONNREFUSED error with a value of N=2 for the client C program listed above. Checking the listen(2) manpage, we see:

SYNOPSIS
     #include <sys/socket.h>

     int
     listen(int socket, int backlog);

DESCRIPTION
     Creation of socket-based connections requires several operations.
     First, a socket is created with socket(2).  Next, a willingness to
     accept incoming connections and a queue limit for incoming connections
     are specified with listen().  Finally, the connections are accepted
     with accept(2).  The listen() call applies only to sockets of type
     SOCK_STREAM or SOCK_SEQPACKET.

     The backlog parameter defines the maximum length for the queue of
     pending connections.  If a connection request arrives with the queue
     full, the client may receive an error with an indication of
     ECONNREFUSED.  Alternatively, if the underlying protocol supports
     retransmission, the request may be ignored so that retries may succeed.

I think this is basically a "welp, that's TCP" situation. It sounds like you need to bump the server's backlog parameter or lower the number of concurrent connections you're making.
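
For what it's worth, the smallest version of that fix on the reduced server above is to pass a larger backlog to listen(2). This is a sketch (not part of the original comment) showing only the changed lines; SOMAXCONN and 128 are arbitrary choices:

/* Sketch: same reduced server as above, but with a generous accept backlog.
   SOMAXCONN is the system maximum; any value comfortably above the expected
   burst of simultaneous connects (e.g. 128) also works. */
CHECK(bind(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) == 0);
CHECK(listen(s, SOMAXCONN) == 0);   /* was listen(s, 1) */

Even with a larger backlog, a burst of connects bigger than the queue can still produce resets, which is the point made in the next comment.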

thestinger commented 9 years ago

The backlog helps cope with spikes in latency, but you still need to handle connections as quickly as they arrive or the same situation will occur. If the connections are very short-lived, the dispatcher thread should really be handing them off via a bounded queue without making any additional system calls. The rest is just server performance: it could simply be that these servers are too slow to keep up.
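
For illustration only (this sketch is not from the thread), here is the reduced server restructured along those lines: the acceptor thread does nothing but accept(2) and push the connection fd into a bounded queue, and a fixed pool of worker threads performs the per-connection I/O. The queue uses a mutex and condition variables for simplicity, so the hand-off is not literally free of system calls; QUEUE_CAP and WORKERS are arbitrary values.

/* Sketch: acceptor hands fds to a worker pool through a bounded ring buffer. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define QUEUE_CAP 64
#define WORKERS   8

static int queue[QUEUE_CAP];
static int q_head, q_tail, q_len;
static pthread_mutex_t q_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_not_full  = PTHREAD_COND_INITIALIZER;

/* Block the acceptor if the workers fall behind. */
static void enqueue(int fd) {
  pthread_mutex_lock(&q_lock);
  while (q_len == QUEUE_CAP)
    pthread_cond_wait(&q_not_full, &q_lock);
  queue[q_tail] = fd;
  q_tail = (q_tail + 1) % QUEUE_CAP;
  q_len++;
  pthread_cond_signal(&q_not_empty);
  pthread_mutex_unlock(&q_lock);
}

static int dequeue(void) {
  pthread_mutex_lock(&q_lock);
  while (q_len == 0)
    pthread_cond_wait(&q_not_empty, &q_lock);
  int fd = queue[q_head];
  q_head = (q_head + 1) % QUEUE_CAP;
  q_len--;
  pthread_cond_signal(&q_not_full);
  pthread_mutex_unlock(&q_lock);
  return fd;
}

/* Same toy protocol as the reduced server: read one byte, write one byte. */
static void *worker(void *arg) {
  (void)arg;
  for (;;) {
    int c = dequeue();
    char buf[1];
    if (read(c, buf, 1) == 1)
      write(c, "a", 1);
    close(c);
  }
  return NULL;
}

int main(void) {
  int s = socket(AF_INET, SOCK_STREAM, 0);
  int opt = 1;
  setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

  struct sockaddr_in ip4addr;
  memset(&ip4addr, 0, sizeof(ip4addr));
  ip4addr.sin_family = AF_INET;
  ip4addr.sin_port = htons(8000);
  inet_pton(AF_INET, "127.0.0.1", &ip4addr.sin_addr);
  if (bind(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) != 0 ||
      listen(s, 128) != 0) {
    perror("bind/listen");
    return 1;
  }

  pthread_t workers[WORKERS];
  for (int i = 0; i < WORKERS; i++)
    pthread_create(&workers[i], NULL, worker, NULL);

  /* Acceptor: accept and hand off, nothing else. */
  for (;;) {
    int c = accept(s, NULL, NULL);
    if (c != -1)
      enqueue(c);
  }
}

If the burst of connects can exceed the backlog faster than the acceptor loops around, this still needs to be paired with a larger listen() backlog, since the kernel rejects or resets connections before accept() ever sees them.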

jdm commented 9 years ago

Setting the Python server's request_queue_size parameter did fix the problem we're seeing in our tests. Thanks!

alexcrichton commented 9 years ago

Glad to hear! I'm going to close this as working-as-intended in that case.