Using rust-http master and rustc master, the following program intermittently returns a number of "Server returned malformed HTTP response" errors, which under the hood are "connection reset" errors from libstd. I've run this against both Python's SimpleHTTPServer and node's http-server module with no change in behaviour; when run against a remote server I never see any connection errors.
Interesting notes:
I've only seen this behaviour on Mac, never on Linux.
Here's an even better testcase that doesn't involve rust-http at all:
use std::io::IoResult;
use std::io::net::get_host_addresses;
use std::io::net::ip::{SocketAddr, Ipv4Addr};
use std::io::net::tcp::TcpStream;
use std::task;

static TARGET: &'static str = "localhost";

fn url_to_socket_addr(host: &str) -> IoResult<SocketAddr> {
    // Just grab the first IPv4 address
    let addrs = try!(get_host_addresses(host));
    let addr = addrs.into_iter().find(|&a| {
        match a {
            Ipv4Addr(..) => true,
            _ => false
        }
    });

    // TODO: Error handling
    let addr = addr.unwrap();
    let port = 8000;
    Ok(SocketAddr {
        ip: addr,
        port: port
    })
}

fn main() {
    for _ in range(0u32, 10u32) {
        task::spawn(proc() {
            let addr = url_to_socket_addr(TARGET).unwrap();
            let mut stream = TcpStream::connect(addr).unwrap();
            (write!(stream, "GET / HTTP/1.0\r\n")).unwrap();
            (write!(stream, "\r\n")).unwrap();
            stream.flush().unwrap();
            match stream.read_byte() {
                Ok(_) => {
                    stream.read_to_end().unwrap();
                    println!("success!");
                }
                Err(e) => println!("{}", e.desc),
            }
        });
    }
}
I have the same problem when I make 300 simultaneous TcpStream connections to Memcached (at 127.0.0.1:11211).
It reproduces every time I run my benchmark program.
This is my test program.
I tested it on my laptop (Mac OS X 10.10.1, MacBook Pro with Retina display, Late 2013).
I have also encountered this error when working with TCP, also on OS X 10.10.
@jdm I cannot reproduce the error with this server program; do you have a standalone server I can play around with? Also, is this a recent regression, or has it been happening for some time now?
use std::io::{TcpListener, Listener, Acceptor};

fn main() {
    let mut l = TcpListener::bind("127.0.0.1:8000").unwrap().listen().unwrap();
    for mut s in l.incoming() {
        let _ = s.read_exact(18);
        let _ = s.write([1]);
    }
}
I've only tested this against the servers I specified in my original comment. I don't know if this is a recent regression; we only started getting enough information out of our test harness to diagnose this recently. FWIW, we see it using a 9/23 nightly.
I used python -m SimpleHTTPServer and node node_modules/http-server/bin/http-server. Nothing special.
Thanks @jdm! I've reproduced locally and I hope to have time to investigate tonight.
I think I would like to see a reproduction of this with some known server running that can be debugged easily. This current setup reproduces the problem seen here, and it should mirror basically what we're doing in Rust:
#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <errno.h>
#include <pthread.h>
#include <assert.h>

#define N 10

#define CHECK(e) if (!(e)) { \
    printf("%s failed: %d\n", #e, errno); \
    perror("failure"); \
    assert(0); \
}

void *child(void *foo) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    CHECK(s != -1);

    struct sockaddr_in ip4addr;
    ip4addr.sin_family = AF_INET;
    ip4addr.sin_port = htons(8000);
    inet_pton(AF_INET, "127.0.0.1", &ip4addr.sin_addr);

    CHECK(connect(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) == 0);
    CHECK(write(s, "GET / HTTP/1.0\r\n", 16) == 16);
    CHECK(write(s, "\r\n", 2) == 2);

    char buf[1];
    CHECK(read(s, buf, 1) == 1);
    close(s);
    return foo;
}

int main() {
    pthread_t children[N];
    int i;
    for (i = 0; i < N; i++) {
        CHECK(pthread_create(&children[i], NULL, child, NULL) == 0);
    }
    for (i = 0; i < N; i++) {
        CHECK(pthread_join(children[i], NULL) == 0);
    }
}
$ python -m SimpleHTTPServer
// move to another shell
$ gcc foo.c && ./a.out
read(s, buf, 1) == 1 failed: 54
failure: Connection reset by peer
Assertion failed: (0), function child, file foo.c, line 34.
zsh: abort ./a.out
For all I know this could be just as much of a bug on python's side as it is on our side. Without being able to look closely at what's going on in python though, I can't tell.
It's an error when serving via node as well, so it seems more likely that we're messing up somewhere.
As an additional data point, the same server and tests are being run in Firefox without this issue making an appearance.
@jdm @Manishearth I'm sorry but I don't have time to dig very far into the internals of python's or node's server implementations.
The C program I pasted above is quite small and should be easy to debug, and it's essentially a close translation of what we're doing in the standard library (with lots of error handling removed). The fact that I could write a small Rust server which doesn't reproduce the error makes me doubt there's much we can do on our end to remedy this. I'd of course love to find a fix that we could apply, though!
@alexcrichton: Interestingly enough, when I run the Rust server you pasted previously and bump N up to 150 in your C program, I also see the same output you get against the Python server.
I also see connection reset errors when running my Rust test against your server with range(0, 300).
Ok, thanks for that info @jdm! I've managed to create a greatly reduced server:
#include <sys/types.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netdb.h>
#include <stdio.h>
#include <errno.h>
#include <pthread.h>
#include <assert.h>

#define N 200

#define CHECK(e) if (!(e)) { \
    printf("%s failed: %d\n", #e, errno); \
    perror("failure"); \
    assert(0); \
}

int main() {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    CHECK(s != -1);
    int opt = 1;
    CHECK(setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) == 0);

    struct sockaddr_in ip4addr;
    ip4addr.sin_family = AF_INET;
    ip4addr.sin_port = htons(8000);
    inet_pton(AF_INET, "127.0.0.1", &ip4addr.sin_addr);

    CHECK(bind(s, (struct sockaddr*) &ip4addr, sizeof(ip4addr)) == 0);
    CHECK(listen(s, 1) == 0);

    while (1) {
        int c = accept(s, NULL, NULL);
        CHECK(c != -1);
        char buf[1];
        switch (read(c, buf, 1)) {
            case 0: printf("eof\n"); break;
            case 1: break;
            default: printf("read error\n"); break;
        }
        CHECK(write(c, "a", 1) == 1);
        close(c);
    }
}
The key part of this server is the parameter to listen, which in this case is 1. I can get the ECONNREFUSED error with a value of N=2 for the client C program I listed above. Checking the manpage of listen, we see:
SYNOPSIS
     #include <sys/socket.h>

     int
     listen(int socket, int backlog);

DESCRIPTION
     Creation of socket-based connections requires several operations.
     First, a socket is created with socket(2).  Next, a willingness to
     accept incoming connections and a queue limit for incoming connections
     are specified with listen().  Finally, the connections are accepted
     with accept(2).  The listen() call applies only to sockets of type
     SOCK_STREAM or SOCK_SEQPACKET.

     The backlog parameter defines the maximum length for the queue of
     pending connections.  If a connection request arrives with the queue
     full, the client may receive an error with an indication of
     ECONNREFUSED.  Alternatively, if the underlying protocol supports
     retransmission, the request may be ignored so that retries may
     succeed.
I think this is basically a "welp, that's TCP" situation. It sounds like you need to bump the server's backlog parameter or lower the number of concurrent connections you're making.
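The last sentence of that manpage excerpt also suggests a client-side mitigation: since TCP supports retransmission, a refused or reset connection can simply be retried. Here is a rough sketch of such a client (an illustration, not code from the thread; the attempt limit, backoff delay, and hardcoded address are arbitrary choices):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical retrying client (illustration only): when the server's
   accept queue overflows and the connection is refused or reset, back
   off and try again.  Five attempts and a 10ms base delay are arbitrary. */
static int fetch_with_retry(void) {
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(8000);
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

    int attempt;
    for (attempt = 0; attempt < 5; attempt++) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s == -1)
            return -1;
        char buf[1];
        if (connect(s, (struct sockaddr *) &addr, sizeof(addr)) == 0 &&
            write(s, "GET / HTTP/1.0\r\n\r\n", 18) == 18 &&
            read(s, buf, 1) == 1) {
            close(s);
            return 0;                /* got a byte of the response */
        }
        int saved = errno;
        close(s);
        /* An EOF (0-byte read) leaves errno untouched and is treated
           as a hard failure here. */
        if (saved != ECONNREFUSED && saved != ECONNRESET)
            return -1;               /* not a queue-overflow symptom */
        usleep(10000 << attempt);    /* crude exponential backoff */
    }
    return -1;
}

int main() {
    if (fetch_with_retry() == 0)
        printf("success!\n");
    else
        printf("giving up\n");
}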
The backlog helps cope with spikes in latency, but you need to handle connections as quickly as they arrive or the same situation will recur. If the connections are very short-lived, the dispatcher thread should really be handing them off via a bounded queue without making any system calls. The rest is just about server performance; it could simply be that these servers are too slow to keep up.
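A rough sketch of that dispatch pattern (an illustration, not code from the thread; the queue depth, worker count, backlog, and port are all arbitrary choices):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define QUEUE_DEPTH 64   /* arbitrary bound on queued connections */
#define WORKERS 8        /* arbitrary worker count */

/* Bounded ring buffer of accepted file descriptors.  An uncontended
   pthread mutex stays in userspace, so in the common case the accept
   loop only enters the kernel for accept() itself. */
static int queue[QUEUE_DEPTH];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

static void push(int fd) {
    pthread_mutex_lock(&lock);
    while (count == QUEUE_DEPTH)
        pthread_cond_wait(&not_full, &lock);
    queue[tail] = fd;
    tail = (tail + 1) % QUEUE_DEPTH;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static int pop(void) {
    pthread_mutex_lock(&lock);
    while (count == 0)
        pthread_cond_wait(&not_empty, &lock);
    int fd = queue[head];
    head = (head + 1) % QUEUE_DEPTH;
    count--;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return fd;
}

/* Workers do the same one-byte exchange as the reduced server above;
   error handling is elided for brevity. */
static void *worker(void *arg) {
    (void) arg;
    for (;;) {
        int c = pop();
        char buf[1];
        if (read(c, buf, 1) >= 0)
            write(c, "a", 1);
        close(c);
    }
    return NULL;
}

int main() {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    struct sockaddr_in ip4addr;
    memset(&ip4addr, 0, sizeof(ip4addr));
    ip4addr.sin_family = AF_INET;
    ip4addr.sin_port = htons(8000);
    ip4addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(s, (struct sockaddr *) &ip4addr, sizeof(ip4addr));
    listen(s, 128);   /* a deeper backlog than 1, as suggested above */

    pthread_t threads[WORKERS];
    int i;
    for (i = 0; i < WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    /* Dispatcher: accept and enqueue, nothing else. */
    for (;;) {
        int c = accept(s, NULL, NULL);
        if (c != -1)
            push(c);
    }
}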
Setting the Python server's request_queue_size parameter did fix the problem we were seeing in our tests. Thanks!
Glad to hear! I'm going to close this as working-as-intended in that case.