mozilla / sccache

Sccache is a ccache-like tool. It is used as a compiler wrapper and avoids compilation when possible. Sccache has the capability to utilize caching in remote storage environments, including various cloud storage options, or alternatively, in local storage.
Apache License 2.0
5.74k stars 542 forks source link

Connection to server timed out #364

Open crawford opened 5 years ago

crawford commented 5 years ago

I'm trying to configure my client to connect to a remote scheduler that I have set up, but I'm getting "Connection to server timed out".

On the remote end, I have the scheduler running:

$ RUST_LOG=info ./target/release/sccache-dist scheduler --config scheduler.conf
 INFO 2019-01-28T20:36:20Z: sccache::dist::http::server: Scheduler listening for clients on <my ip address>:10600
 INFO 2019-01-28T20:36:26Z: sccache::dist::http::server: Adding new certificate for <my ip address>:10501 to scheduler
 INFO 2019-01-28T20:36:26Z: sccache_dist: Registered new server ServerId(V4(<my ip address>:10501))

And I have the server running:

$ RUST_LOG=info sudo -E target/release/sccache-dist server --config server.conf
 INFO 2019-01-28T20:36:16Z: sccache_dist::build: Creating overlay builder
 INFO 2019-01-28T20:36:16Z: sccache::dist::http::server: Server listening for clients on <my ip address>:10501

Based on those log lines, it looks like the scheduler and the build server are okay.

I've configured my local client as follows:

[dist]
# The URL used to connect to the scheduler (should use https, given an ideal
# setup of a HTTPS server in front of the scheduler)
scheduler_url = "http://<my ip address>:10600"
# Used for mapping local toolchains to remote cross-compile toolchains. Empty in
# this example where the client and build server are both Linux.
toolchains = []
# Size of the local toolchain cache, in bytes.
toolchain_cache_size = 1073741824

[dist.auth]
type = "token"
token = "<secret>"

When I try running the client though, it fails:

$ RUST_LOG=sccache=trace sccache --show-stats
TRACE 2019-01-28T20:53:04Z: sccache::cmdline: parse
DEBUG 2019-01-28T20:53:04Z: sccache::config: Attempting to read config file at "/home/alex/.config/sccache/config"
TRACE 2019-01-28T20:53:04Z: sccache::commands: Command::ShowStats(text)
TRACE 2019-01-28T20:53:04Z: sccache::commands: connect_or_start_server(4226)
TRACE 2019-01-28T20:53:04Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:04Z: sccache::commands: run_server_process
TRACE 2019-01-28T20:53:04Z: sccache::cmdline: parse
DEBUG 2019-01-28T20:53:04Z: sccache::config: Attempting to read config file at "/home/alex/.config/sccache/config"
TRACE 2019-01-28T20:53:04Z: sccache::commands: Command::InternalStartServer
TRACE 2019-01-28T20:53:09Z: sccache::client: connect_with_retry(4226)
TRACE 2019-01-28T20:53:09Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:09Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:10Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:10Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:11Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:11Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:12Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:12Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:13Z: sccache::client: connect_to_server(4226)
TRACE 2019-01-28T20:53:13Z: sccache::client: connect_to_server(4226)
error: Connection to server timed out

Looking at strace output shows that it never attempts to connect to the remote scheduler and build server:

$ strace -e trace=connect sccache --show-stats
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=20957, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
connect(3, {sa_family=AF_INET, sin_port=htons(4226), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
error: Connection to server timed out
+++ exited with 2 +++

All of the running components have been built from the 0.2.8 tag:

$ sccache --version
sccache 0.2.8
$ sccache-dist --version
sccache 0.2.8

I must be missing something obvious here. How do I tell sccache to connect to my remote server?

jwatt commented 5 years ago

I'm not an sccache dev, but my first though would be a firewall issue. Have you eliminated that? For example, what does the following output:

wget http://<your ip address>:10600/api/v1/scheduler/status

crawford commented 5 years ago

I'm running this on Container Linux which doesn't have a firewall by default. I'm able to GET that resource, but it's a binary blob. Is that expected?

$ curl -s http://<my ip address>:10600/api/v1/scheduler/status | xxd
00000000: 0100 0000 0000 0000                      ........
crawford commented 5 years ago

I had some time to dig into this today and it looks like the client doesn't directly talk to the remote server (explains all of the traces above). Instead, the client talks to the local (internal) server, which then talks to the remote server.

After ensuring the local server is running, I still don't see the remote server actually performing any work. From the scheduler, I see the following:

TRACE 2019-04-09T18:17:25Z: sccache::dist::http::server: Req 1 (<client>:1837): Request { method: "POST", url: "/api/v1/scheduler/alloc_job", headers: [("user-agent", "reqwest/0.9.5"), ("accept", "*/*"), ("authorization", "Bearer <token>"), ("content-type", "application/octet-stream"), ("content-length", "136"), ("accept-encoding", "gzip"), ("host", "<scheduler>:10600")], https: false, remote_addr: V4(<client>:1837) }
TRACE 2019-04-09T18:17:25Z: sccache::dist::http::server: Req 1: alloc_job: Toolchain { archive_id: "0559487cb33db1186380a6c418f139c274b8aa50d100bf5b760d24fbde44492aab61a1fc61ae7fc1ce01e52d33e633517d54e435abfc4599d99fddbcd583f7ca" }
 INFO 2019-04-09T18:17:25Z: sccache_dist: Job 1 created and will be assigned to server ServerId(V4(<server>:10501))
...
 WARN 2019-04-09T18:17:55Z: sccache::dist::http::server: Res 1 error: assign job failed, caused by: POST to scheduler assign_job failed, caused by: timed out

And on the server, I see the following:

TRACE 2019-04-09T18:17:25Z: sccache::dist::http::server: Req 1 (<scheduler>:56760): Request { method: "POST", url: "/api/v1/distserver/assign_job/1", headers: [("user-agent", "reqwest/0.9.5"), ("accept", "*/*"), ("authorization", "Bearer <some JWT>"), ("content-type", "application/octet-stream"), ("content-length", "136"), ("accept-encoding", "gzip"), ("host", "<server>:10501")], https: false, remote_addr: V4(<server>:56760) }
TRACE 2019-04-09T18:17:25Z: sccache::dist::http::server: Req 1: assign_job(1): Toolchain { archive_id: "0559487cb33db1186380a6c418f139c274b8aa50d100bf5b760d24fbde44492aab61a1fc61ae7fc1ce01e52d33e633517d54e435abfc4599d99fddbcd583f7ca" }
TRACE 2019-04-09T18:17:25Z: sccache::dist::http::server: Res 1: Response { status_code: 200, headers: [("Content-Type", "application/octet-stream")] }

It's not clear to me why the server is accepting the job but isn't actually working on it.

GeorgeHahn commented 4 years ago

Did you have any success getting this working? I am seeing the same behavior - the server accepts the job but the client times out on assign_job.

crawford commented 4 years ago

@GeorgeHahn I haven't had a chance to get this working, unfortunately. I'll definitely post updates once I've had a chance to dig into this further.

JJJollyjim commented 4 years ago

I'm encountering the same bug (on Arch Linux, sccache 0.2.13, scheduler and server are same machine). The server's trace claims it responds but the scheduler times out waiting for it.

bryanburgers commented 4 years ago

I tried to get this working by compiling from source on Ubuntu 20.04.

When I checked out the 0.2.13 tag, I had this same problem.

When I checked out master at 182c59ed6039fdbad3522b211d90c76acf8baa6f I could get it to work! So maybe it's been fixed since the 0.2.13 release?

anholt commented 4 years ago

I also had luck with master recently, but wasn't sure if I'd changed some configuration since replicating this bug.

Frederick888 commented 3 years ago

I'm experiencing the same issue as @crawford using 0.2.14 (Linux x86_64 across the board). Client can pack and upload the toolchain to the server, then the server can also unpack the toolchain, but after that the server never actually does any compilation and in scheduler's log I can see Res 50 error: assign job failed, job un-assigned from the server, caused by: POST to scheduler assign_job failed, caused by: https://<ip>:10501/api/v1/distserver/assign_job/31: timed out.

I tried reverting a6a8becb842198e2d04302bfdacdf1c66740a7b0 and now the 'timed out' messages are gone but server is still not doing any job.