Precise code intel broken on Linux?

sqs commented 1 year ago

Repro:

On Sourcegraph App version 2023.03.27+210185.ae7f75, add the repo github.com/hashicorp/errwrap and then trigger a precise code nav indexing job for HEAD.
It fails at the Upload step.

The reason appears to be that the docker run command for the upload step is missing --add-host=host.docker.internal:host-gateway, which is added when the executors.frontendURL site config setting URL has the host host.docker.internal. The dockerOptions func that sets this sees c.FrontendURL as http://localhost:3080, which is NOT the URL that the commands running inside the executor's Docker containers can access the host Sourcegraph App at.

The fix might just be consulting c.ExecutorsFrontendURL() instead of c.FrontendURL in dockerOptions, or maybe it is more complex and Sourcegraph App needs to listen on more than just localhost so that Docker containers can contact it.

Upload step command:

docker run --rm --cpus 4 --memory 12G -v /tmp/.searcher.tmp/tmpfriend-3834734-671751445/workspace-1-2235544553:/data -w /data -e SRC_ENDPOINT=http://host.docker.internal:3080 -e SRC_HEADER_AUTHORIZATION=token-executor REDACTED --entrypoint /bin/sh sourcegraph/src-cli:5.0.0 /data/.sourcegraph-executor/1.2_github.com_hashicorp_errwrap@3622ed43b5c382494ba1496673ff8d4a9d4f6ea7.sh

Log output:

+ src lsif upload -no-progress -repo github.com/hashicorp/errwrap -commit 3622ed43b5c382494ba1496673ff8d4a9d4f6ea7 -root . -upload-route /.executors/lsif/upload -file dump.lsif -associated-index-id 1
Head "http://host.docker.internal:3080//.executors/scip/upload": dial tcp: lookup host.docker.internal on 192.168.86.1:53: no such host

slimsag commented 1 year ago

Leaving open until we confirm the fix. What needs to be done:

Test this works on macOS using sg start app or a release binary (not the macOS bundle)
Test this works on Linux using sg start app or a new release binary

Note that Linux behavior must be tested because #50020 there are differences with docker.host.internal not existing there, so best we make sure it is working.

umpox commented 1 year ago

Assigning myself to this (as I also need to make sure I can run sg start app).

I don't have a Linux machine though, @fkling you're on Linux right?

umpox commented 1 year ago

Confirmed working on macOS:

Steps followed:

Start app locally with sg start app (ran from commit: 82eca3c247)
Navigate to https://sourcegraph.test:3443/github.com/hashicorp/errwrap/-/code-graph/indexes
Trigger precise code index by enqueueing HEAD

Observations:

Upload complete (see above screenshot)
Precise navigation shows and works as expected for code

fkling commented 1 year ago

@umpox and I figured out why it doesn't work on Linux but it's not clear what the best resolution is.

The error I'm getting is

Head "http://host.docker.internal:3082//.executors/scip/upload": dial tcp 172.17.0.1:3082: connect: connection refused

As per @slimsag's suggestion I confirmed that the docker container was started with --add-host=host.docker.internal:host-gateway. So what's going on?

Root cause

It all boils down to the fact that the sourcegraph app is listen on interface 127.0.0.1, but docker gateway's interface/IP is 172.17.0.1. I.e. host.docker.internal resolves to the the 172.* address but the app is not listening on that interface.

To confirm this I made the following experiment:

Start a dummy HTTP server on 3082 (or any other port, it doesn't matter) via
```
while true; do echo -e "HTTP/1.1 200 OK\r\n$(date)\r\n\r\n" | nc -l -p 3082 -s 127.0.0.1 -q 1; done
```
This simulates the upload endpoint.

Connect to the endpoint from inside docker via

docker run --rm --add-host=host.docker.internal:host-gateway curlimages/curl curl --head 'http://host.docker.internal:3082//.executors/scip/upload'

This fails with a similar error:

curl: (7) Failed to connect to localhost port 3082 after 0 ms: Couldn't connect to server

NOTE: Various sources on the Internet seem to suggest that --add-host host.docker.internal:host-gateway should just work on Linux too, but these sources often don't state which interface(s) the target service listens on. If it's on all interfaces (0.0.0.0) then yes, it will work. If it's on localhost then no, or at least not in my case.

Possible solutions

There are two ways to make this work:

Make the service listen on a different interface. Listening on all interfaces (0.0.0.0 or omitting -s <IP> from the test command above), or explicitly listening on docker's gateway interface (172.17...., whatever it is called) both work.
Make the container run on the same network layer, via --network=host --add-host=host.docker.internal:127.0.0.1. I was able to confirm that with the sourcegraph binary itself by hacking these options into the command, as shown in the screenshot.

2023-04-20_15-02

Listening on all interfaces isn't a good solution for obvious reasons. If the sourcegraph binary only needs to expose those services to docker then having those services listen on the docker gateway interface seems to the best solution. Maybe that would even work in a OS agnostic way. Ideally we could find a solution that works on every OS. If not then we should have the ability to configure host or container options for different builds.

slimsag commented 1 year ago

Awesome work digging into this! That is a super helpful write-up!

Listening on all interfaces isn't a good solution for obvious reasons. If the sourcegraph binary only needs to expose those services to docker then having those services listen on the docker gateway interface seems to the best solution.

This seems like a sane approach to me; @eseliger do you have any thoughts/opinions here?

eseliger commented 1 year ago

Listening on the docker default network as well sounds reasonable to me for the App use-case. Not sure how easy it'd be to detect if that port needs to be claimed or not. It's weird to me that macOS and Linux differ here 😀

One thing to keep in mind, docker installations on Linux are generally more customized than they are on Mac where virtually everyone just uses whatever docker gives them. On Linux, other firewall things and such could still break us, but I guess that's "expected" if you configure your firewall in weird ways.

fkling commented 1 year ago

@eseliger are executors the only part that are run via docker in app?

I'm now thinking that it might make sense to create a separate, sourcegraph app specific network and have App listen on that interface specifically. Do you, or @slimsag know which services need to be accessible to docker and which need to be accessible on the host? (e.g. the web frontend obviously needs to accessible from the host).

This is what I'm seeing when I run sg start app:

tcp        0      0 127.0.0.1:3434          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3178          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3180          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3181          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3182          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3184          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3188          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3189          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3082          0.0.0.0:*               LISTEN      1904/.bin/sourcegra 
tcp        0      0 127.0.0.1:3090          0.0.0.0:*               LISTEN      1904/.bin/sourcegra            
tcp        0      0 127.0.0.1:6996          0.0.0.0:*               LISTEN      1904/.bin/sourcegra                  
tcp        0      0 127.0.0.1:4319          0.0.0.0:*               LISTEN      1904/.bin/sourcegra              
tcp        0      0 127.0.0.1:9000          0.0.0.0:*               LISTEN      1904/.bin/sourcegra               
tcp        0      0 127.0.0.1:9991          0.0.0.0:*               LISTEN      1904/.bin/sourcegra

What are these doing and which ones need to be exposed only to executors/other docker processes? Sorry for all the questions, my knowledge of the backend is limited... pointing me in the right direction (e.g. where all of this is configured) probably suffices.

eseliger commented 1 year ago

@eseliger are executors the only part that are run via docker in app?

To my understanding, yes, that's correct.

I'm now thinking that it might make sense to create a separate, sourcegraph app specific network and have App listen on that interface specifically. Do you, or @slimsag know which services need to be accessible to docker and which need to be accessible on the host? (e.g. the web frontend obviously needs to accessible from the host).

I think it could generally make sense for the executor to create a separate network to run the docker containers in, this is generally best practice for docker containers anyways to not rely on the default network. That doesn't even need to be app specific. The only service executors need to be able to reach is the public facing HTTP endpoint of the frontend service. That would probably be port 3180(?).

Is the separate docker network something you guys are able to take on or should we raise this with the executors people?

fkling commented 1 year ago

@eseliger

Is the separate docker network something you guys are able to take on or should we raise this with the executors people?

Given my limited understanding of the backend I'd appreciate it if someone else can look into this (probably more efficient).

The only service executors need to be able to reach is the public facing HTTP endpoint of the frontend service. That would probably be port 3180(?).

So this port is meant to be "world" accessible? Because that's what we currently seem to be doing. If that's fine, then maybe there is nothing to be done to make it work on Linux in production. Unfortunately I cannot build a Linux binary with the latest changes to confirm my assumptions. But it looks like only when running via sg we are overriding the interface for the frontend which causes the service to only listen on localhost and not on all interfaces. I don't know why we are doing this though.

In other words: While using a separate network might be the better way anyways, it might still not work in dev mode, due to the special setup.

eseliger commented 1 year ago

So this port is meant to be "world" accessible? Because that's what we currently seem to be doing. If that's fine, then maybe there is nothing to be done to make it work on Linux in production. Unfortunately I cannot build a Linux binary with the latest changes to confirm my assumptions. But it looks like only when running via sg we are overriding the interface for the frontend which causes the service to only listen on localhost and not on all interfaces. I don't know why we are doing this though.

Correct! Only the globally accessible interface of sourcegraph needs to be reachable by an executor. Let me know if that works or if we need to do more work here :)

sourcegraph / sourcegraph-public-snapshot

Precise code intel broken on Linux? #50011

Root cause

Possible solutions