oxidecomputer / falcon

Fast Assessment Laboratory for Computers On Networks
Mozilla Public License 2.0
5 stars 2 forks source link

Node port selection is racy #82

Closed rcgoodfellow closed 3 weeks ago

rcgoodfellow commented 1 month ago

This code picks ports that are not currently in use and then uses those ports to launch an instance. In the time between picking the port and using the port, the port may be used out from under us. This appeared to happen today.

https://github.com/oxidecomputer/falcon/blob/c7cdcfc40e0472fe3b66393760e744dd71e6a1ea/lib/src/lib.rs#L579-L587

rcgoodfellow commented 1 month ago

There are two ports that we use. One for the propolis API and another for a VNC port. We can probably avoid this problem entirely for VNC by just leaving it up to the user to specify the desired VNC port in the model file if they, in fact, desire VNC.

For the propolis API server port, we can use port 0 which indicates the OS should pick at the time the propolis instance starts, and then grab the port from the log file or from pfiles.

For example

ry@masaka:~/src/testbed/a4x2/.falcon$ pgrep -lf g0
24880 /usr/bin/propolis-server-vlan run .falcon/g0.toml [::]:16106 [::]:18949
ry@masaka:~/src/testbed/a4x2/.falcon$ pfexec pfiles 24880 | grep INET
    sockname: AF_INET6 ::  port: 16106
    sockname: AF_INET6 ::  port: 18949
ry@masaka:~/src/testbed/a4x2/.falcon$ cat g0.out | grep local_addr |  grep listening | looker
21:38:24.088Z INFO propolis-server: listening
    local_addr = [::]:16106
davepacheco commented 1 month ago

In case the raw data is useful: the reason we think a port collision happened is that my a4x2 launch failed with:

Aug 07 21:51:08.164 INFO ce: mounting /opt/cargo-bay
Aug 07 21:51:09.220 INFO cr1: mounting /opt/cargo-bay
Communication Error: error sending request for url (http://[::1]:16934/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
Aug 07 21:51:10.468 ERRO launch failed: Communication Error: error sending request for url (http://[::1]:16934/instance): error trying to connect: tcp connect error: Connection refused (os error 146)
Aug 07 21:51:10.468 INFO ce: init start
Aug 07 21:51:10.469 INFO cr1: init start
Aug 07 21:51:10.469 INFO cr2: init start
Aug 07 21:51:10.469 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:10.470 INFO g0: init start
Aug 07 21:51:10.470 INFO g1: init start
Aug 07 21:51:10.470 INFO g2: init start
Aug 07 21:51:10.470 INFO g3: init start
Aug 07 21:51:11.471 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:12.473 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:12.932 INFO cr1: init finish
Aug 07 21:51:13.474 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:14.475 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:15.476 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:16.478 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:17.479 WARN [sc] cr2: IO error: Connection refused (os error 146)
Aug 07 21:51:18.145 INFO ce: init finish
Aug 07 21:51:18.480 WARN [sc] cr2: IO error: Connection refused (os error 146)
...
Error: cr2: init IO error: Connection refused (os error 146)

There are processes for the other nodes, but not cr2:

dap@ivanova a4x2 $ ps -opid,args -p "$(pgrep propolis-server)"
  PID COMMAND
 9242 propolis-server run .falcon/ce.toml [::]:21843 [::]:16936
 9243 propolis-server run .falcon/cr1.toml [::]:19662 [::]:18624
 9245 propolis-server run .falcon/g0.toml [::]:22071 [::]:21076
 9246 propolis-server run .falcon/g1.toml [::]:16966 [::]:24634
 9247 propolis-server run .falcon/g2.toml [::]:24335 [::]:17057
 9248 propolis-server run .falcon/g3.toml [::]:15673 [::]:23039

We thought maybe I'd forgotten to destroy the previous run, but my shell history shows I did, and the start time of the processes and the Falcon files seems to agree:

dap@ivanova a4x2 $ ps -opid,stime,args -p "$(pgrep propolis-server)"
  PID    STIME COMMAND
 9242 14:49:51 propolis-server run .falcon/ce.toml [::]:21843 [::]:16936
 9243 14:49:51 propolis-server run .falcon/cr1.toml [::]:19662 [::]:18624
 9245 14:49:51 propolis-server run .falcon/g0.toml [::]:22071 [::]:21076
 9246 14:49:51 propolis-server run .falcon/g1.toml [::]:16966 [::]:24634
 9247 14:49:51 propolis-server run .falcon/g2.toml [::]:24335 [::]:17057
 9248 14:49:51 propolis-server run .falcon/g3.toml [::]:15673 [::]:23039
dap@ivanova a4x2 $ ls -l .falcon/cr2.*
-rw-r--r--   1 root     root          45 Aug  7 14:49 .falcon/cr2.err
-rw-r--r--   1 root     root         389 Aug  7 14:49 .falcon/cr2.out
-rw-r--r--   1 root     root           4 Aug  7 14:49 .falcon/cr2.pid
-rw-r--r--   1 root     root           5 Aug  7 14:49 .falcon/cr2.port
-rw-r--r--   1 root     root         881 Aug  7 14:49 .falcon/cr2.toml
-rw-r--r--   1 root     root          36 Aug  7 14:49 .falcon/cr2.uuid
-rw-r--r--   1 root     root           5 Aug  7 14:49 .falcon/cr2.vnc_port

And here's the output:

dap@ivanova a4x2 $ looker < .falcon/cr2.out 
21:49:50.671Z INFO propolis-server: metric registration is disabled, no metric data will be produced by this server
21:49:50.672Z INFO propolis-server: reservoir too small (0MiB) to use for guest memory
dap@ivanova a4x2 $ cat .falcon/cr2.err 
Error: Address already in use (os error 125)

I expected we'd find some other process with the same port open. But which port? The EADDRINUSE error message doesn't say but the error suggests it would have been listening on 16934 and that's also what's in cr2.port:

dap@ivanova a4x2 $ cat .falcon/cr2.port 
16934

There's nothing else listening on this port, though:

dap@ivanova a4x2 $ netstat -a | grep 16934
dap@ivanova a4x2 $ 

@rcgoodfellow your comment made me realize it could be the VNC port, which is 16936. Bingo:

dap@ivanova a4x2 $ netstat -a | grep 16936
      *.16936              *.*                 0      0 128000      0 LISTEN
      *.16936                           *.*                              0      0 128000      0 LISTEN      

And actually we can see from the arguments above for the "ce" node that it looks like it's supposed to listen on 16936, and it is:

$ pfexec pfiles 9242
9242:   propolis-server run .falcon/ce.toml [::]:21843 [::]:16936
...
  10: S_IFSOCK mode:0666 dev:561,0 ino:1491809529 uid:0 gid:0 rdev:0,0
      O_RDWR|O_NONBLOCK|O_CLOEXEC
    SOCK_STREAM
    SO_REUSEADDR,SO_SNDBUF(49152),SO_RCVBUF(128000)
    sockname: AF_INET6 ::  port: 16936
...

So I think that's pretty conclusive about what happened here.

davepacheco commented 1 month ago

CockroachDB uses a nice pattern for this with the --listening-url-file FILE argument. You give it a file name and once it starts listening it writes out the whole URL (including the port it picked) to that file. It's basically like the log scraper but more stable. It may not be worth the trouble here since we control the log file and we can also make it fail clearly if we can't find the listening port or something.