[Simulated Omicron] Fails to start up when running pieces individually due to `Failed to lookup mgd address:` when running locally

karencfv commented 1 year ago

~When I run the cargo run --bin omicron-dev -- run-all command, simulated omicron fails to start up with the following output:~

~coatlicue@pop-os:\~/src/omicron$ cargo run --bin omicron-dev -- run-all~
~Finished dev [unoptimized + debuginfo] target(s) in 0.32s~
~Running target/debug/omicron-dev run-all~
~omicron-dev: setting up all services ...~
~log file: /tmp/omicron-dev-omicron-dev.1091696.0.log~
~note: configured to log to "/tmp/omicron-dev-omicron-dev.1091696.0.log"~
~DB URL: postgresql://root@[::1]:34507/omicron?sslmode=disable~
~DB address: [::1]:34507~
~log file: /tmp/omicron-dev-omicron-dev.1091696.1.log~
~note: configured to log to "/tmp/omicron-dev-omicron-dev.1091696.1.log"~
~thread 'main' panicked at 'called Result::unwrap() on an Err value: failed to discover dendrite port from files in /tmp/.tmpcEaWcn~

~Caused by:~
~0: time out while discovering dendrite port number~
~1: deadline has elapsed', /home/coatlicue/src/omicron/nexus/test-utils/src/lib.rs:423:72~
~note: run with RUST_BACKTRACE=1 environment variable to display a backtrace~
~Aborted (core dumped)~

UPDATE: The above is fixed by @david-crespo's suggestion to run the prereqs script again, but I am still seeing the behaviour below.

When I try to run the pieces separately, `cargo run --bin=nexus -- nexus/examples/config.toml` returns:

Nov 03 00:15:59.680 INFO SEC running, sec_id: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c, component: SEC, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c, file: /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/steno-0.4.0/src/sec.rs:811
Nov 03 00:16:14.726 WARN Failed to lookup mgd address: Cannot lookup mgd addresses: request timed out, component: nexus, component: ServerContext, name: e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c, file: nexus/src/app/mod.rs:307

and `cargo run --bin=sled-agent-sim -- $(uuidgen) [::1]:12345 [::1]:12221 --rss-nexus-external-addr 127.0.0.1:12220 --rss-external-dns-internal-addr [::1]:5353 --rss-internal-dns-dns-addr [::1]:3535` returns:

Nov 03 00:17:44.965 WARN failed to contact nexus, will retry in 17.393268136s, error: Communication Error: error sending request for url (http://[::1]:12221/sled-agents/1f913b0a-962d-4d28-a706-9c436d3691ad): error trying to connect: tcp connect error: Connection refused (os error 111), file: sled-agent/src/sim/server.rs:123

I am unsure if the documentation needs to be updated or why this is failing locally and not in CI :woman_shrugging:

david-crespo commented 1 year ago

I can’t remember whether you’re on a Mac, but getting the right version of dendrite might fix this.

See https://github.com/oxidecomputer/omicron/issues/4333#issuecomment-1778377620

david-crespo commented 1 year ago

Oh, I see you’re on Linux. In that case try running the prereqs scripts again. Do tests work? I don’t think I tried run-all since making the above work on my machine.

karencfv commented 1 year ago

Thanks @david-crespo! Running the prereqs script again fixed run-all, but the separate pieces still fail :(

karencfv commented 1 year ago

> I can’t remember whether you’re on a Mac,

I gave up on Mac a while ago :smile:

david-crespo commented 1 year ago

Glad that worked. run-all does a whole lot more than start Nexus and Sled Agent these days, so I don't think running those alone is expected to work.

https://github.com/oxidecomputer/omicron/blob/dbf01fddbfc9c9b836c173f70ed80340c9230d09/nexus/test-utils/src/lib.rs#L956-L984

karencfv commented 1 year ago

It seems that running with TLS still requires people to spin up the individual pieces by hand. If that's the case, we probably want that flow to work.

david-crespo commented 1 year ago

Oh, I see. I've never seen that before. Yeah, that looks pretty out of date. It's possible nobody is regularly running Nexus locally with both TLS and simulated sled agent. And if there is someone, they probably haven't done it since BGP was added, which is where the mgd error comes from. I expect most people running the system with TLS to be running the full system.

davepacheco commented 1 year ago

Being able to run pieces by hand is very useful. If it doesn't work, I think we should fix it. It may be that we now depend on some new component (mgd?) that needs to be started by hand too and we can just add instructions to start that (however the run-all command does it)?

davepacheco commented 4 months ago

I ran into this again on Helios today. I was not able to get the fully manual simulated workflow working.

Workaround

I was able to run a Nexus manually with a stack started with omicron-dev run-all. It was basically:

1. Use omicron-dev run-all to run stuff
2. Munge the stock Nexus config file as needed to match what omicron-dev run-all stood up
3. Start Nexus manually the way the docs say

Here's the full delta for my config file against what's in nexus/examples/config.toml today:

$ diff nexus/examples/config.toml config.toml 
4a5,7
> mgd.switch0.address = "[::1]:42745"
> mgd.switch1.address = "[::1]:44345"
> 
17c20
< level = "info"
---
> level = "debug"
20c23
< mode = "stderr-terminal"
---
> #mode = "stderr-terminal"
23,25c26,28
< #mode = "file"
< #path = "logs/server.log"
< #if_exists = "append"
---
> mode = "file"
> path = "nexus.log"
> if_exists = "append"
33c36
< id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
---
> id = "333518da-5601-43e8-b9bd-f675571c5797"
34a38
> techport_external_server_port = 0
42c46
< bind_address = "127.0.0.1:12220"
---
> bind_address = "127.0.0.1:12222"
59c63
< bind_address = "[::1]:12221"
---
> bind_address = "[::1]:12223"
67c71
< address = "[::1]:3535"
---
> address = "[::1]:63080"
72c76
< url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"
---
> url = "postgresql://root@[::1]:36897/omicron?sslmode=disable"

Here's what my omicron-dev run-all spit out:

$ cargo run --bin=omicron-dev -- run-all
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.96s
     Running `target/debug/omicron-dev run-all`
omicron-dev: setting up all services ... 
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.0.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.0.log"
DB URL: postgresql://root@[::1]:36897/omicron?sslmode=disable
DB address: [::1]:36897
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.2.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.2.log"
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.3.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.3.log"
omicron-dev: services are running.
omicron-dev: nexus external API:    127.0.0.1:12220
omicron-dev: nexus internal API:    [::1]:12221
omicron-dev: cockroachdb pid:       26446
omicron-dev: cockroachdb URL:       postgresql://root@[::1]:36897/omicron?sslmode=disable
omicron-dev: cockroachdb directory: /dangerzone/omicron_tmp/.tmpJStNuk
omicron-dev: internal DNS HTTP:     http://[::1]:43816
omicron-dev: internal DNS:          [::1]:63080
omicron-dev: external DNS name:     oxide-dev.test
omicron-dev: external DNS HTTP:     http://[::1]:54655
omicron-dev: external DNS:          [::1]:42219
omicron-dev:   e.g. `dig @::1 -p 42219 test-suite-silo.sys.oxide-dev.test`
omicron-dev: management gateway:    http://[::1]:44345 (switch1)
omicron-dev: management gateway:    http://[::1]:42745 (switch0)
omicron-dev: silo name:             test-suite-silo
omicron-dev: privileged user name:  test-privileged

- mgd.switch0.address and mgd.switch1.address: these work around the issue mentioned here. These ports need to match what omicron-dev run-all prints for "management gateway". Note that it seems to print switches 0 and 1 in a random order, so it's easy to get these backwards, especially if you do this multiple times.
- I messed with the logging, but you don't need to do that to get this to work.
- You need to choose a new id because omicron-dev uses the same one as the example config file, which means you'd otherwise be starting up a second instance of the same Nexus :-o
- techport_external_server_port = 0 appears to be needed to have two instances running on the same system. I think this should probably always be in the example config and I'll file a follow-on issue.
- The bind ports for the internal and external dropshot servers need to differ from what omicron-dev uses. It'd be nice to fix this but it's annoying. For stuff that you might interact with, omicron-dev picks the same ports as the manual run-through so that you get the same end result. But that means it'll necessarily conflict with doing this twice. Maybe we should just have the example use different ports. (They already wind up with a different silo name and password so it's not exactly the same.)
- There are a few ports where omicron-dev picks any available port but the manual run-through assumes a bunch of fixed ports. This applies to internal DNS and the PostgreSQL URL. The internal DNS port also doesn't seem to match the how-to-run-simulated docs any more. I'm pretty sure it used to.

This flow works but it's pretty painful because every time you want to start from scratch you need to copy and paste a bunch of values from the omicron-dev output again into a new config file. By comparison, when the how-to-run-simulated instructions actually worked, they used fixed port numbers so that you could just rerun the same commands again.
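The copy-and-paste step could be scripted. Below is a minimal Python sketch, assuming the omicron-dev run-all output keeps the exact line format quoted above. `parse_run_all`, `internal_dns_address` and `database_url` are hypothetical names made up for illustration; only the `mgd.switchN.address` keys come from the config diff above. Keying the result by switch name also sidesteps the random print order of the two management gateway lines.

```python
import re

# Matches lines such as "omicron-dev: internal DNS:          [::1]:63080".
# The label may not contain ":", so status lines and the "e.g." hint are skipped.
_LINE = re.compile(r"^omicron-dev:\s+(?P<label>[^:]+):\s+(?P<value>\S.*?)\s*$")

# Matches a management gateway value such as "http://[::1]:44345 (switch1)".
_GATEWAY = re.compile(r"http://(\S+) \((switch\d+)\)")

# A few lines copied from the run-all output quoted in this comment.
SAMPLE = """\
omicron-dev: services are running.
omicron-dev: cockroachdb URL:       postgresql://root@[::1]:36897/omicron?sslmode=disable
omicron-dev: internal DNS HTTP:     http://[::1]:43816
omicron-dev: internal DNS:          [::1]:63080
omicron-dev:   e.g. `dig @::1 -p 42219 test-suite-silo.sys.oxide-dev.test`
omicron-dev: management gateway:    http://[::1]:44345 (switch1)
omicron-dev: management gateway:    http://[::1]:42745 (switch0)
"""


def parse_run_all(output: str) -> dict:
    """Collect the values that have to be copied into the Nexus config."""
    found = {}
    for line in output.splitlines():
        m = _LINE.match(line)
        if m is None:
            continue
        label, value = m.group("label"), m.group("value")
        if label == "management gateway":
            gw = _GATEWAY.fullmatch(value)
            if gw:
                # Key by switch name so print order does not matter.
                found[f"mgd.{gw.group(2)}.address"] = gw.group(1)
        elif label == "internal DNS":
            # Hypothetical key name; the real TOML section is not shown here.
            found["internal_dns_address"] = value
        elif label == "cockroachdb URL":
            # Hypothetical key name; the real TOML section is not shown here.
            found["database_url"] = value
    return found


if __name__ == "__main__":
    for key, value in sorted(parse_run_all(SAMPLE).items()):
        print(f'{key} = "{value}"')
```

This only extracts the values. You would still need to edit them into config.toml by hand, and pick a new id and non-conflicting bind ports as described in the list above.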

Next steps?

To get the fully manual flow working again, we seem to at least need to add instructions for running mgd manually. CC @rcgoodfellow (not sure who's the right person to look at this).

Some of the issues I mentioned above seem like separate issues with the manual flow, unrelated to the mgd error. And since the manual flow has grown a lot more complicated, the middle ground of running omicron-dev run-all and then some components manually may be useful. It'd be nice to document that. I'll see about throwing together a PR for this stuff.

davepacheco commented 4 months ago

See #6075 for the second part of what I mentioned.

rcgoodfellow commented 4 months ago

> To get the fully manual flow working again, we seem to at least need to add instructions for running mgd manually. CC @rcgoodfellow

I'm trying to understand how mgd fits into the simulated workflow and into starting things manually. Something that strikes me as a bit odd is the need to manually start mgd but not dendrite's dpd (I don't see manual instructions for that here). My initial plan was to just look at how dendrite is being treated in this environment and follow suit, but now I'm a bit confused about what's going on with this environment.

davepacheco commented 4 months ago

I wouldn’t be surprised if dpd was just missing too.

The intent of this flow is to achieve the same kind of environment that omicron-dev does, which is the same kind of environment that we get in the test suite. It just gives you more control over execution. The history is that we started with this flow and the test suite (I don’t remember which was first), then we added omicron-dev run-all. I gather we’ve been adding stuff to the test suite (which is mostly automatically picked up by omicron-dev) but not adding stuff here and it’s possible a few things are missing.

oxidecomputer/omicron#4421