Open karencfv opened 1 year ago
I can’t remember whether you’re on a Mac, but getting the right version of dendrite might fix this.
See https://github.com/oxidecomputer/omicron/issues/4333#issuecomment-1778377620
Oh, I see you’re on Linux. In that case try running the prereqs scripts again. Do tests work? I don’t think I tried run-all since making the above work on my machine.
Thanks @david-crespo! Running the prereqs again worked for running `--run-all`, but the separate pieces still fail :(
> I can’t remember whether you’re on a Mac,
I gave up on Mac a while ago :smile:
Glad that worked. `run-all` does a whole lot more than start Nexus and Sled Agent these days, so I don't think running those alone is expected to work.
Seems like running with TLS still requires people to spin up the individual pieces by hand, so we probably want that flow to work if that's the case.
Oh, I see. I've never seen that before. Yeah, that looks pretty out of date. It's possible nobody is regularly running Nexus locally with both TLS and simulated sled agent. And if there is someone, they probably haven't done it since BGP was added, which is where the `mgd` error comes from. I expect most people running the system with TLS to be running the full system.
Being able to run pieces by hand is very useful. If it doesn't work, I think we should fix it. It may be that we now depend on some new component (mgd?) that needs to be started by hand too, and we can just add instructions to start that (however the `run-all` command does it)?
I ran into this again on Helios today. I was not able to get the fully manual simulated workflow working.
I was able to run a Nexus manually with a stack started with `omicron-dev run-all`. It was basically:
- `omicron-dev run-all` to run stuff
- [...] `omicron-dev run-all` stood up

Here's the full delta for my config file against what's in nexus/examples/config.toml today:
$ diff nexus/examples/config.toml config.toml
4a5,7
> mgd.switch0.address = "[::1]:42745"
> mgd.switch1.address = "[::1]:44345"
>
17c20
< level = "info"
---
> level = "debug"
20c23
< mode = "stderr-terminal"
---
> #mode = "stderr-terminal"
23,25c26,28
< #mode = "file"
< #path = "logs/server.log"
< #if_exists = "append"
---
> mode = "file"
> path = "nexus.log"
> if_exists = "append"
33c36
< id = "e6bff1ff-24fb-49dc-a54e-c6a350cd4d6c"
---
> id = "333518da-5601-43e8-b9bd-f675571c5797"
34a38
> techport_external_server_port = 0
42c46
< bind_address = "127.0.0.1:12220"
---
> bind_address = "127.0.0.1:12222"
59c63
< bind_address = "[::1]:12221"
---
> bind_address = "[::1]:12223"
67c71
< address = "[::1]:3535"
---
> address = "[::1]:63080"
72c76
< url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"
---
> url = "postgresql://root@[::1]:36897/omicron?sslmode=disable"
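The two port-related hunks of this diff (the internal DNS `address` and the PostgreSQL `url`) change on every fresh `omicron-dev run-all`, so they could be applied mechanically instead of by hand. The following is only a minimal sketch: `patch_example_config` is a hypothetical helper, not part of omicron, and it assumes the example config still contains exactly the two lines shown on the `<` side of the diff.

```python
# Hypothetical helper, not part of omicron. The "old" lines are copied from the
# "<" side of the diff above and may not match other versions of the example config.
OLD_DB_URL_LINE = 'url = "postgresql://root@[::1]:32221/omicron?sslmode=disable"'
OLD_INTERNAL_DNS_LINE = 'address = "[::1]:3535"'


def patch_example_config(config_text, db_url, internal_dns):
    """Return config_text with the database URL and internal DNS address replaced."""
    replacements = {
        OLD_DB_URL_LINE: f'url = "{db_url}"',
        OLD_INTERNAL_DNS_LINE: f'address = "{internal_dns}"',
    }
    seen = set()
    patched = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped in replacements:
            # Keep whatever indentation the original line had.
            indent = line[: len(line) - len(line.lstrip())]
            patched.append(indent + replacements[stripped])
            seen.add(stripped)
        else:
            patched.append(line)
    missing = set(replacements) - seen
    if missing:
        # Fail loudly rather than silently keeping the old fixed ports.
        raise ValueError(f"lines not found in config: {sorted(missing)}")
    return "\n".join(patched) + "\n"
```

Called with `db_url="postgresql://root@[::1]:36897/omicron?sslmode=disable"` and `internal_dns="[::1]:63080"`, this would produce the same `url` and `address` lines as the `>` side of the diff. The other hunks (logging, `id`, bind addresses, `mgd`, `techport_external_server_port`) are deliberately left out because they do not depend on the ports chosen at startup or need a decision by the person running it.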
Here's what my `omicron-dev run-all` spit out:
$ cargo run --bin=omicron-dev -- run-all
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.96s
Running `target/debug/omicron-dev run-all`
omicron-dev: setting up all services ...
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.0.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.0.log"
DB URL: postgresql://root@[::1]:36897/omicron?sslmode=disable
DB address: [::1]:36897
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.2.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.2.log"
log file: /dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.3.log
note: configured to log to "/dangerzone/omicron_tmp/omicron-dev-omicron-dev.26442.3.log"
omicron-dev: services are running.
omicron-dev: nexus external API: 127.0.0.1:12220
omicron-dev: nexus internal API: [::1]:12221
omicron-dev: cockroachdb pid: 26446
omicron-dev: cockroachdb URL: postgresql://root@[::1]:36897/omicron?sslmode=disable
omicron-dev: cockroachdb directory: /dangerzone/omicron_tmp/.tmpJStNuk
omicron-dev: internal DNS HTTP: http://[::1]:43816
omicron-dev: internal DNS: [::1]:63080
omicron-dev: external DNS name: oxide-dev.test
omicron-dev: external DNS HTTP: http://[::1]:54655
omicron-dev: external DNS: [::1]:42219
omicron-dev: e.g. `dig @::1 -p 42219 test-suite-silo.sys.oxide-dev.test`
omicron-dev: management gateway: http://[::1]:44345 (switch1)
omicron-dev: management gateway: http://[::1]:42745 (switch0)
omicron-dev: silo name: test-suite-silo
omicron-dev: privileged user name: test-privileged
- `mgd.switch0.address` and `mgd.switch1.address`: these work around the issue mentioned here. These ports need to match what `omicron-dev run-all` spits out for "management gateway". Note that it seems to print switches 0 and 1 in a random order and it's easy to get these backwards, especially if you do this multiple times.
- `id`: `omicron-dev` uses the same one as the example config file, which means you'd be starting up a second instance of the same Nexus :-o
- `techport_external_server_port = 0` appears needed to have two instances running on the same system. I think this should probably always be in the example config and I'll file a follow-on issue.
- `omicron-dev` picks any available port but the manual run-through assumes a bunch of fixed ports. This applies to internal DNS and the PostgreSQL URL. The internal DNS port also doesn't seem to match the how-to-run-simulated docs any more. I'm pretty sure it used to.

This flow works but it's pretty painful because every time you want to start from scratch you need to copy and paste a bunch of values from the `omicron-dev` output again into a new config file. By comparison, when the how-to-run-simulated instructions actually worked, they used fixed port numbers so that you could just rerun the same commands again.
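Since the switch order in the output is not stable and the values have to be copied every time, the copy step could be scripted. Here is a rough sketch that pulls the relevant values out of the `omicron-dev run-all` output. The function names are made up for illustration, and the line formats are assumed from the sample output in this thread only, so they may not hold for other omicron versions.

```python
import re

# Line formats are taken from the sample output in this thread; they are an
# assumption, not a documented interface of omicron-dev.
_MGD_RE = re.compile(r"management gateway:\s+http://(\S+)\s+\((switch\d+)\)")
_INTERNAL_DNS_RE = re.compile(r"internal DNS:\s+(\S+)")
_DB_URL_RE = re.compile(r"cockroachdb URL:\s+(\S+)")

SAMPLE = """\
omicron-dev: cockroachdb URL: postgresql://root@[::1]:36897/omicron?sslmode=disable
omicron-dev: internal DNS HTTP: http://[::1]:43816
omicron-dev: internal DNS: [::1]:63080
omicron-dev: external DNS: [::1]:42219
omicron-dev: management gateway: http://[::1]:44345 (switch1)
omicron-dev: management gateway: http://[::1]:42745 (switch0)
"""


def parse_run_all_output(text):
    """Collect the addresses a manually started Nexus needs (hypothetical helper)."""
    values = {"mgd": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("omicron-dev:"):
            continue
        if m := _MGD_RE.search(line):
            address, switch = m.groups()
            # Key by switch name so the print order no longer matters.
            values["mgd"][switch] = address
        elif m := _INTERNAL_DNS_RE.search(line):
            values["internal_dns"] = m.group(1)
        elif m := _DB_URL_RE.search(line):
            values["db_url"] = m.group(1)
    return values


def mgd_config_lines(values):
    """Format the mgd entries the way they appear in the config delta."""
    return [
        f'mgd.{switch}.address = "{address}"'
        for switch, address in sorted(values["mgd"].items())
    ]


if __name__ == "__main__":
    for config_line in mgd_config_lines(parse_run_all_output(SAMPLE)):
        print(config_line)
```

Keying the management gateway addresses by the `(switchN)` suffix is the point of the sketch: it sidesteps the "easy to get these backwards" problem regardless of the order in which the two lines are printed.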
To get the fully manual flow working again, we seem to at least need to add instructions for running mgd manually. CC @rcgoodfellow (not sure who's the right person to look at this).
Some of the issues I mentioned above seem like separate issues with the manual flow, unrelated to the mgd error. And since the manual flow has grown a lot more complicated, the middle ground of running `omicron-dev run-all` and then some components manually may be useful. It'd be nice to document that. I'll see about throwing together a PR for this stuff.
See #6075 for the second part of what I mentioned.
> To get the fully manual flow working again, we seem to at least need to add instructions for running mgd manually. CC @rcgoodfellow
I'm trying to understand how `mgd` fits into the simulated workflow, and manually starting things. Something that strikes me as a bit odd is the need to manually start `mgd` but not dendrite's `dpd` (I don't see manual instructions for that here). My initial plan was to just look at how dendrite is being treated in this environment and follow suit, but now I'm a bit confused about what's going on with this environment.
I wouldn’t be surprised if dpd was just missing too.
The intent of this flow is to achieve the same kind of environment that omicron-dev does, which is the same kind of environment that we get in the test suite. It just gives you more control over execution. The history is that we started with this flow and the test suite (I don’t remember which was first), then we added omicron-dev run-all. I gather we’ve been adding stuff to the test suite (which is mostly automatically picked up by omicron-dev) but not adding stuff here and it’s possible a few things are missing.
~When I run the `cargo run --bin omicron-dev -- run-all` command, simulated omicron fails to start up with the following output:~
~coatlicue@pop-os:\~/src/omicron$ cargo run --bin omicron-dev -- run-all
Finished dev [unoptimized + debuginfo] target(s) in 0.32s
Running `target/debug/omicron-dev run-all`
omicron-dev: setting up all services ...
log file: /tmp/omicron-dev-omicron-dev.1091696.0.log
note: configured to log to "/tmp/omicron-dev-omicron-dev.1091696.0.log"
DB URL: postgresql://root@[::1]:34507/omicron?sslmode=disable
DB address: [::1]:34507
log file: /tmp/omicron-dev-omicron-dev.1091696.1.log
note: configured to log to "/tmp/omicron-dev-omicron-dev.1091696.1.log"
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: failed to discover dendrite port from files in /tmp/.tmpcEaWcn~
~Caused by:
0: time out while discovering dendrite port number
1: deadline has elapsed', /home/coatlicue/src/omicron/nexus/test-utils/src/lib.rs:423:72
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)~
UPDATE: The above is fixed by @david-crespo's suggestion to run the prereqs script again, but I am still seeing the behaviour below.
When I try to run the pieces separately, the `cargo run --bin=nexus -- nexus/examples/config.toml` command returns:

and `cargo run --bin=sled-agent-sim -- $(uuidgen) [::1]:12345 [::1]:12221 --rss-nexus-external-addr 127.0.0.1:12220 --rss-external-dns-internal-addr [::1]:5353 --rss-internal-dns-dns-addr [::1]:3535` returns:

I am unsure if the documentation needs to be updated or why this is failing locally and not in CI :woman_shrugging: