Looking at the sled agent logs, it seems like the sled agent inits the pantry zone twice:
$ pfexec grep -i pantry $(svcs -L svc:/oxide/sled-agent:default) | looker
16:05:54.819Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:05:54.819Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:05:54.843Z INFO SledAgent (BootstrapAgent): Configuring new Omicron zone: oxz_crucible_pantry
16:05:54.859Z INFO SledAgent (BootstrapAgent): Installing Omicron zone: oxz_crucible_pantry
16:05:58.836Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
<service version="1" type="service" name="oxide/crucible/pantry">
<instance enabled="true" name="default">
<property_group type="application" name="config">
<propval type="astring" name="datalink" value="oxControlService8"/>
<propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
<propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
<propval type="astring" name="listen_port" value="17000"/>
</property_group>
</instance>
</service>
</service_bundle>
16:05:58.837Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_crucible_pantry
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:06:13.431Z INFO SledAgent (BootstrapAgent): install_omicron_zone: Found zone: oxz_crucible_pantry in state Running
16:06:13.431Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
<service version="1" type="service" name="oxide/crucible/pantry">
<instance enabled="true" name="default">
<property_group type="application" name="config">
<propval type="astring" name="datalink" value="oxControlService9"/>
<propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
<propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
<propval type="astring" name="listen_port" value="17000"/>
</property_group>
</instance>
</service>
</service_bundle>
16:06:13.431Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_crucible_pantry
The second time through, it found the pantry zone already in state Running.
It seems odd that the sled agent first reported that the crucible pantry zone did not yet exist, then found it in state Running...?
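For context, the two log lines come from two different lookups. Here is a minimal sketch of the shape of that check (hypothetical, simplified names, not the actual sled-agent code): the "does not yet exist" line comes from a check against the in-memory existing_zones list (the same check visible in the diff further down), while the install_omicron_zone line reflects what the system itself reports.

// Hypothetical sketch only -- simplified stand-ins, not the actual omicron code.
#[derive(Debug)]
struct Zone {
    name: String,
    state: String,
}

struct ServiceManager {
    // Only updated once a zone's initialization runs to completion.
    existing_zones: Vec<Zone>,
}

impl ServiceManager {
    async fn ensure_zone(&self, expected_zone_name: &str) {
        // First check: in-memory bookkeeping
        // ("Service zone ... already exists" / "... does not yet exist").
        if self.existing_zones.iter().any(|z| z.name == expected_zone_name) {
            return;
        }
        // Second check: ask the OS what zones really exist
        // ("install_omicron_zone: Found zone: ... in state Running").
        if let Some(zone) = find_zone_on_system(expected_zone_name).await {
            println!("zone exists on the system but is not tracked: {zone:?}");
        }
    }
}

// Stand-in for the real Zones::find, which consults zoneadm.
async fn find_zone_on_system(_name: &str) -> Option<Zone> {
    None
}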
This may be due to multiple PUTs to /services during RSS for this particular sled:
james@frostypaws:~/omicron$ pfexec cat /var/svc/log/oxide-sled-agent:default.log | looker | grep 'initializing sled services' | grep 'fd00:1122:3344:101::' | cut -d'[' -f1
16:03:46.837Z INFO SledAgent (RSS): initializing sled services:
16:05:13.275Z INFO SledAgent (RSS): initializing sled services:
16:06:13.413Z INFO SledAgent (RSS): initializing sled services:
This corresponds with
16:05:58.836Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
<service version="1" type="service" name="oxide/crucible/pantry">
<instance enabled="true" name="default">
<property_group type="application" name="config">
<propval type="astring" name="datalink" value="oxControlService8"/>
<propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
<propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
<propval type="astring" name="listen_port" value="17000"/>
</property_group>
</instance>
</service>
</service_bundle>
16:05:58.837Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_crucible_pantry
16:06:13.415Z INFO SledAgent (dropshot (SledAgent)): accepted connection
local_addr = [fd00:1122:3344:101::1]:12345
remote_addr = [fd00:1122:3344:104::1]:38431
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone external_dns already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Oximeter
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone oximeter already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Nexus
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone nexus already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:06:13.431Z INFO SledAgent (BootstrapAgent): install_omicron_zone: Found zone: oxz_crucible_pantry in state Running
16:06:13.431Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
<service version="1" type="service" name="oxide/crucible/pantry">
<instance enabled="true" name="default">
<property_group type="application" name="config">
<propval type="astring" name="datalink" value="oxControlService9"/>
<propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
<propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
<propval type="astring" name="listen_port" value="17000"/>
</property_group>
</instance>
</service>
</service_bundle>
16:06:13.431Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_crucible_pantry
16:06:35.537Z INFO SledAgent (dropshot (SledAgent)): request completed
local_addr = [fd00:1122:3344:101::1]:12345
method = PUT
remote_addr = [fd00:1122:3344:104::1]:38431
req_id = afe8aa6b-ee55-4e16-999d-916396864dd5
response_code = 204
uri = /services
The crucible pantry zone is booting at 16:05:58.837Z, but isn't found in the existing zones list at 16:06:13.416Z.
Unfortunately, I was unable to reproduce this after rebooting.
If the double-PUT is the problem, https://github.com/oxidecomputer/omicron/issues/2977 is very related.
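For illustration, a retry-on-timeout loop of roughly this shape on the caller's side would produce the repeated PUTs above whenever zone setup takes longer than the client timeout. This is only a sketch under that assumption (using reqwest, with made-up timeout values), not the actual RSS code:

use std::time::Duration;

// Sketch only: if the first PUT /services times out while the sled agent is
// still booting zones, the same request gets sent again, matching the repeated
// "initializing sled services" lines above.
async fn put_services_with_retries(
    url: &str,
    body: String,
) -> Result<(), reqwest::Error> {
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(60)) // hypothetical client timeout
        .build()?;

    loop {
        match client.put(url).body(body.clone()).send().await {
            Ok(resp) if resp.status().is_success() => return Ok(()),
            Ok(resp) => eprintln!("retrying after status {}", resp.status()),
            Err(e) if e.is_timeout() => eprintln!("retrying after timeout: {e}"),
            Err(e) => return Err(e),
        }
        tokio::time::sleep(Duration::from_secs(5)).await; // hypothetical backoff
    }
}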
I ran about 10 reboots with
diff --git a/sled-agent/src/services.rs b/sled-agent/src/services.rs
index d05140944..65856de9e 100644
--- a/sled-agent/src/services.rs
+++ b/sled-agent/src/services.rs
@@ -1787,13 +1787,25 @@ impl ServiceManager {
if existing_zones.iter().any(|z| z.name() == expected_zone_name) {
info!(
self.inner.log,
- "Service zone {} already exists", req.zone.zone_type
+ "Service zone {} already exists", &expected_zone_name
);
continue;
} else {
+ match Zones::find(&expected_zone_name).await {
+ Ok(zone) => {
+ if let Some(zone) = zone {
+ panic!("zone {expected_zone_name} does not exist in existing_zones, but it does! we found {:?}", zone);
+ }
+ }
+
+ Err(e) => {
+ panic!("saw error trying to find zone: {:?}", e);
+ }
+ }
+
info!(
self.inner.log,
- "Service zone {} does not yet exist", req.zone.zone_type
+ "Service zone {} does not yet exist", &expected_zone_name
);
}
And finally saw it panic:
20:15:40.471Z INFO SledAgent: contacting server nexus, registering sled
baseboard = Baseboard { identifier: "fa7f4ee88f2d4de4aaee435d97832cea", model: "i86pc", revision: 0 }
id = bc991e2b-3059-4313-8973-8cd7507b3b35
sled_id = bc991e2b-3059-4313-8973-8cd7507b3b35
20:15:41.463Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_external_dns
WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself
will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device
until then. Error: DeleteVnicError { name: "vopte2", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte2
", status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }
20:16:02.356Z INFO SledAgent (dropshot (SledAgent)): accepted connection
local_addr = [fd00:1122:3344:101::1]:12345
remote_addr = [fd00:1122:3344:104::1]:33615
20:16:02.357Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
thread 'tokio-runtime-worker' panicked at 'zone oxz_external_dns does not exist in existing_zones, but it does! we found Zone { id: Some(48), name: "oxz_external_dns", state: Running, path: "/zone/oxz_external_dns", uuid: Some("6b68cf0e-591c-e153-948e-99042f31281a"), brand: "omicron1", ip_type: Exclusive }', sled-agent/src/services.rs:1797:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
This looks like the same pattern as before, except this time the sled agent hits the panic I put in. I'm going to attempt another repro with
diff --git a/smf/sled-agent/manifest.xml b/smf/sled-agent/manifest.xml
index 5732aedd7..c7582b361 100644
--- a/smf/sled-agent/manifest.xml
+++ b/smf/sled-agent/manifest.xml
@@ -40,6 +40,7 @@
<method_context>
<method_environment>
<envvar name="LD_LIBRARY_PATH" value="/opt/ooce/pgsql-13/lib/amd64" />
+ <envvar name="RUST_BACKTRACE" value="1" />
</method_environment>
</method_context>
<exec_method type='method' name='start'
so that we can see a backtrace.
20:52:42.293Z INFO SledAgent (BootstrapAgent): Zone booting
zone = oxz_crucible_pantry
20:52:56.408Z INFO SledAgent (dropshot (SledAgent)): accepted connection
local_addr = [fd00:1122:3344:103::1]:12345
remote_addr = [fd00:1122:3344:104::1]:62099
20:52:56.409Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Nexus
20:52:56.410Z INFO SledAgent (BootstrapAgent): Service zone oxz_nexus already exists
20:52:56.410Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
20:52:56.410Z INFO SledAgent (BootstrapAgent): Service zone oxz_external_dns already exists
20:52:56.410Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
thread 'tokio-runtime-worker' panicked at 'zone oxz_crucible_pantry does not exist in existing_zones, but it does! we found Zone { id: Some(58), name: "oxz_crucible_pantry", state: Running, path: "/zone/oxz_crucible_pantry", uuid: Some("77d817d9-0e8b-c33b-b8f1-c1d77561b029"), brand: "omicron1", ip_type: Exclusive }', sled-agent/src/services.rs:1797:29
stack backtrace:
0: rust_begin_unwind
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
1: core::panicking::panic_fmt
at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
2: omicron_sled_agent::services::ServiceManager::initialize_services_locked::{{closure}}
3: <FuncType as dropshot::handler::HttpHandlerFunc<Context,(T0,),ResponseType>>::handle_request::{{closure}}
4: <dropshot::handler::HttpRouteHandler<Context,HandlerType,FuncParams,ResponseType> as dropshot::handler::RouteHandler<Context>>::handle_request::{{closure}}
5: dropshot::server::http_request_handle_wrap::{{closure}}
6: <hyper::proto::h1::dispatch::Server<S,hyper::body::body::Body> as hyper::proto::h1::dispatch::Dispatch>::poll_msg
7: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_loop
8: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
9: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
10: <hyper::server::server::new_svc::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll
11: tokio::runtime::task::core::Core<T,S>::poll
12: tokio::runtime::task::harness::Harness<T,S>::poll
13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
14: tokio::runtime::scheduler::multi_thread::worker::Context::run
15: tokio::runtime::scheduler::multi_thread::worker::run
16: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
17: tokio::runtime::task::core::Core<T,S>::poll
18: tokio::runtime::task::harness::Harness<T,S>::poll
19: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
I've managed to reproduce this with a small program:
use anyhow::anyhow;
use anyhow::Result;
use dropshot::endpoint;
use dropshot::ApiDescription;
use dropshot::HttpError;
use dropshot::HttpResponseOk;
use dropshot::RequestContext;
use dropshot::{ConfigLogging, ConfigLoggingIfExists, ConfigLoggingLevel};
use schemars::JsonSchema;
use serde::Serialize;
use std::sync::Arc;
use tokio::sync::Mutex;
struct Opaque {}
struct ServerContextInner {
l: Mutex<Vec<Opaque>>,
}
#[derive(Clone)]
struct ServerContext {
inner: Arc<ServerContextInner>,
}
pub async fn empty_function() {
}
pub async fn run_ls() {
let mut command = tokio::process::Command::new("ls");
command.output().await.unwrap();
}
impl ServerContext {
pub async fn step_one(&self) {
let mut ctx = self.inner.l.lock().await;
self.step_two(&mut ctx).await;
}
pub async fn step_two(&self,
_list: &mut Vec<Opaque>,
) {
// Crash if this file exists
if std::path::Path::new("foo").exists() {
panic!("oh no! foo exists");
}
// Create the file - we've got the lock here, so we should be the only
// one that sees this file!
std::fs::write("foo", "bad!").unwrap();
// Execute a program that sleeps for 4 seconds, then prints 'online'
let output = std::process::Command::new("./prog.sh")
.output()
.unwrap();
let stdout = String::from_utf8_lossy(&output.stdout);
assert_eq!(stdout.trim(), "online");
// If the original request timed out during the execution of prog.sh
// above, the theory is that either dropshot or hyper is aborting this
// task. This means that execution never returns after hitting this
// await point, and the file never gets removed.
// But, it matters what future is being awaited on. Execution never
// returns for the yield_now and sleep calls, but does for the
// empty_function and run_ls calls.
//tokio::task::yield_now().await;
tokio::time::sleep(std::time::Duration::from_millis(100)).await;
//empty_function().await;
//run_ls().await
// Remove the file
std::fs::remove_file("foo").unwrap();
}
}
#[derive(JsonSchema,Serialize)]
struct ResultBody {
len: usize,
}
#[endpoint {
method = PUT,
path = "/foobar",
}]
async fn foobar(
rqctx: RequestContext<ServerContext>,
) -> Result<HttpResponseOk<ResultBody>, HttpError> {
let ctx = rqctx.context();
ctx.step_one().await;
Ok(HttpResponseOk(ResultBody { len: 0 }))
}
fn api() -> ApiDescription<ServerContext> {
let mut api = ApiDescription::<ServerContext>::new();
api.register(foobar).expect("foobar worked");
api
}
#[tokio::main]
async fn main() -> Result<()> {
let log = ConfigLogging::File {
level: ConfigLoggingLevel::Info,
path: "/dev/stdout".into(),
if_exists: ConfigLoggingIfExists::Append,
}
.to_logger("playground")?;
let server = dropshot::HttpServerStarter::new(
&dropshot::ConfigDropshot {
bind_address: "0.0.0.0:12345".parse().unwrap(),
request_body_max_bytes: 1024 * 10,
..Default::default()
},
api(),
ServerContext {
inner: Arc::new(
ServerContextInner {
l: Mutex::new(vec![]),
}
),
},
&log.new(slog::o!("component" => "dropshot")),
)
.map_err(|e| anyhow!("creating server: {:?}", e))?
.start();
server
.await
.map_err(|e| anyhow!("starting server: {:?}", e))
}
I've noted in the comments what I think is happening here:

- PUT /foobar is called, spawning a task and running the endpoint function
- prog.sh is run (sleep 4; echo online)
- the client times out after 1 second and disconnects (curl -m 1 -sq http://127.0.0.1:12345/foobar -X PUT)
- the tokio::time::sleep await point is hit

With another terminal running two of the example curl commands above, the dropshot logs look like:
13:51:20.885Z INFO playground (dropshot): listening
local_addr = 0.0.0.0:12345
13:51:22.185Z INFO playground (dropshot): accepted connection
local_addr = 0.0.0.0:12345
remote_addr = 127.0.0.1:41542
13:51:26.188Z INFO playground (dropshot): accepted connection
local_addr = 0.0.0.0:12345
remote_addr = 127.0.0.1:41554
thread 'tokio-runtime-worker' panicked at 'oh no! foo exists', src/main.rs:48:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
For the case where curl is called without the timeout, the logs look like:
13:57:04.865Z INFO playground (dropshot): listening
local_addr = 0.0.0.0:12345
13:57:09.692Z INFO playground (dropshot): accepted connection
local_addr = 0.0.0.0:12345
remote_addr = 127.0.0.1:34684
13:57:13.796Z INFO playground (dropshot): request completed
local_addr = 0.0.0.0:12345
method = PUT
remote_addr = 127.0.0.1:34684
req_id = fceedcdc-6180-4586-8235-1117f99633ac
response_code = 200
uri = /foobar
13:57:16.436Z INFO playground (dropshot): accepted connection
local_addr = 0.0.0.0:12345
remote_addr = 127.0.0.1:34690
13:57:20.540Z INFO playground (dropshot): request completed
local_addr = 0.0.0.0:12345
method = PUT
remote_addr = 127.0.0.1:34690
req_id = d4c7b4e3-ecfe-4090-82c2-cac2b125b471
response_code = 200
uri = /foobar
This explains what I was seeing: the original call to PUT /services times out as the zones are being brought up, and a second call is made after a short delay. The zones had booted and had been partially configured, but were never added to the existing_zones vec because execution never came back to finish the function.
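As a minimal illustration of that cancellation behaviour (a simplified model, with an explicitly aborted task standing in for hyper/dropshot dropping the request future when the client disconnects), nothing after the pending .await ever runs:

use std::time::Duration;

#[tokio::main]
async fn main() {
    let task = tokio::spawn(async {
        println!("zone configured"); // the side effect happens...
        tokio::time::sleep(Duration::from_secs(10)).await;
        // ...but if the future is dropped while parked at the await above,
        // this line never runs -- just like existing_zones never being updated.
        println!("bookkeeping updated");
    });

    tokio::time::sleep(Duration::from_millis(50)).await;
    // Stand-in for the client timing out and the connection being torn down.
    task.abort();
    assert!(task.await.unwrap_err().is_cancelled());
    println!("request future dropped; bookkeeping was never updated");
}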
The profile states that oxControlService9 should be the datalink, but the zone has the wrong one!
The zone cfg: