oxidecomputer / omicron

Omicron: Oxide control plane

crucible pantry zone got the wrong control vnic #3098

Closed jmpesp closed 1 year ago

jmpesp commented 1 year ago

The profile states that oxControlService9 should be the datalink:

root@oxz_crucible_pantry:~# cat /var/svc/profile/site.xml
<!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
<service_bundle type="profile" name="omicron">
  <service version="1" type="service" name="oxide/crucible/pantry">
    <instance enabled="true" name="default">
      <property_group type="application" name="config">
        <propval type="astring" name="datalink" value="oxControlService9"/>
        <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
        <propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
        <propval type="astring" name="listen_port" value="17000"/>
      </property_group>
    </instance>
  </service>
</service_bundle>

but the zone has the wrong one!

root@oxz_crucible_pantry:~# dladm
LINK        CLASS     MTU    STATE    BRIDGE     OVER
oxControlService8 vnic 9000  up       --         ?

The zone cfg:

james@dinnerbone:~$ pfexec zonecfg -z oxz_crucible_pantry info
zonename: oxz_crucible_pantry
zonepath: /zone/oxz_crucible_pantry
brand: omicron1
autoboot: false
bootargs: 
pool: 
limitpriv: 
scheduling-class: 
ip-type: exclusive
hostid: 
fs-allowed: 
net:
    address not specified
    allowed-address not specified
    physical: oxControlService8
    defrouter not specified

jmpesp commented 1 year ago

Looking at the sled agent logs, it seems like the sled agent inits the pantry zone twice:

$ pfexec grep -i pantry $(svcs -L svc:/oxide/sled-agent:default) | looker
16:05:54.819Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:05:54.819Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:05:54.843Z INFO SledAgent (BootstrapAgent): Configuring new Omicron zone: oxz_crucible_pantry
16:05:54.859Z INFO SledAgent (BootstrapAgent): Installing Omicron zone: oxz_crucible_pantry
16:05:58.836Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="oxide/crucible/pantry">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlService8"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
            <propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
            <propval type="astring" name="listen_port" value="17000"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
16:05:58.837Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_crucible_pantry

16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:06:13.431Z INFO SledAgent (BootstrapAgent): install_omicron_zone: Found zone: oxz_crucible_pantry in state Running
16:06:13.431Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="oxide/crucible/pantry">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlService9"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
            <propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
            <propval type="astring" name="listen_port" value="17000"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
16:06:13.431Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_crucible_pantry

The second time, it found the pantry in state Running.

It seems odd that the sled agent reported that the crucible pantry zone does not yet exist, then immediately found it in state Running.

jmpesp commented 1 year ago

This may be due to multiple PUTs to /services during RSS for this particular sled:

james@frostypaws:~/omicron$ pfexec cat /var/svc/log/oxide-sled-agent:default.log | looker | grep 'initializing sled services' | grep 'fd00:1122:3344:101::' | cut -d'[' -f1
16:03:46.837Z INFO SledAgent (RSS): initializing sled services: 
16:05:13.275Z INFO SledAgent (RSS): initializing sled services: 
16:06:13.413Z INFO SledAgent (RSS): initializing sled services:

This corresponds with

16:05:58.836Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="oxide/crucible/pantry">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlService8"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
            <propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
            <propval type="astring" name="listen_port" value="17000"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
16:05:58.837Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_crucible_pantry

16:06:13.415Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:104::1]:38431

16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone external_dns already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Oximeter
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone oximeter already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Nexus
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone nexus already exists
16:06:13.416Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
16:06:13.416Z INFO SledAgent (BootstrapAgent): Service zone crucible_pantry does not yet exist
16:06:13.431Z INFO SledAgent (BootstrapAgent): install_omicron_zone: Found zone: oxz_crucible_pantry in state Running
16:06:13.431Z INFO SledAgent (BootstrapAgent): Profile for oxz_crucible_pantry:
    <!DOCTYPE service_bundle SYSTEM "/usr/share/lib/xml/dtd/service_bundle.dtd.1">
    <service_bundle type="profile" name="omicron">
      <service version="1" type="service" name="oxide/crucible/pantry">
        <instance enabled="true" name="default">
          <property_group type="application" name="config">
            <propval type="astring" name="datalink" value="oxControlService9"/>
            <propval type="astring" name="gateway" value="fd00:1122:3344:101::1"/>
            <propval type="astring" name="listen_addr" value="fd00:1122:3344:101::9"/>
            <propval type="astring" name="listen_port" value="17000"/>
          </property_group>
        </instance>
      </service>
    </service_bundle>
16:06:13.431Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_crucible_pantry
16:06:35.537Z INFO SledAgent (dropshot (SledAgent)): request completed
    local_addr = [fd00:1122:3344:101::1]:12345
    method = PUT
    remote_addr = [fd00:1122:3344:104::1]:38431
    req_id = afe8aa6b-ee55-4e16-999d-916396864dd5
    response_code = 204
    uri = /services

The crucible pantry zone is booting at 16:05:58.837Z, but isn't found in the existing zones list at 16:06:13.416Z.
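
For reference, the two log messages come from different sources of truth: "Service zone ... does not yet exist" is keyed off an in-memory list of zones the sled agent believes it has created, while install_omicron_zone inspects the zones that actually exist on the system. A minimal sketch of that shape (illustrative types only, not the actual sled-agent code):

// Illustrative sketch: two different answers to "does this zone exist?".
struct ServiceManager {
    // Zones this process remembers booting (the sled agent's existing_zones).
    existing_zones: Vec<String>,
}

impl ServiceManager {
    fn ensure_zone(&mut self, name: &str) {
        // Check 1: the in-memory list. This is what produces the
        // "Service zone ... does not yet exist" message.
        if self.existing_zones.iter().any(|z| z == name) {
            println!("Service zone {} already exists", name);
            return;
        }
        println!("Service zone {} does not yet exist", name);

        // Check 2: install_omicron_zone then asks the OS directly, so it can
        // still find a Running zone that an earlier request booted but never
        // got as far as pushing onto existing_zones.
        println!("install_omicron_zone: (would inspect the system's zones here)");

        self.existing_zones.push(name.to_string());
    }
}

fn main() {
    let mut mgr = ServiceManager { existing_zones: Vec::new() };
    mgr.ensure_zone("oxz_crucible_pantry");
}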

jmpesp commented 1 year ago

Unfortunately I was unable to reproduce this after rebooting.

smklein commented 1 year ago

If the double-PUT is the problem, https://github.com/oxidecomputer/omicron/issues/2977 is very related

jmpesp commented 1 year ago

I ran about 10 reboots with this patch applied:

diff --git a/sled-agent/src/services.rs b/sled-agent/src/services.rs
index d05140944..65856de9e 100644
--- a/sled-agent/src/services.rs
+++ b/sled-agent/src/services.rs
@@ -1787,13 +1787,25 @@ impl ServiceManager {
             if existing_zones.iter().any(|z| z.name() == expected_zone_name) {
                 info!(
                     self.inner.log,
-                    "Service zone {} already exists", req.zone.zone_type
+                    "Service zone {} already exists", &expected_zone_name
                 );
                 continue;
             } else {
+                match Zones::find(&expected_zone_name).await {
+                    Ok(zone) => {
+                        if let Some(zone) = zone {
+                            panic!("zone {expected_zone_name} does not exist in existing_zones, but it does! we found {:?}", zone);
+                        }
+                    }
+
+                    Err(e) => {
+                        panic!("saw error trying to find zone: {:?}", e);
+                    }
+                }
+
                 info!(
                     self.inner.log,
-                    "Service zone {} does not yet exist", req.zone.zone_type
+                    "Service zone {} does not yet exist", &expected_zone_name
                 );
             }

And finally saw it panic:

20:15:40.471Z INFO SledAgent: contacting server nexus, registering sled
    baseboard = Baseboard { identifier: "fa7f4ee88f2d4de4aaee435d97832cea", model: "i86pc", revision: 0 }
    id = bc991e2b-3059-4313-8973-8cd7507b3b35
    sled_id = bc991e2b-3059-4313-8973-8cd7507b3b35

20:15:41.463Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_external_dns

WARNING: Failed to delete OPTE port overlay VNIC while dropping port. The VNIC will not be cleaned up properly, and the xde device itself
will not be deleted. Both the VNIC and the xde device must be deleted out of band, and it will not be possible to recreate the xde device
until then. Error: DeleteVnicError { name: "vopte2", err: CommandFailure(CommandFailureInfo { command: "/usr/sbin/dladm delete-vnic vopte2",
status: ExitStatus(unix_wait_status(256)), stdout: "", stderr: "dladm: vnic deletion failed: link busy\n" }) }

20:16:02.356Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:101::1]:12345
    remote_addr = [fd00:1122:3344:104::1]:33615

20:16:02.357Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
thread 'tokio-runtime-worker' panicked at 'zone oxz_external_dns does not exist in existing_zones, but it does! we found Zone { id: Some(48), name: "oxz_external_dns", state: Running, path: "/zone/oxz_external_dns", uuid: Some("6b68cf0e-591c-e153-948e-99042f31281a"), brand: "omicron1", ip_type: Exclusive }', sled-agent/src/services.rs:1797:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

This looks like the same pattern as before, except this time the sled agent hits the panic I put in. I'm going to attempt another repro with

diff --git a/smf/sled-agent/manifest.xml b/smf/sled-agent/manifest.xml
index 5732aedd7..c7582b361 100644
--- a/smf/sled-agent/manifest.xml
+++ b/smf/sled-agent/manifest.xml
@@ -40,6 +40,7 @@
   <method_context>
     <method_environment>
       <envvar name="LD_LIBRARY_PATH" value="/opt/ooce/pgsql-13/lib/amd64" />
+      <envvar name="RUST_BACKTRACE" value="1" />
     </method_environment>
   </method_context>
   <exec_method type='method' name='start'

so that we can see a backtrace.

jmpesp commented 1 year ago

20:52:42.293Z INFO SledAgent (BootstrapAgent): Zone booting
    zone = oxz_crucible_pantry
20:52:56.408Z INFO SledAgent (dropshot (SledAgent)): accepted connection
    local_addr = [fd00:1122:3344:103::1]:12345
    remote_addr = [fd00:1122:3344:104::1]:62099
20:52:56.409Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: Nexus
20:52:56.410Z INFO SledAgent (BootstrapAgent): Service zone oxz_nexus already exists
20:52:56.410Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: ExternalDns
20:52:56.410Z INFO SledAgent (BootstrapAgent): Service zone oxz_external_dns already exists
20:52:56.410Z INFO SledAgent (BootstrapAgent): Ensuring service zone is initialized: CruciblePantry
thread 'tokio-runtime-worker' panicked at 'zone oxz_crucible_pantry does not exist in existing_zones, but it does! we found Zone { id: Some(58), name: "oxz_crucible_pantry", state: Running, path: "/zone/oxz_crucible_pantry", uuid: Some("77d817d9-0e8b-c33b-b8f1-c1d77561b029"), brand: "omicron1", ip_type: Exclusive }', sled-agent/src/services.rs:1797:29
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: omicron_sled_agent::services::ServiceManager::initialize_services_locked::{{closure}}
   3: <FuncType as dropshot::handler::HttpHandlerFunc<Context,(T0,),ResponseType>>::handle_request::{{closure}}
   4: <dropshot::handler::HttpRouteHandler<Context,HandlerType,FuncParams,ResponseType> as dropshot::handler::RouteHandler<Context>>::handle_request::{{closure}}
   5: dropshot::server::http_request_handle_wrap::{{closure}}
   6: <hyper::proto::h1::dispatch::Server<S,hyper::body::body::Body> as hyper::proto::h1::dispatch::Dispatch>::poll_msg
   7: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_loop
   8: hyper::proto::h1::dispatch::Dispatcher<D,Bs,I,T>::poll_catch
   9: <hyper::server::conn::upgrades::UpgradeableConnection<I,S,E> as core::future::future::Future>::poll
  10: <hyper::server::server::new_svc::NewSvcTask<I,N,S,E,W> as core::future::future::Future>::poll
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::harness::Harness<T,S>::poll
  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  14: tokio::runtime::scheduler::multi_thread::worker::Context::run
  15: tokio::runtime::scheduler::multi_thread::worker::run
  16: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  17: tokio::runtime::task::core::Core<T,S>::poll
  18: tokio::runtime::task::harness::Harness<T,S>::poll
  19: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

jmpesp commented 1 year ago

I've managed to reproduce this with a small program:

use anyhow::anyhow;
use anyhow::Result;

use dropshot::endpoint;
use dropshot::ApiDescription;
use dropshot::HttpError;
use dropshot::HttpResponseOk;
use dropshot::RequestContext;
use dropshot::{ConfigLogging, ConfigLoggingIfExists, ConfigLoggingLevel};

use schemars::JsonSchema;
use serde::Serialize;

use std::sync::Arc;
use tokio::sync::Mutex;

struct Opaque {}

struct ServerContextInner {
    l: Mutex<Vec<Opaque>>,
}

#[derive(Clone)]
struct ServerContext {
    inner: Arc<ServerContextInner>,
}

pub async fn empty_function() {
}

pub async fn run_ls() {
    let mut command = tokio::process::Command::new("ls");
    command.output().await.unwrap();
}

impl ServerContext {
    pub async fn step_one(&self) {
        let mut ctx = self.inner.l.lock().await;
        self.step_two(&mut ctx).await;
    }

    pub async fn step_two(&self,
        _list: &mut Vec<Opaque>,
    ) {
        // Crash if this file exists
        if std::path::Path::new("foo").exists() {
            panic!("oh no! foo exists");
        }

        // Create the file - we've got the lock here, so we should be the only
        // one that sees this file!
        std::fs::write("foo", "bad!").unwrap();

        // Execute a program that sleeps for 4 seconds, then prints 'online'
        let output = std::process::Command::new("./prog.sh")
            .output()
            .unwrap();
        let stdout = String::from_utf8_lossy(&output.stdout);
        assert_eq!(stdout.trim(), "online");

        // If the original request timed out during the execution of prog.sh
        // above, the theory is that either dropshot or hyper is aborting this
        // task. This means that execution never returns after hitting this
        // await point, and the file never gets removed. 

        // But, it matters what future is being awaited on. Execution never
        // returns for the yield_now and sleep calls, but does for the
        // empty_function and run_ls calls.

        //tokio::task::yield_now().await;
        tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        //empty_function().await;
        //run_ls().await

        // Remove the file
        std::fs::remove_file("foo").unwrap();
    }
}

#[derive(JsonSchema,Serialize)]
struct ResultBody {
    len: usize,
}

#[endpoint {
    method = PUT,
    path = "/foobar",
}]
async fn foobar(
    rqctx: RequestContext<ServerContext>,
) -> Result<HttpResponseOk<ResultBody>, HttpError> {
    let ctx = rqctx.context();

    ctx.step_one().await;

    Ok(HttpResponseOk(ResultBody { len: 0 }))
}

fn api() -> ApiDescription<ServerContext> {
    let mut api = ApiDescription::<ServerContext>::new();

    api.register(foobar).expect("foobar worked");

    api
}

#[tokio::main]
async fn main() -> Result<()> {
    let log = ConfigLogging::File {
        level: ConfigLoggingLevel::Info,
        path: "/dev/stdout".into(),
        if_exists: ConfigLoggingIfExists::Append,
    }
    .to_logger("playground")?;

    let server = dropshot::HttpServerStarter::new(
        &dropshot::ConfigDropshot {
            bind_address: "0.0.0.0:12345".parse().unwrap(),
            request_body_max_bytes: 1024 * 10,
            ..Default::default()
        },
        api(),
        ServerContext {
            inner: Arc::new(
                ServerContextInner {
                    l: Mutex::new(vec![]),
                }
            ),
        },
        &log.new(slog::o!("component" => "dropshot")),
    )
    .map_err(|e| anyhow!("creating server: {:?}", e))?
    .start();

    server
        .await
        .map_err(|e| anyhow!("starting server: {:?}", e))
}
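
The curl commands used to drive this repro aren't included in the thread. As a stand-in, here is a hypothetical driver using only std (the address, timeout, and request shape are assumptions): it sends the PUT twice, giving up on the first request before the handler's prog.sh call finishes, which closes the connection the same way a curl with a short --max-time would.

// Hypothetical driver (the actual curl commands aren't shown in the thread).
// The first request is abandoned early; if the theory in the comments above
// is right, that cancels the handler future mid-way, leaving "foo" behind so
// the second request trips the panic.
use std::io::{Read, Write};
use std::net::TcpStream;
use std::time::Duration;

fn put_foobar(read_timeout: Option<Duration>) {
    let mut stream = TcpStream::connect("127.0.0.1:12345").unwrap();
    stream.set_read_timeout(read_timeout).unwrap();
    write!(
        stream,
        "PUT /foobar HTTP/1.1\r\nHost: localhost\r\nContent-Length: 0\r\n\r\n"
    )
    .unwrap();
    // Either read a response or hit the read timeout; dropping the stream on
    // return closes the connection, like curl giving up at --max-time.
    let mut buf = [0u8; 1024];
    let _ = stream.read(&mut buf);
}

fn main() {
    // First PUT: give up after 2 seconds, well before prog.sh's 4 second sleep.
    put_foobar(Some(Duration::from_secs(2)));
    // Second PUT: wait for the full response; this one should hit the panic.
    put_foobar(None);
}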

I've noted in the comments what I think is happening here:

With another terminal running two curl PUTs against /foobar, where the first uses a client-side timeout, the dropshot logs look like:

13:51:20.885Z INFO playground (dropshot): listening
    local_addr = 0.0.0.0:12345
13:51:22.185Z INFO playground (dropshot): accepted connection
    local_addr = 0.0.0.0:12345
    remote_addr = 127.0.0.1:41542
13:51:26.188Z INFO playground (dropshot): accepted connection
    local_addr = 0.0.0.0:12345
    remote_addr = 127.0.0.1:41554
thread 'tokio-runtime-worker' panicked at 'oh no! foo exists', src/main.rs:48:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

For the case where curl is called without the timeout, the logs look like:

13:57:04.865Z INFO playground (dropshot): listening
    local_addr = 0.0.0.0:12345
13:57:09.692Z INFO playground (dropshot): accepted connection
    local_addr = 0.0.0.0:12345
    remote_addr = 127.0.0.1:34684
13:57:13.796Z INFO playground (dropshot): request completed
    local_addr = 0.0.0.0:12345
    method = PUT
    remote_addr = 127.0.0.1:34684
    req_id = fceedcdc-6180-4586-8235-1117f99633ac
    response_code = 200
    uri = /foobar
13:57:16.436Z INFO playground (dropshot): accepted connection
    local_addr = 0.0.0.0:12345
    remote_addr = 127.0.0.1:34690
13:57:20.540Z INFO playground (dropshot): request completed
    local_addr = 0.0.0.0:12345
    method = PUT
    remote_addr = 127.0.0.1:34690
    req_id = d4c7b4e3-ecfe-4090-82c2-cac2b125b471
    response_code = 200
    uri = /foobar

This explains what I was seeing: the original call to PUT /services timed out while the zones were being brought up, and a second call was made after a short delay. By then the zones had booted and been partially configured, but they were never added to the existing_zones vec, because execution never came back to finish the function after the first request was dropped.
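
One general way to make a handler like this robust to client-side cancellation is to move the non-cancellation-safe work onto a spawned tokio task and await its JoinHandle. The following is only a sketch of that idea, reusing the ServerContext and Opaque types from the repro above; it is not necessarily how omicron addressed the issue. If the request future is dropped because the client went away, only the await on the handle is cancelled; the spawned task keeps running to completion, so the cleanup and the bookkeeping (here, the push) still happen.

// Sketch only: run the work on a detached task so dropping the request
// future doesn't abandon it half-way. Reuses ServerContext/Opaque from the
// repro above; not necessarily the fix omicron adopted.
impl ServerContext {
    pub async fn step_one_detached(&self) -> Result<(), anyhow::Error> {
        let inner = self.inner.clone();
        let handle = tokio::spawn(async move {
            // The guard lives inside the spawned task, so it is released when
            // this block finishes, even if the HTTP request is long gone.
            let mut list = inner.l.lock().await;
            // ... create the sentinel file, run prog.sh, remove the file ...
            list.push(Opaque {});
        });
        // If this await is cancelled, the spawned task still runs to completion.
        handle.await?;
        Ok(())
    }
}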