oxidecomputer / crucible

A storage service.

dsc failed to start a downstairs #1498

Open leftwo opened 2 weeks ago

leftwo commented 2 weeks ago

In a CI job (link may be bad): https://buildomat.eng.oxide.computer/wg/0/artefact/01J9KZDBT7Q76BMAZ3NFT2EM6G/JHuMk44VV03fuovquSYDTdL9uCEruARhbYRD7WhRUTif7Lam/01J9KZDV6QXGVBQG0ZHA6JNXTH/01J9M3TRMJHBD7KTGG9QWWK022/dsc-out.txt

The test timed out. Looking at the test log output, we can see that a new check we added is waiting for all the downstairs to respond:


120 | 2024-10-07T17:27:21.088Z | + ptime -m bash /input/build/work/scripts/test_up.sh -r 2 encrypted
121 | 2024-10-07T17:27:21.099Z | /input/build/work
122 | 2024-10-07T17:27:21.111Z | Using 2 region sets
123 | 2024-10-07T17:27:21.126Z | Upstairs using key: 1gql5EL5hhzi5hY5rXSvSfN31jxffOX/9IhgN0kkwMI=
124 | 2024-10-07T17:27:21.129Z | dsc output goes to /tmp/test_up-build/dsc-out.txt
125 | 2024-10-07T17:27:21.129Z | Creating 6 downstairs regions
126 | 2024-10-07T17:27:21.586Z | Starting 6 downstairs
127 | 2024-10-07T17:27:21.589Z | dsc started at PID: 1191
128 | 2024-10-07T17:27:26.657Z | Downstairs client 0 not running, waiting for it
129 | 2024-10-07T17:27:31.718Z | Downstairs client 0 not running, waiting for it
130 | 2024-10-07T17:27:36.778Z | Downstairs client 0 not running, waiting for it

That "waiting for it" loop continues until the test hits its timeout and we abort the run. In the dsc logs we can see that client 0 got to Starting, but never to Running:

/var/tmp/bins/dsc create --encrypted --cleanup --region-count 6 --extent-size 10 --extent-count 5 --output-dir /tmp/test_up-build/dsc --ds-bin /var/tmp/bins/crucible-downstairs --region-dir /var/tmp/test_up-build
Removing existing dsc directory "/tmp/test_up-build/dsc"
Creating region directory at: /var/tmp/test_up-build
Creating dsc directory at: /tmp/test_up-build/dsc
Downstairs region 0 created at /var/tmp/test_up-build/8810 in 0.26448342
Downstairs region 1 created at /var/tmp/test_up-build/8820 in 0.036061257
Downstairs region 2 created at /var/tmp/test_up-build/8830 in 0.03765977
Downstairs region 3 created at /var/tmp/test_up-build/8840 in 0.047207728
Downstairs region 4 created at /var/tmp/test_up-build/8850 in 0.024012322
Downstairs region 5 created at /var/tmp/test_up-build/8860 in 0.022937205
Created 6 regions
/var/tmp/bins/dsc start --output-dir /tmp/test_up-build/dsc --ds-bin /var/tmp/bins/crucible-downstairs --region-dir /var/tmp/test_up-build --region-count 6
Using existing output directory "/tmp/test_up-build/dsc"
Update our region info with: Some(RegionExtentInfo { block_size: 4096, blocks_per_extent: 10, extent_count: 5 })
start ds: 8810
start ds: 8820
start ds: 8830
start ds: 8840
start ds: 8850
start access at:127.0.0.1:9998
start ds: 8860
Starting downstairs at port 8810
Starting downstairs at port 8830
Make output file at "/tmp/test_up-build/dsc/downstairs-8810.txt"
Make output file at "/tmp/test_up-build/dsc/downstairs-8830.txt"
Starting downstairs at port 8820
Make output file at "/tmp/test_up-build/dsc/downstairs-8820.txt"
Starting downstairs at port 8840
Make output file at "/tmp/test_up-build/dsc/downstairs-8840.txt"
Starting downstairs at port 8850
Make output file at "/tmp/test_up-build/dsc/downstairs-8850.txt"
Starting downstairs at port 8860
Make output file at "/tmp/test_up-build/dsc/downstairs-8860.txt"
[8810][0] initial start wait reports Starting
[8830][2] initial start wait reports Starting
[8820][1] initial start wait reports Starting
[8840][3] initial start wait reports Starting
[8850][4] initial start wait reports Starting
[8860][5] initial start wait reports Starting
Downstairs /var/tmp/test_up-build/8850 port 8850 PID:Some(1193)
[8850][4] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8820 port 8820 PID:Some(1196)
[8820][1] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8840 port 8840 PID:Some(1197)
[8840][3] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8860 port 8860 PID:Some(1195)
[8860][5] initial start wait reports Running
Downstairs /var/tmp/test_up-build/8830 port 8830 PID:Some(1198)
[8830][2] initial start wait reports Running
Oct 07 17:27:21.192 INFO listening, local_addr: 127.0.0.1:9998
Control access at:127.0.0.1:9998
Oct 07 17:27:26.229 INFO accepted connection, remote_addr: 127.0.0.1:35620, local_addr: 127.0.0.1:9998
Oct 07 17:27:26.231 INFO request completed, latency_us: 407, response_code: 200, uri: /state/cid/0, method: GET, req_id: 0ffa13a9-ee3c-4b4c-b7a6-c51ce6f2348d, remote_addr: 127.0.0.1:35620, local_addr: 127.0.0.1:9998
Oct 07 17:27:31.292 INFO accepted connection, remote_addr: 127.0.0.1:42984, local_addr: 127.0.0.1:9998
Oct 07 17:27:31.292 INFO request completed, latency_us: 158, response_code: 200, uri: /state/cid/0, method: GET, req_id: a61885b0-a975-4400-867a-6c121fb549a0, remote_addr: 127.0.0.1:42984, local_addr: 127.0.0.1:9998
Oct 07 17:27:36.352 INFO accepted connection, remote_addr: 127.0.0.1:41958, local_addr: 127.0.0.1:9998
Oct 07 17:27:36.353 INFO request completed, latency_us: 165, response_code: 200, uri: /state/cid/0, method: GET, req_id: 3fda854a-0c0b-4ccf-ae32-810e7bf7a3b5, remote_addr: 127.0.0.1:41958, local_addr: 127.0.0.1:9998
Oct 07 17:27:41.413 INFO accepted connection, remote_addr: 127.0.0.1:55992, local_addr: 127.0.0.1:9998
Oct 07 17:27:41.413 INFO request completed, latency_us: 168, response_code: 200, uri: /state/cid/0, method: GET, req_id: 839a976b-489e-4375-9fef-16311ae7953e, remote_addr: 127.0.0.1:55992, local_addr: 127.0.0.1:9998
Oct 07 17:27:46.474 INFO accepted connection, remote_addr: 127.0.0.1:45792, local_addr: 127.0.0.1:9998
Oct 07 17:27:46.475 INFO request completed, latency_us: 159, response_code: 200, uri: /state/cid/0, method: GET, req_id: 15182d54-4131-4011-9787-fb84d2f63414, remote_addr: 127.0.0.1:45792, local_addr: 127.0.0.1:9998
Oct 07 17:27:51.537 INFO accepted connection, remote_addr: 127.0.0.1:34146, local_addr: 127.0.0.1:9998

A third bit of information: if I look at the output for client 0 (port 8810), I do see that it started:

{"msg":"Opened existing region file \"/var/tmp/test_up-build/8810/region.json\"","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.199648921Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Database read version 1","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.201163781Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Database write version 1","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.20118688Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"UUID: 12345678-0000-0000-0000-000000008810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.20120484Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Blocks per extent:10 Total Extents: 5","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.201221359Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194}
{"msg":"Crucible Version: Crucible Version: 0.0.1\nCommit SHA: fb9332b3671c250552cf7fa7cbd0a2572ab72b79\nCommit timestamp: 2024-10-07T17:07:42.000000000Z  branch: alan/test-more-regions\nrustc: 1.80.0 stable x86_64-unknown-illumos\nCargo: x86_64-unknown-illumos  Debug: true Opt level: 0","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.290495955Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"Upstairs <-> Downstairs Message Version: 11","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.290804692Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"Repair listens on 0.0.0.0:12810 for path:\"/var/tmp/test_up-build/8810\"","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.296602114Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"repair"}
{"msg":"listening","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298231449Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"local_addr":"0.0.0.0:12810","task":"repair"}
{"msg":"Using repair address: 0.0.0.0:12810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298915952Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"No SSL acceptor configured","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.298953321Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}
{"msg":"downstairs listening on 0.0.0.0:8810","v":0,"name":"crucible","level":30,"time":"2024-10-07T17:27:21.299080086Z","hostname":"w-01J9M08HMXZRTCH251ZAGJECCQ","pid":1194,"task":"main"}

This suggests the problem is somewhere in the communication between the different parts of dsc, rather than in the downstairs process itself.
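
For what it's worth, here is my rough mental model of that flow. The names below are made up for illustration and are not the actual dsc types: each per-client task spawns its downstairs and then advances a shared state from Starting to Running, and the control server at 127.0.0.1:9998 simply reports that shared state when the test polls /state/cid/0. If the spawn step hangs, the reported state stays at Starting forever, even though the downstairs process itself is up and listening.

use std::sync::{Arc, Mutex};

// Illustrative only -- the real state machine lives in dsc/src/main.rs and
// uses different names.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ClientState {
    Starting,
    Running,
}

struct Client {
    state: Arc<Mutex<ClientState>>,
}

impl Client {
    // Per-client start path: if spawn_downstairs() never returns, the state
    // never advances and the control server keeps answering "Starting".
    fn start_and_monitor(&self) -> anyhow::Result<()> {
        *self.state.lock().unwrap() = ClientState::Starting;
        let _child = spawn_downstairs()?; // the step that appears to hang
        *self.state.lock().unwrap() = ClientState::Running;
        Ok(())
    }

    // Roughly what GET /state/cid/{cid} reports back to the test.
    fn report(&self) -> ClientState {
        *self.state.lock().unwrap()
    }
}

fn spawn_downstairs() -> anyhow::Result<std::process::Child> {
    unimplemented!("stand-in for the code that spawns crucible-downstairs")
}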

leftwo commented 2 weeks ago

In dsc/src/main.rs, we have the following code that starts a downstairs:

    fn start(&self) -> Result<Child> {
        println!("Make output file at {:?}", self.output_file);
        let outputs = File::create(&self.output_file)
            .context("Failed to create output file")?;
        let errors = outputs.try_clone()?;

        let port_value = format!("{}", self.port);

        let mode = if self.read_only {
            "ro".to_string()
        } else {
            "rw".to_string()
        };

        let region_dir = self.region_dir.clone();
        let cmd = Command::new(self.ds_bin.clone())
            .args([
                "run",
                "-p",
                &port_value,
                "-d",
                &region_dir,
                "--mode",
                &mode,
            ])
            .stdout(Stdio::from(outputs))
            .stderr(Stdio::from(errors))
            .spawn()
            .context("Failed trying to run downstairs")?;

        println!(
            "Downstairs {} port {} PID:{:?}",
            region_dir,
            self.port,
            cmd.id()
        );

In the logs above, we see the first message (Make output file at ...) for this client, but we never see the second (Downstairs {} port {} PID: ...). So the downstairs command does in fact spawn (its output shows up in the output file), but as near as I can determine the call to spawn() never returns.
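
One way I could confirm that on the next repro is to put a deadline around the spawn so a hang turns into an explicit error. This is only a sketch of the idea (the helper name and the 10 second deadline are made up, and it pushes a std::process::Command onto the blocking pool rather than using the tokio Command we use today):

use std::process::{Command, Stdio};
use std::time::Duration;

use anyhow::{bail, Context, Result};
use tokio::time::timeout;

// Hypothetical diagnostic helper: run the blocking spawn on tokio's blocking
// pool and give it a deadline, so a hung spawn() surfaces as an error in the
// dsc log instead of a silent stall.
async fn spawn_with_deadline(
    ds_bin: String,
    port_value: String,
    region_dir: String,
) -> Result<std::process::Child> {
    let join = tokio::task::spawn_blocking(move || {
        Command::new(ds_bin)
            .args(["run", "-p", &port_value, "-d", &region_dir, "--mode", "rw"])
            .stdout(Stdio::null())
            .stderr(Stdio::null())
            .spawn()
            .context("Failed trying to run downstairs")
    });

    match timeout(Duration::from_secs(10), join).await {
        // The outer ? covers a panic in the blocking task; the inner Result
        // is the spawn itself.
        Ok(res) => res?,
        // Note: the blocking task cannot be cancelled, so if spawn() really
        // is wedged this reports the hang but leaks that thread.
        Err(_) => bail!("spawn() did not return within 10 seconds"),
    }
}

Even just a second println! immediately after the spawn() call would tell us the same thing, but the deadline makes the hang show up on its own instead of waiting for the CI timeout.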

leftwo commented 2 weeks ago

I was able to reproduce it outside of CI using a debug build.

The stacks look mostly similar, but this one stuck out: the thread is blocked in a read() inside the std process spawn path:

 fffff9ffeeb7bf8a read     (20, fffff9ffec3fbc00, 8)
 0000000003add616 std::sys::pal::unix::process::process_inner::<impl std::sys::pal::unix::process::process_common::Command>::spawn::h356b47fec30ee6a3 () + 356
 0000000003ace1ec std::process::Command::spawn::hb5ff4e4d3aae5893 () + 1c
 00000000036eb319 tokio::process::imp::spawn_child::h2e091d8168139f18 () + 29
 00000000036c3ea0 tokio::process::Command::spawn::h82818d972f5bd22b () + 30
 0000000001ebd1ff dsc::DownstairsInfo::start::hdb9e8fbe2e7641c0 () + 4af
 0000000001fc382a dsc::start_ds::{{closure}}::h72fb5d778549a741 () + 39a
 0000000001fc3d27 dsc::ds_start_monitor::{{closure}}::h79d20fed6577f95e () + 1a7
 0000000001fc340b dsc::start_dsc::{{closure}}::{{closure}}::h4710d443c5725ecc () + db
 0000000001ee5cde tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::h25306ec96d735a5a () + 6e
 0000000001ee4d6c tokio::runtime::task::core::Core<T,S>::poll::h323db97a5faf7906 () + 2c
 0000000001f23dcf tokio::runtime::task::harness::poll_future::{{closure}}::h4932b9dce84b5721 () + 3f
 0000000001e8f172 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hcffcc65ef2a830a9 () + 22
 0000000001f99be6 std::panicking::try::do_call::h791a875498f9620a () + 36
 0000000001f94064 std::panicking::try::h1ebe6082f4923d75 () + 44
 0000000001e901e4 std::panic::catch_unwind::h280a3ab6527a6078 () + 14
 0000000001f22563 tokio::runtime::task::harness::poll_future::hcb4b3af1b5128951 () + 73
 0000000001f251d6 tokio::runtime::task::harness::Harness<T,S>::poll_inner::h1765a596355314e0 () + 86
 0000000001f2b3e5 tokio::runtime::task::harness::Harness<T,S>::poll::h9cb104db5087b249 () + 15
 0000000001ecbc2d tokio::runtime::task::raw::poll::hafbc00b5629743bc () + 1d
 0000000003735146 tokio::runtime::task::raw::RawTask::poll::hbcd9b90a9f68e73f () + 26
 0000000003742072 tokio::runtime::task::LocalNotified<S>::run::h82c13fc69d40bfa2 () + 22
 00000000036c898c tokio::runtime::scheduler::multi_thread::worker::Context::run_task::{{closure}}::h9ecf202fb4aedd36 () + 1c
 00000000036c8950 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h9f2957183dfc0a17 () + 120
 00000000036c8634 tokio::runtime::scheduler::multi_thread::worker::Context::run::hb79d7578a4d720cc () + 1b4
 00000000036c8424 tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::{{closure}}::h947c9cce25841432 () + 34
 000000000370ff9f tokio::runtime::context::scoped::Scoped<T>::set::h589875c319f11e61 () + 5f
 00000000036c547b tokio::runtime::context::set_scheduler::{{closure}}::h946422b065b2537b () + 2b
 00000000036cd4a7 std::thread::local::LocalKey<T>::try_with::hbff304b678572a27 () + b7
 00000000036cbaa1 std::thread::local::LocalKey<T>::with::h7e497ece18cd4db2 () + 11
 00000000036c5440 tokio::runtime::context::set_scheduler::h92d580ae830637b9 () + 30
 00000000036c83d8 tokio::runtime::scheduler::multi_thread::worker::run::{{closure}}::h23bf5525e61184da () + c8
 000000000375a83e tokio::runtime::context::runtime::enter_runtime::h1093cfccbd6ee326 () + ae
 00000000036c829b tokio::runtime::scheduler::multi_thread::worker::run::h6416b5163edbc5b8 () + 11b
 00000000036c8171 tokio::runtime::scheduler::multi_thread::worker::Launch::launch::{{closure}}::hc6817b4d669b3e32 () + 11
 0000000003707056 <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll::h234967f3a9f321ed () + 56
 00000000036e37a5 tokio::runtime::task::core::Core<T,S>::poll::{{closure}}::h596f8a7421257a1b () + 75
 00000000036e2afc tokio::runtime::task::core::Core<T,S>::poll::h7990afe31ca5173b () + 2c
 00000000036b649f tokio::runtime::task::harness::poll_future::{{closure}}::h2b16ec35846ce065 () + 3f
 0000000003746182 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::hc3c3bbccb1345b38 () + 22
 0000000003767a06 std::panicking::try::do_call::h80b93ac65f9ea34a () + 36
 0000000003763c04 std::panicking::try::h62a93063541a4e4f () + 44
 0000000003751c24 std::panic::catch_unwind::hf4eeead54b5fbc66 () + 14
 00000000036b33b3 tokio::runtime::task::harness::poll_future::h33ea2f16a04c8f82 () + 73
 00000000036a9906 tokio::runtime::task::harness::Harness<T,S>::poll_inner::h2151a92fefbb26a3 () + 86
 00000000036a8dd5 tokio::runtime::task::harness::Harness<T,S>::poll::h6fa656e3a74f94bf () + 15
 000000000373563d tokio::runtime::task::raw::poll::h6c0e9ed682716b6b () + 1d
 0000000003735146 tokio::runtime::task::raw::RawTask::poll::hbcd9b90a9f68e73f () + 26
 0000000003742106 tokio::runtime::task::UnownedTask<S>::run::hd214edb526cc3a17 () + 26
 0000000003702109 tokio::runtime::blocking::pool::Task::run::ha15a3499ed14041a () + 19
 000000000370643c tokio::runtime::blocking::pool::Inner::run::hc0d1077484ccb8f8 () + 11c
 00000000037062c9 tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}::h9b828a85de1189c0 () + 39
 00000000037376ed std::sys_common::backtrace::__rust_begin_short_backtrace::hcde0af8ac6e94548 () + d
 0000000003714d50 std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}::h682a7ee890499fa7 () + 30
 0000000003745470 <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once::h1db3259947d8925b () + 30
 0000000003767bc0 std::panicking::try::do_call::h95bd2eb6539cb660 () + 40
 0000000003762b0b std::panicking::try::h1ff37b4fa23118c9 () + 5b
 0000000003714be8 std::thread::Builder::spawn_unchecked_::{{closure}}::h13ddbdd7ae4a1097 () + 1f8
 00000000036b7c71 core::ops::function::FnOnce::call_once{{vtable.shim}}::hb6cc2aa0206b9ae1 () + 11
 0000000003adce5b std::sys::pal::unix::thread::Thread::new::thread_start::hc6a6dcaabc211891 () + 1b
 fffff9ffeeb74f37 _thrp_setup (fffff9ffeec93a40) + 77
 fffff9ffeeb75280 _lwp_start ()
mkeeter commented 1 week ago

I believe this was incorrectly closed because you included the phrase fix: #1498 in #1504!