oxidecomputer / crucible

A storage service.

downstairs panics trying to get closed extent #900

Open leftwo opened 1 year ago

leftwo commented 1 year ago

During a test of live repair, this panic was hit:

thread 'tokio-runtime-worker' panicked at 'attempting to get closed extent 9', downstairs/src/region.rs:1789:17

Something is trying to access an extent that is not open:

    async fn get_opened_extent(&self, eid: usize) -> Arc<Extent> {
        match &*self.extents[eid].lock().await {
            ExtentState::Opened(extent) => extent.clone(),
            ExtentState::Closed => {
                panic!("attempting to get closed extent {}", eid)
            }
        }
    }
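
For context, here is a minimal sketch of the state machine that getter assumes; the names below are simplified stand-ins, not the actual definitions in downstairs/src/region.rs. Live repair closes an extent before rebuilding it, so anything that reaches get_opened_extent for that extent during the repair window hits this panic.

    use std::sync::Arc;
    use tokio::sync::Mutex;

    // Simplified stand-ins for the downstairs types (assumptions, not the
    // real crucible definitions).
    struct Extent {
        number: usize,
    }

    enum ExtentState {
        Opened(Arc<Extent>),
        Closed,
    }

    struct Region {
        extents: Vec<Mutex<ExtentState>>,
    }

    impl Region {
        // Mirrors the panicking getter above: callers assume the extent is
        // open, so finding a Closed extent is fatal.
        async fn get_opened_extent(&self, eid: usize) -> Arc<Extent> {
            match &*self.extents[eid].lock().await {
                ExtentState::Opened(extent) => extent.clone(),
                ExtentState::Closed => panic!("attempting to get closed extent {}", eid),
            }
        }

        // What live repair does first (sketch): take the extent out of
        // service so it can be rebuilt from another downstairs.
        async fn close_extent_for_repair(&self, eid: usize) {
            *self.extents[eid].lock().await = ExtentState::Closed;
        }
    }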
leftwo commented 1 year ago

Saw this during testing to reproduce https://github.com/oxidecomputer/crucible/issues/893. It's possibly the same root cause.

leftwo commented 1 year ago

In a recent occurrence of this:

{"msg":"request completed","v":0,"name":"crucible","level":30,"time":"2023-08-30T06:51:32.253337601Z","hostname":"atrium","pid":12671,"uri":"/newextent/0/data","method":"GET","req_id":"a96ac809-20e6-41dd-abbf-d774c8dcef20","remote_addr":"127.0.0.1:59666","local_addr":"0.0.0.0:13910","task":"repair","response_code":"200"}
{"msg":"request completed","v":0,"name":"crucible","level":30,"time":"2023-08-30T06:51:32.25780714Z","hostname":"atrium","pid":12671,"uri":"/newextent/0/db","method":"GET","req_id":"1b1d8358-43b4-4357-8670-3c73a88f9c07","remote_addr":"127.0.0.1:59666","local_addr":"0.0.0.0:13910","task":"repair","response_code":"200"}
{"msg":"39331 job Write for connection UpstairsConnection { upstairs_id: f89cccb1-0b5e-4e0b-8220-f1968545ea11, session_id: cd32a8ed-bc84-4d24-b76f-f5c9029041d6, gen: 1 } waiting on 4 deps","v":0,"name":"crucible","level":40,"time":"2023-08-30T06:51:32.27557186Z","hostname":"atrium","pid":12671,"role":"work"}
{"msg":"39331 job Write for connection UpstairsConnection { upstairs_id: f89cccb1-0b5e-4e0b-8220-f1968545ea11, session_id: cd32a8ed-bc84-4d24-b76f-f5c9029041d6, gen: 1 } waiting on 3 deps","v":0,"name":"crucible","level":40,"time":"2023-08-30T06:51:32.281916013Z","hostname":"atrium","pid":12671,"role":"work"}
thread 'tokio-runtime-worker' panicked at 'attempting to get closed extent 1', downstairs/src/region.rs:1793:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

So it is possibly this write that is trying to access a closed extent?
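
If that is what's happening, the interleaving would be roughly the one sketched below; this is a hypothetical stand-alone reproduction with tokio tasks standing in for the repair and write paths, not the actual downstairs code.

    use std::sync::Arc;
    use std::time::Duration;
    use tokio::sync::Mutex;

    enum ExtentState {
        Opened,
        Closed,
    }

    #[tokio::main]
    async fn main() {
        let extent = Arc::new(Mutex::new(ExtentState::Opened));

        // Repair path: closes the extent so it can be rebuilt.
        let repair = {
            let extent = Arc::clone(&extent);
            tokio::spawn(async move {
                *extent.lock().await = ExtentState::Closed;
            })
        };

        // Write path: the job sits waiting on its dependencies, then tries
        // to use the extent after the repair task has already closed it.
        let write = {
            let extent = Arc::clone(&extent);
            tokio::spawn(async move {
                tokio::time::sleep(Duration::from_millis(10)).await; // "waiting on deps"
                match *extent.lock().await {
                    ExtentState::Opened => { /* perform the write */ }
                    ExtentState::Closed => panic!("attempting to get closed extent 1"),
                }
            })
        };

        repair.await.unwrap();
        let _ = write.await; // the write task panics if it loses the race
    }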

mkeeter commented 1 year ago

We saw something possibly related during this buildomat run for #918. It's not reproducible, so I don't think it's an inherent issue with that PR.

Here's the assertion, which triggers when a block read comes back with non-zero data, but without a context:

Region: es:20 ec:30 bs:4096  ts:2457600  tb:600  max_io:256 or 1048576
Loading write count information from file "/tmp/test_repair_verify.data"
Read and Verify all blocks (0..600 range:true)
thread 'tokio-runtime-worker' panicked at 'assertion failed: response.data[..].iter().all(|&x| x == 0)', upstairs/src/lib.rs:3878:13

This is weird, because it happens after repair is theoretically complete. Previous assertion failures in this issue happened while work was still in flight, so this one may have a different root cause.
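
For reference, the invariant that assertion encodes looks roughly like the sketch below; the types are simplified assumptions, not the actual upstairs definitions. A block that has never been written has no block context, so its data must still be all zeroes; non-zero data without a context means either the data or the context went missing.

    // Simplified stand-ins for the upstairs read-validation path (assumptions).
    struct BlockContext {
        // hash / encryption context for one block, omitted here
    }

    struct ReadResponse {
        data: Vec<u8>,
        block_contexts: Vec<BlockContext>,
    }

    // The check behind the panic message: no context means the block was
    // never written, so any non-zero byte is an inconsistency.
    fn verify_unwritten_block(response: &ReadResponse) {
        if response.block_contexts.is_empty() {
            assert!(response.data[..].iter().all(|&x| x == 0));
        }
    }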