Option::unwrap() on a None value in cache_set

lolgesten commented 6 years ago

We're caching some results of looking up data in S3, and every few days we get a panic in cached that poisons the internal mutex.

Our cached definition is this:

cached_key_result! {
    QUERY: SizedCache<String, Vec<Inventory>> = SizedCache::with_size(100);
    Key = { format!("{}/{}/{}", region, bucket, recording_id) };
    fn cached_query(region: &str, bucket: &str, recording_id: &str) -> Result<Vec<Inventory>> = {
        match do_query(region, bucket, recording_id) {
            Ok(v) => if v.is_empty() {
                Err(InventoryError::new(400, "No match"))
            } else {
                Ok(v)
            },
            Err(e) => Err(e)
        }
    }
}

The crash is this where cached is on line 9:

Sep 18 21:29:44: thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', libcore/option.rs:345:21
Sep 18 21:29:44: note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Sep 18 21:29:44: stack backtrace:
Sep 18 21:29:44:    0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
Sep 18 21:29:44:              at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
Sep 18 21:29:44:    1: std::sys_common::backtrace::print
Sep 18 21:29:44:              at libstd/sys_common/backtrace.rs:71
Sep 18 21:29:44:              at libstd/sys_common/backtrace.rs:59
Sep 18 21:29:44:    2: std::panicking::default_hook::{{closure}}
Sep 18 21:29:44:              at libstd/panicking.rs:211
Sep 18 21:29:44:    3: std::panicking::default_hook
Sep 18 21:29:44:              at libstd/panicking.rs:227
Sep 18 21:29:44:    4: std::panicking::rust_panic_with_hook
Sep 18 21:29:44:              at libstd/panicking.rs:511
Sep 18 21:29:44:    5: std::panicking::continue_panic_fmt
Sep 18 21:29:44:              at libstd/panicking.rs:426
Sep 18 21:29:44:    6: rust_begin_unwind
Sep 18 21:29:44:              at libstd/panicking.rs:337
Sep 18 21:29:44:    7: core::panicking::panic_fmt
Sep 18 21:29:44:              at libcore/panicking.rs:92
Sep 18 21:29:44:    8: core::panicking::panic
Sep 18 21:29:44:              at libcore/panicking.rs:53
Sep 18 21:29:44:    9: <cached::stores::SizedCache<K, V> as cached::Cached<K, V>>::cache_set
Sep 18 21:29:44:   10: recoordinator::inventory::s3::query
Sep 18 21:29:44:   11: recoordinator::inventory::s3::query_one
Sep 18 21:29:44:   12: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &'a mut F>::call_once
Sep 18 21:29:44:   13: <&'a mut I as core::iter::iterator::Iterator>::next
Sep 18 21:29:44:   14: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter
Sep 18 21:29:44:   15: recoordinator::reel::to_desc::reel_to_desc
Sep 18 21:29:44:   16: recoordinator::dispatch
Sep 18 21:29:44:   17: std::panicking::try::do_call
Sep 18 21:29:44:   18: __rust_maybe_catch_panic
Sep 18 21:29:44:              at libpanic_unwind/lib.rs:105

This is probably from one of these two rows:

https://github.com/jaemk/cached/blob/0ed46ce9d6754da572d38617547a7b6b421007a9/src/stores.rs#L124-L125

I guess one of these unwrap() assumptions doesn't hold true for us?

lolgesten commented 6 years ago

And I just had another crash running with my changes.

thread '<unnamed>' panicked at 'lru_key missing in store', libcore/option.rs:960:5

So it's the first unwrap() that falls over.

jaemk commented 6 years ago

Weird... thanks a bunch for investigating this

jaemk commented 6 years ago

I must have messed up some logic and there's duplicate keys present in the order list

jaemk commented 6 years ago

@lolgesten So there's a racy case - if a cached function is called twice for some long execution with the same key, the same key will be set in the cache twice and the key will be prepended to the order list twice. Once these keys get pushed to the end and are evicted, we hit the panic trying to remove the same key from the hashmap twice.

I can put out a patch later tonight w/ a fix

lolgesten commented 6 years ago

@jaemk Brilliant! Thanks! Problem was a bit more complicated than my patch :P

I'm deploying this to our production servers now. The race happened quite infrequently for us, like once every 5 days. So definite confirmation will take time.

However i think we can close this issue since it seems very likely you fixed it.

jaemk commented 6 years ago

@lolgesten Awesome - Thanks again for reporting and investigating this! Please re-open if the issue crops up again!

jaemk / cached

Option::unwrap() on a None value in cache_set #7