peterbourgon / diskv

A disk-backed key-value store.
http://godoc.org/github.com/peterbourgon/diskv
MIT License
1.41k stars 102 forks source link

Make the strict memory limit optional #61

Open eric opened 3 years ago

eric commented 3 years ago

We would rather not have a panic() if the memory limit is exceeded due to concurrent access.

This makes the panic optional.

peterbourgon commented 3 years ago

Hmm, I don't think this PR does what you think it does — that panic indicates a reasonably serious error, I think. Can you say a bit more what you're trying to get out of this?

eric commented 3 years ago

Haha, you're totally right. I ended up trying to fix the wrong panic. It's actually this panic that we are running into regularly with errors like:

159451 bytes still won't fit in the cache! (max 10485760 bytes)

As I look at this problem again, I'm feeling more confused than I was before. Previously I was thinking we were running into a race condition, but I see that the caller of ensureCacheSpaceWithLock() takes out a write lock, so there shouldn't be any way that this panic happens if everything has been deleted from the cache, right?

peterbourgon commented 3 years ago

Yes, I can't see how that would manifest. Confusing. Would you be willing to try and reproduce with a bit more info added to the panic?

- panic(fmt.Sprintf("%d bytes still won't fit in the cache! (max %d bytes)", valueSize, d.CacheSizeMax))
+ panic(fmt.Sprintf("%d bytes still won't fit in the cache! (current %d objects and %d bytes, max %d bytes)", valueSize, len(d.cache), d.cacheSize, d.CacheSizeMax)) 
eric commented 3 years ago

I've adapted it to have an option to turn all of these panics into errors and am also adding more to the message so we can diagnose what's going on.

peterbourgon commented 3 years ago

I appreciate the pragmatism of the PR, but the panic is a panic and not an error for a reason :) as this condition appears to have identified a serious bug somewhere. ~If you can't try out the patch I suggested above, can you provide a reproducible test case, or maybe tell me how to build one myself?~

edit: Oops, missed your second clause somehow. I'll wait for more details from you 👍

eric commented 3 years ago

We have this deployed on consumer devices and it only happens infrequently. We don't have any steps to reproduce it.

It's much better for us to eventually stop caching values and quietly return an error than it is to take down the process and stop working entirely.

I agree that it appears there's a real bug here, but we need something to make it so that it doesn't crash, or stop using diskv entirely until we understand what the bug is.