rjeczalik / notify

File system event notification library on steroids.
MIT License
900 stars 128 forks source link

Zero action causing "panic: notify: cannot decode internal mask" #197

Closed imsodin closed 3 years ago

imsodin commented 3 years ago

The panic "notify: cannot decode internal mask" re-occurred since deploying notify with the additional info in the panic message (https://github.com/rjeczalik/notify/pull/194). The action code is 0. The full trace can be found here: https://crash.syncthing.net/report/a269c4d92ca4513fc34c79277ea3eaeb42ab7ca9a561899639895ef404570e48

Unless the cause of this is know and fixable (@ppknap ?) it's probably best to just drop events with a zero action instead of panicking, agreed?

ppknap commented 3 years ago

hey @imsodin ! Indeed, it's likely that we are hitting "leftovers" from the buffer now. We can try skipping the panic(or rather printing the msg in dbg mode and then returning). However, I'm not sure what is the Win completion port status when this happens, we might run into infinite loops if CPs are not able to recover.

I will send a PR with the patch but will also need someone to stress test that on Windows before we move forward.

imsodin commented 3 years ago

So there's a possibility that instead of panicking, it will just silently go into a infinite busy-loop generating zero events instead, right?

The problem is the stress testing done here is releasing it to all of Syncthing users (a lot) and waiting for anonymous crash reports. So if infinite loops are a concern, I'd propose (temporarily) adding some counter/timer that still causes a panic if too many zero events happen back-to-back in some short interval. I believe a panic is preferable to silently looping. Does that make any sense?

And just for context: We are talking about ~600 reported occurrence by ~150 users in the last 14 days, out of minimum 25k active and reporting users (likely quite a bit more, we don't have stats on people with crash reporting enabled, at least not easily available).

imsodin commented 3 years ago

This has been fixed by https://github.com/rjeczalik/notify/pull/198 - thanks!