metrics-rs / quanta

high-speed timing library in Rust
MIT License
302 stars 31 forks source link

Intermittent panic due to overflowing our source calibration denominator. #61

Closed tobz closed 2 years ago

tobz commented 2 years ago

This was reported in an old issue in the metrics crate -- metrics-rs/metrics#230 -- but essentially the user was hitting the line where we hypothesized that something was amiss if we managed to wrap around when calling u64::next_power_of_two.

Thinking on this some more, one potential explanation is that the user had different TSC offsets on different cores, and somehow they were hitting a case where the source measurement happening in adjust_cal_ratio was actually on a core where the TSC value was smaller than the value taken initially via Counter::now in calibrate.

That would seemingly explain how the end - start calculation could yield a number that would cause next_power_of_two to overflow, and as long as the absolute delta was smaller than (2^64)-(2^63), we'd always end up with a value that would trigger that overflow.

The bigger question might be: why did this user's set-up somehow manage to trigger this behavior, even at the quoted "maybe once out of 20 times" rate, when quanta is used in many applications that never seem to experience this?

tobz commented 2 years ago

Digging a little further, and with some painfully obvious search terms, I discovered this forum discussion about unexpectedly high TSC warp/deviation specifically on Thinkpads with Ryzen mobile CPUs: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464?page=1#5155358. This matches the system description given by the user in the issue on the metrics repo.

In this case, the issue is one core (CPU #0) having a TSC offset that was much larger than the other cores, seemingly related to the BIOS doing weird things.

All of that said: having rubber ducked it when I wrote up this ticket, any calculation of end.wrapping_sub(start) where the values had an absolute difference of less than 2^63 should return a value that would trigger the overflow. So either the TSC offset was, in fact, that large, or this user was getting somewhat unlucky with thread scheduling every so often by hitting CPU #0 where the TSC warp was in place, but the rest of the cores were identically synchronized?

It's even more fuzzy because the TSC should always be starting from 0, on all cores, as I understand it... but maybe that understanding is incorrect. This could imply that we may actually want to specifically abort calls to adjust_cal_ratio where our measurement is non-monotonic, knowing that we may have hit another core that's warped in the wrong direction or otherwise not synchronized somehow.

More TSC reading required, and maybe it's time to work on #18, too.

BrynCooke commented 2 years ago

I think that I may be hitting this when using the mocka library and have pasted a stack trace at https://github.com/moka-rs/moka/issues/113.

My Lenovo laptop is AMD based, and it happens pretty quickly.

It also occurred once on a CircleCI run.

My colleagues, who have M1 and Intel CPUs have not been able to reproduce.

tobz commented 2 years ago

Argh, yeah, that's not great.

I'll see if I can work on a change that bails out if we detect that the TSC is unstable, since falling back to the OS timing primitives is all we can reasonably do.

tobz commented 2 years ago

@tatsuya6502 I've published a new release: quanta@v0.10.0.

Ironically, I had already merged a fix for this particular panic, but forgot to cut a new release. 🤦🏻

I'm going to close this issue for now because this should definitively fix the given panic, but if anything still looks amiss.. feel free to reopen this issue, or create a new one.

tatsuya6502 commented 2 years ago

Thanks @tobz for fixing the issue and publishing quanta@v0.10.0! I will update my library (moka) to use v0.10.0 so that @BrynCooke can try it out.