Closed tobz closed 2 years ago
Digging a little further, and with some painfully obvious search terms, I discovered this forum discussion about unexpectedly high TSC warp/deviation specifically on ThinkPads with Ryzen mobile CPUs: https://forums.lenovo.com/t5/Other-Linux-Discussions/Clocksource-falling-back-to-HPET-bec-TSC-is-unstable-ThinkPad-E14-Gen-2-maybe-others/m-p/5036464?page=1#5155358. This matches the system description given by the user in the issue on the metrics repo.
In this case, the issue was one core (CPU #0) having a TSC offset that was much larger than the other cores', seemingly related to the BIOS doing weird things.
All of that said: having rubber-ducked it when I wrote up this ticket, any calculation of `end.wrapping_sub(start)` where `end` was behind `start` and the values had an absolute difference of less than 2^63 would return a value that triggers the overflow. So either the TSC offset was, in fact, that large, or this user was getting somewhat unlucky with thread scheduling every so often, hitting CPU #0 where the TSC warp was in place while the rest of the cores were identically synchronized?
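To make that arithmetic concrete, here's a tiny self-contained example with invented numbers (not taken from the report):

```rust
// Hypothetical illustration of the overflow scenario described above; the
// specific values are made up and are not from the user's report.
fn main() {
    // Pretend the calibration start reading was taken on a core whose TSC is
    // slightly ahead of the core where the later reading lands.
    let start: u64 = 1_000_000;
    let end: u64 = 999_000; // non-monotonic: end < start

    // wrapping_sub turns the "negative" delta into a value just under 2^64...
    let delta = end.wrapping_sub(start);
    assert!(delta > (1u64 << 63));

    // ...and no power of two >= that value fits in a u64, so the plain
    // next_power_of_two call would overflow (and panic in debug builds).
    assert_eq!(delta.checked_next_power_of_two(), None);

    println!("wrapped delta = {:#x}", delta);
}
```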
It's even more fuzzy because the TSC should always start from 0 on all cores, as I understand it... but maybe that understanding is incorrect. This could imply that we may actually want to specifically abort calls to `adjust_cal_ratio` where our measurement is non-monotonic, knowing that we may have hit another core that's warped in the wrong direction or otherwise not synchronized somehow.
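As a rough sketch of that idea (hypothetical code, not `adjust_cal_ratio`'s real signature or logic in quanta), the guard would just discard non-monotonic samples:

```rust
// Hypothetical sketch of the proposed guard; the real calibration code in
// quanta differs. The point is simply to drop the sample when the second
// raw reading is behind the first.
fn maybe_adjust_cal_ratio(start_raw: u64, end_raw: u64) -> Option<u64> {
    if end_raw < start_raw {
        // Non-monotonic: we likely landed on a core whose TSC is warped or
        // offset relative to the core that produced the start reading.
        return None;
    }
    // Safe to compute the delta and continue with the calibration math.
    Some(end_raw - start_raw)
}
```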
More TSC reading required, and maybe it's time to work on #18, too.
I think that I may be hitting this when using the moka library and have pasted a stack trace at https://github.com/moka-rs/moka/issues/113.
My Lenovo laptop is AMD based, and it happens pretty quickly.
It also occurred once on a CircleCI run.
My colleagues, who have M1 and Intel CPUs, have not been able to reproduce it.
Argh, yeah, that's not great.
I'll see if I can work on a change that bails out if we detect that the TSC is unstable, since falling back to the OS timing primitives is all we can reasonably do.
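For reference, one cheap way to approximate "is the TSC trustworthy here?" on Linux, purely as an illustrative sketch (this is not necessarily what quanta does, and the sysfs path is Linux-specific), is to check which clocksource the kernel itself settled on and otherwise lean on the OS primitives:

```rust
use std::fs;
use std::time::Instant;

// Illustrative sketch only: the kernel demotes an unstable TSC, so checking
// the active clocksource is one cheap signal for whether raw TSC reads are
// worth trusting at all on this machine.
fn tsc_looks_usable() -> bool {
    fs::read_to_string("/sys/devices/system/clocksource/clocksource0/current_clocksource")
        .map(|s| s.trim() == "tsc")
        .unwrap_or(false)
}

fn main() {
    if tsc_looks_usable() {
        println!("kernel is using TSC; raw counter reads are plausible");
    } else {
        // Fall back to the OS timing primitives.
        let t0 = Instant::now();
        // ... do work ...
        println!("falling back to Instant: {:?}", t0.elapsed());
    }
}
```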
@tatsuya6502 I've published a new release: `quanta@v0.10.0`.
Ironically, I had already merged a fix for this particular panic, but forgot to cut a new release. 🤦🏻
I'm going to close this issue for now because this should definitively fix the given panic, but if anything still looks amiss... feel free to reopen this issue or create a new one.
Thanks @tobz for fixing the issue and publishing `quanta@v0.10.0`! I will update my library (`moka`) to use v0.10.0 so that @BrynCooke can try it out.
This was reported in an old issue in the `metrics` crate -- metrics-rs/metrics#230 -- but essentially the user was hitting the line where we hypothesized that something was amiss if we managed to wrap around when calling `u64::next_power_of_two`.

Thinking on this some more, one potential explanation is that the user had different TSC offsets on different cores, and somehow they were hitting a case where the source measurement happening in `adjust_cal_ratio` was actually on a core where the TSC value was smaller than the value taken initially via `Counter::now` in `calibrate`.

That would seemingly explain how the `end - start` calculation could yield a number that would cause `next_power_of_two` to overflow, and as long as the absolute delta was smaller than (2^64)-(2^63), we'd always end up with a value that would trigger that overflow.

The bigger question might be: why did this user's setup somehow manage to trigger this behavior, even at the quoted "maybe once out of 20 times" rate, when `quanta` is used in many applications that never seem to experience this?
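To visualize the hypothesis, here's a toy model with entirely invented numbers (not from the actual report) in which one core's TSC lags the others: the wrap can only happen when that lag exceeds the ticks elapsed between the two readings, which might also hint at why the panic was intermittent rather than constant.

```rust
// Toy model of the cross-core TSC offset hypothesis; all values are invented.
fn main() {
    let core_a_offset: u64 = 0;              // core the calibration started on
    let core_b_offset: u64 = 0;              // a well-behaved core
    let core_0_lag: u64 = 500_000;           // the "warped" core, lagging behind

    let elapsed_ticks: u64 = 100_000;        // real time between the two readings

    // Start reading taken on core A (think: Counter::now during calibration).
    let start = 1_000_000 + core_a_offset;

    // If the later reading lands on a normal core, end >= start and the
    // delta behaves as expected.
    let end_ok = 1_000_000 + elapsed_ticks + core_b_offset;
    assert!(end_ok.wrapping_sub(start).checked_next_power_of_two().is_some());

    // But if it lands on the lagging core and the lag exceeds the elapsed
    // ticks, end < start, and the wrapped delta has no representable next
    // power of two -- the panic described above.
    let end_warped = (1_000_000 + elapsed_ticks) - core_0_lag;
    assert!(end_warped < start);
    assert_eq!(end_warped.wrapping_sub(start).checked_next_power_of_two(), None);
}
```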