mpusz / mp-units

The quantities and units library for C++
https://mpusz.github.io/mp-units/
MIT License

Bug: overflow in value_cast #580

Open burnpanck opened 4 months ago

burnpanck commented 4 months ago

I was now able to find a test case highlighting the overflow I suspected in #579. I am keeping this as a separate issue/PR so that #579 can be merged before we fix the issue highlighted here. After all, it appears to be a fairly uncommon issue, given that you need a value whose representation times the numerator of a rational conversion factor overflows std::intmax_t.
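For illustration, here is a minimal standalone sketch of that failure mode with plain integers (not mp-units code; the factor 5/18 corresponds to a km/h → m/s conversion):

```cpp
#include <cstdint>

int main()
{
  // Rational conversion factor num/den, e.g. 5/18 for km/h -> m/s:
  constexpr std::intmax_t num = 5;
  constexpr std::intmax_t den = 18;

  // A value that is representable, and whose converted result (~5.6e17) would
  // also be representable, but whose intermediate product val * num is not:
  constexpr std::intmax_t val = 2'000'000'000'000'000'000;  // 2e18 < intmax_t max (~9.2e18)

  // The exact conversion (val * num) / den overflows in the intermediate
  // product (1e19 > intmax_t max), which is undefined behaviour for signed
  // integers, even though the final result would fit comfortably:
  // constexpr std::intmax_t converted = val * num / den;
  (void)val; (void)num; (void)den;
}
```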

mpusz commented 4 months ago

#579 is merged. You can rebase this one on top of master.

Do you have an idea of how to improve value_cast here? I don't think that is the case, but maybe the duration_cast specification will help https://wg21.link/time.duration.cast?

burnpanck commented 4 months ago

Let me have a look at the duration_cast specification.

One potential approach I see is to always use a double multiplication. The multiplication factor always remains a compile-time constant, so the compiler is free to replace it with suitable integer arithmetic that gives the same result. Thus, we would basically be pushing the issue to the compiler, which I think is a good thing here. Of course, one should verify on Compiler Explorer that this is indeed what happens, some of which you may have already done, if I read your comment in the other thread correctly.

However, one thing we lose if we always select the double path is that we may introduce new potential for round-off error. In particular, this very same test would then fail for the 64-bit types not due to overflow, but due to rounding error (and I have already hand-crafted a corner case here to highlight that issue). This is only relevant for conversions involving more than 53 significant bits, which seems rather uncommon. Furthermore, all current conversions involving irrational factors are in any case only accurate to 53 bits. So I think it may be acceptable to simply specify that conversion factors do not guarantee more than 53 bits of accuracy.
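A rough sketch of that double path (hypothetical helper, not the library's implementation), showing both a case the compiler can keep exact and the 53-bit precision cliff mentioned above:

```cpp
#include <cstdint>

// Hypothetical "always use a double multiplication" conversion path: the
// rational factor is folded into a single compile-time double constant.
template<std::intmax_t Num, std::intmax_t Den>
constexpr std::int64_t convert_via_double(std::int64_t val)
{
  constexpr double factor = static_cast<double>(Num) / static_cast<double>(Den);
  return static_cast<std::int64_t>(static_cast<double>(val) * factor);
}

static_assert(convert_via_double<3, 4>(1000) == 750);  // exact: 3/4 is a dyadic rational

// Inputs needing more than 53 significant bits are silently rounded, even for
// a factor of exactly one: 2^53 + 1 is not representable as a double, so the
// round-trip lands on a neighbouring value.
static_assert(convert_via_double<1, 1>((std::int64_t{1} << 53) + 1) !=
              (std::int64_t{1} << 53) + 1);
```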

mpusz commented 3 months ago

I am not sure if that is a good idea. First, there are rounding issues, as you mentioned already. Second, the use of a floating-point type may be a no-go on some platforms. Isn't it an issue for your embedded projects?

mpusz commented 2 months ago

Hi @burnpanck! Do you still plan to work on this issue?

You also planned to refactor value_cast for quantity points so it does all operations at once, right? So far, we do not have an issue or PR to track it.

burnpanck commented 2 months ago

Sorry, during a rebasing attempt the PR got accidentally closed.

There is a fundamental tension between accuracy/rounding and performance, and we need to compromise one for the other. Fundamentally, to perform the conversion accurately ((val * num) / denom) and without overflow, you need to be able to hold val * num. We know the magnitude of num, but only the range of val. As long as std::numeric_limits<decltype(val)>::max() * num <= std::numeric_limits<std::intmax_t>::max() holds (and correspondingly for ::min()), there is always a representation available that is guaranteed not to overflow. std::intmax_t may not be efficient on the target hardware, but at least the compiler/library provides the best possible implementation for that hardware. In fact, the current implementation matches the specification of std::chrono::duration_cast, which always selects the maximum-width type for the internal representation. This may be suboptimal for smaller input/output representations if the hardware is unable to perform maximum-width computations efficiently, but an optimising compiler may internally narrow the representation when the value ranges are known.
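For reference, the general case of the duration_cast recipe referenced above boils down to the following (a paraphrased sketch, not the exact wording of [time.duration.cast]):

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>
#include <type_traits>

// Paraphrased sketch of the general case of std::chrono::duration_cast: all
// operands are widened to CR, the common type of both representations and
// std::intmax_t, before the multiplication and division are performed.
template<typename ToDuration, typename Rep, typename Period>
constexpr ToDuration duration_cast_sketch(const std::chrono::duration<Rep, Period>& d)
{
  using CF = std::ratio_divide<Period, typename ToDuration::period>;
  using CR = std::common_type_t<typename ToDuration::rep, Rep, std::intmax_t>;
  return ToDuration(static_cast<typename ToDuration::rep>(
      static_cast<CR>(d.count()) * static_cast<CR>(CF::num) / static_cast<CR>(CF::den)));
}
// Even here, static_cast<CR>(d.count()) * CF::num overflows once d.count() is
// already close to the limits of std::intmax_t, which is exactly the failure
// mode of this issue.
```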

But when either num gets very large, or val is itself of maximum width, selecting the maximum-width type for the intermediate representation is not enough and leads to overflow. This is what we see here, and it is also what std::chrono::duration_cast would do. We could implement an extended-precision multiplication, though that would add roughly 50 lines of code (wild guess). Do you think that would be valuable here, given that std::chrono::duration_cast does not implement it?
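To give an idea of what such an extended-precision multiplication involves, here is a minimal unsigned-only sketch relying on the GCC/Clang __int128 extension; a portable variant would split the operands into 32-bit halves and do the schoolbook multiplication and long division by hand, which is roughly where the ~50-line estimate comes from:

```cpp
#include <cstdint>

// Extended-precision (val * num) / den for unsigned 64-bit operands, using the
// non-standard unsigned __int128 type available on GCC and Clang. The 128-bit
// intermediate product cannot overflow; the caller must still ensure that the
// final quotient fits into 64 bits.
constexpr std::uint64_t mul_div_u64(std::uint64_t val, std::uint64_t num, std::uint64_t den)
{
  const unsigned __int128 wide = static_cast<unsigned __int128>(val) * num;
  return static_cast<std::uint64_t>(wide / den);
}

// 2e18 * 5 = 1e19 no longer fits in std::intmax_t, but the final result does:
static_assert(mul_div_u64(2'000'000'000'000'000'000ULL, 5, 18) == 555'555'555'555'555'555ULL);
```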

mpusz commented 2 months ago

It is hard to say. We probably should not produce more assembly instructions than "naive" code using fundamental types would. Otherwise, people will complain and not use the library. Also, we always say that the library will never be perfect and will not prevent all the cases where an overflow may happen. If someone wants to avoid that, then a custom "safe" representation type should be used that, for example, throws on overflow.
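For illustration, a minimal sketch of such a "safe" representation type (hypothetical, not part of mp-units, and using the GCC/Clang overflow builtins; a real representation type would also need the remaining operators and traits the library requires):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical representation type that throws instead of silently wrapping.
struct checked_int64 {
  std::int64_t value{};

  friend checked_int64 operator*(checked_int64 lhs, checked_int64 rhs)
  {
    std::int64_t result;
    if (__builtin_mul_overflow(lhs.value, rhs.value, &result))  // GCC/Clang builtin
      throw std::overflow_error{"checked_int64: multiplication overflowed"};
    return {result};
  }

  friend checked_int64 operator/(checked_int64 lhs, checked_int64 rhs)
  {
    if (rhs.value == 0) throw std::domain_error{"checked_int64: division by zero"};
    return {lhs.value / rhs.value};
  }
};
```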

burnpanck commented 2 months ago

I created a somewhat more extended discussion here: #599. Whatever the outcome of that discussion, I believe the current implementation is sub-optimal for integral types, in that the choice between the floating-point path (val * double{ratio}) and the integral path (val * intmax_t{num} / intmax_t{denom}) depends on whether the converting representations are "relatively irrational" to each other, which IMHO is irrelevant. In particular, "relatively irrational" is a sufficient condition for being unable to guarantee an exact conversion, but so is any nonzero denominator. In either case, we are ultimately asking for a constant fixed-point multiplication, where we potentially need as many fractional bits as the width of the input, $n_i$, and as many integral bits as the width of the magnitude of the conversion factor, $n_f = \lfloor \log_2 \text{ratio} \rfloor$, which is bounded by the width of the output $n_o$ (otherwise all inputs overflow). If we approximate the conversion factor to that many bits, we achieve the full accuracy possible with the given input and output representations. For specific conversion factors we may need fewer bits (because the approximation produces zero bits). Finding the best implementation of that multiplication is ultimately the task of the compiler, but we need to choose a way to describe that multiplication to the compiler:

I'm inclined to suggest an implementation based on direct fixed-point multiplication, which gives the most accurate results, with good confidence of reasonable assembly even without optimisation. For 32-bit input and output representations, this can be written very portably using a single 64-bit multiplication and a bit shift. I believe there is no downside to this. I further believe that we should spend the 4x multiplication effort for 64-bit input types, because if we don't, we basically cannot guarantee anything for general conversion factors. The only exception I would implement is sticking to a pure multiplication or division if the conversion factor has a numerator or a denominator equal to one. What do you think?
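A minimal sketch of that 32-bit fixed-point path (hypothetical helper, unsigned only, assuming a conversion factor below one and round-to-nearest semantics; a truncating variant would round the factor differently):

```cpp
#include <cstdint>

// Constant fixed-point multiplication for 32-bit representations: the factor
// Num/Den (assumed < 1) is approximated with 32 fractional bits, so the
// runtime cost is a single 64-bit multiplication plus a shift.
template<std::uint64_t Num, std::uint64_t Den>
constexpr std::uint32_t convert_fixed_point(std::uint32_t val)
{
  constexpr std::uint64_t one = std::uint64_t{1} << 32;
  constexpr std::uint64_t factor = (Num * one + Den / 2) / Den;            // factor rounded to nearest
  const std::uint64_t product = std::uint64_t{val} * factor + (one >> 1);  // + 0.5 before truncation
  return static_cast<std::uint32_t>(product >> 32);
}

// km/h -> m/s uses the factor 5/18; exact multiples convert exactly over the
// whole 32-bit range, everything else rounds to the nearest integer.
static_assert(convert_fixed_point<5, 18>(3'600'000'000u) == 1'000'000'000u);
```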

burnpanck commented 2 months ago

I experimented with the above three implementation strategies for constant multiplication; see Godbolt. In the following, I list assembly instruction counts for a couple of platforms, all compiled with GCC 14.1 (though it looks similar on Clang), tested for various integer widths. Note that both the ratio implementation as well as the fixed-point implementation make use of double-width intermediate types to guarantee full accuracy over the full value range. The existing implementation should be very close to the 32-bit signed ratio implementation, because that one uses 64-bit internal multiplications just like the existing, potentially overflowing implementation does. The gist of it:

Thus, to summarise, the proposed fully accurate fixed-point implementation without overflow beats the existing implementation almost everywhere (sometimes very significantly), even though the existing one can overflow for 64-bit inputs. The exception seems to be 32-bit on x86-64, where the proposed implementation is twice as large for 64-bit inputs (but, in contrast to the previous implementation, guarantees correctness). On the other hand, for 32-bit integers it beats the existing implementation by a factor of 4.

Based on those results, I strongly suggest we just guarantee correctness for all representable values, and use the fixed-point implementation whenever the conversion is not a pure integer multiplication or division.

mpusz commented 2 months ago

What you write is really interesting, but I am unable to give any meaningful feedback as I am on vacation until mid-August. I will dig into it a bit more if I find some time in the next two weeks, or I will get back to you after my vacation.

burnpanck commented 2 months ago

Don't worry, enjoy your vacation! I will attempt to complete #598 with full correctness guarantees under the assumption that the quantity rescaling provides the same guarantees, but of course it would "work" even with a different resolution here - it will just become even more difficult to figure out when overflow will happen.

mpusz commented 2 weeks ago

Hi @burnpanck! Do you plan to work on those in the near future? I plan to release mp-units 2.3 soon.

burnpanck commented 2 weeks ago

I generally do, but you know how it is :-). I will be pretty busy the coming week and won't have time the weekend after, so it's either going to be this weekend or two weeks from now; otherwise we'll get close to October.

burnpanck commented 2 weeks ago

I just found github-action-benchmark, which runs benchmarks as part of CI and can push graphs of the performance trend to GitHub Pages. It also appears to support Catch2 out of the box, so it may be a valuable tool to keep an eye on the effects of changes to core functionality such as what we're touching here.

mpusz commented 2 weeks ago

It looks really interesting indeed.