r-quantities / units

Measurement units for R
https://r-quantities.github.io/units
175 stars 28 forks source link

vec_cast.units.units() can fail if one underlying vector is integer #324

Closed hughjonesd closed 1 year ago

hughjonesd commented 2 years ago

Example:

x <- set_units(1:10, cm)
br <- set_units(2:4, `in`)
vctrs::vec_cast_common(x, br)
Error:
! Can't convert from `..2` <double> to <integer> due to loss of precision.
• Locations: 1, 2, 3

Similarly:

x_m <- set_units(1:10, m)
br_ft <- set_units(2:6, ft)
vctrs::vec_cast_common(x_m, br_ft)
# fails with the same error message

By contrast:

x <- set_units(1:10, cm)
br_ft2 <- set_units(2:4/12, ft)
vctrs::vec_cast_common(x, br_ft2)

works. So does:

> x_in <- set_units(1:10, `in`)
> vctrs::vec_cast_common(x_in, br)

and:

x_dbl <- set_units(as.double(1:10), cm)
vctrs::vec_cast_common(x_dbl, br)

Lastly, you avoid the bug if the double happens to evaluate to integer values:

x <- set_units(1:10, cm)
br2 <- set_units(2:4 / 2.54, `in`)
vctrs::vec_cast_common(x, br2)
# works

The issue seems to be that units::vec_cast.units.units() recalls vec_cast with the underlying bare data. But if one of these is integer, and the other is double and is not integer-valued (e.g. because it's been converted), that fails.

This may make sense for reasons that I don't understand. It was surprising to me, though.

I think the issue is that vctrs correctly wants to fail if it has to cast a non-integer-valued double to an integer. But in the units case, we can be sure that the integer values are "really" doubles, i.e. they are on the real number line (... right?) If so, maybe a fix would be to add to_bare <- as.Double(to_bare) in vec_cast.units.units().

Package versions: units 0.8.0, vctrs 0.4.1, R 4.2.0.

Enchufa2 commented 2 years ago

Thanks for the analysis. The question here is why are we recalling the method on the bare data. Is there any corner case I don't see in which this is required, @lionel- ?

https://github.com/r-quantities/units/blob/07b2606f0bfdaec526bb1016dcfb85aa997c4a18/R/tidyverse.R#L72

Can't the method finish 4 lines before, when the conversion was already made?

lionel- commented 2 years ago

yup it could. I thought it'd be a good thing to benefit from vctrs casting rules for bare types but maybe not?

Enchufa2 commented 2 years ago

For units, we care about the mode (numeric) but not about the storage mode. From what I can see, ptype2 is about the mode, and cast takes into account the storage mode. Is that right? So, unless there's some benefit I don't see, I'll just drop that part and adjust the tests accordingly.

lionel- commented 2 years ago

hmm... Actually it's important for vctrs that after a cast from x to y, the cast result has the same storage mode as y.

Alternatively, you could provide a proxy method for your class that always uses the same storage mode, that would work too.

Enchufa2 commented 2 years ago

Ok. The proxy thing is too much trouble I think. How about forcing the storage mode of the destination? See https://github.com/r-quantities/units/commit/85c9785c68459430bac3e8d342276bfacce85462

hughjonesd commented 2 years ago

Can confirm this fixes my underlying bug.

lionel- commented 2 years ago

If you have an identity proxy, the storage mode of the return value of a cast method must match the storage mode of to. So if to is not stored as a double, this cast method will cause internal errors in some cases.

Enchufa2 commented 2 years ago

Can you think of an example of such errors, so that we can use it as a test case?

lionel- commented 2 years ago

If x has integer storage and y has double storage, and the cast method doesn't normalise that, and there is an identity proxy, I believe vec_c(x, y) should fail.

Enchufa2 commented 2 years ago

It works fine with the commit above:

x = units::set_units(1:3, "cm")
y = units::set_units(4.0, "m")
storage.mode(x)
#> [1] "integer"
storage.mode(y)
#> [1] "double"
vctrs::vec_c(x, y)
#> Units: [cm]
#> [1]   1   2   3 400
vctrs::vec_c(y, x)
#> Units: [m]
#> [1] 4.00 0.01 0.02 0.03
lionel- commented 2 years ago

oh yeah, because the common type of int + dbl == dbl, and your cast method always converts to dbl.

But here is an example where it breaks:

vctrs::vec_assign(x, 2, y)
#> Error in `vctrs::vec_assign()`:
#> ! `proxy` of type `integer` incompatible with `value` proxy of type `double`.
#> ℹ In file slice-assign.c at line 156.
#> ℹ This is an internal error in the vctrs package, please report it to the package authors.
#>     ▆
#>  1. ├─vctrs::vec_assign(x, 2, y)
#>  2. └─rlang:::stop_internal_c_lib(...)
#>  3.   └─rlang::abort(message, call = call, .internal = TRUE, .frame = frame) at rlang/R/utils.R:212:2

In general, not respecting the storage mode in a vctrs cast method is UB.

hughjonesd commented 1 year ago

I take it back; my bug is still there using units 0.8.1 (and vctrs 0.6.2).

x <- set_units(1:10, cm)
br <- set_units(2:4, `in`)
vctrs::vec_cast_common(x, br)
Error:
! Can't convert from `..2` <double> to <integer> due to loss of precision.
• Locations: 1, 2, 3
Enchufa2 commented 1 year ago

Yes, because I never merged the commit above per Lionel's last comment, and I didn't have the time to further look into this, apologies.

hughjonesd commented 1 year ago

So is there a reason not to just convert manually to double? My thoughts, FWIW:

Enchufa2 commented 1 year ago

You mean at the entry point? Looking at the udunits2 interface, it works with floats only, so I suppose there's no point in supporting integers

Enchufa2 commented 1 year ago

Could you please test the commit above?

Enchufa2 commented 1 year ago

Unfortunately, this breaks a couple of (not robust enough) tests in {quantities}, so I'll have to run revdep checks.

hughjonesd commented 1 year ago

This seems to work. It fixes the bug reported here, and also the underlying bug from hughjonesd/santoku#40:

library(santoku)
x <- set_units(1:10, cm)
br <- set_units(2:4, `in`)
chop(x, br)
 [1] [ 1.00 [cm],  5.08 [cm]) [ 1.00 [cm],  5.08 [cm]) ...
Levels: [ 1.00 [cm],  5.08 [cm]) [ 5.08 [cm],  7.62 [cm]) [ 7.62 [cm], 10.16 [cm]]

When I first compiled the fix, in the same session, the above code gave me:

x <- set_units(1:10, cm)
br <- set_units(2:4, `in`)
chop(x, br)
Error in .Call("_units_ud_compare", PACKAGE = "units", x, y, xn, yn) : 
  "_units_ud_compare" not available for .Call() for package "units"

but I think that was just a temporary thing.

hughjonesd commented 1 year ago

I uncovered one other error while testing this:

x1 <- set_units(0, "degree_Celsius")
x2 <- set_units(3, "degree_Celsius")
x1 < x2

Error in ud_compare(e1, e2, as.character(units(e1)), as.character(units(e2))) : 
  ut_scale(): NULL factor argument
Enchufa2 commented 1 year ago

Interesting. New issue #346 set to track this.

hughjonesd commented 1 year ago

Thank you!

bart1 commented 1 year ago

@Enchufa2 I just encountered this issue as I was manually creating units (sometimes integers) that were converted to doubles. From what I gather here int units are not supported? That would be a pity as it would save a lot of memory in some cases and could make sense for for example counts.

Enchufa2 commented 1 year ago

From what I gather here int units are not supported?

Correct. Unfortunately this was causing trouble, so since v0.8-2, storage mode is force to double on entry point.

That would be a pity as it would save a lot of memory in some cases and could make sense for for example counts.

On the one hand, if you have a big(ish) data application in which you are having memory issues, then probably computing with units is too expensive too, so you should resort to specialized tools.

On the other hand, there are only a few count units, and even counts may require doubles eventually:

> set_units(set_units(1, bit), byte)
0.125 [byte]
bart1 commented 1 year ago

Thanks for your clarification @Enchufa2 I will just resort to always use doubles

edzer commented 1 year ago

For counts of things the package is not of much use, as they are treated as unitless and add to anything else unitless:

library(units)
# udunits database from /usr/share/xml/udunits/udunits2.xml
set_units(3, bytes) + set_units(4, rad)
# 3.5 [bytes]

You'd like to have a package that distinguishes things, e.g. using an ontology or nomenclature or taxonomy, and that can aggregate to superclasses, e.g. 3 apples + 4 oranges = 7 pieces of fruit, or 3 European herring gulls + 2 black-headed gulls = 5 gulls, and 5 gulls + 2 chaffinches = 7 birds.