Support for NA_units_ - Githubissues

billdenney commented 6 years ago

I just had a bug in my code, but it was hard to track down due to an unclear error.

What I just did was, in essence, the following:

> set_units(NA_real_, NA_character_, mode="standard")
Error: is.language(x) is not TRUE

I thought that I was getting the error because I wasn't using mode="standard", but it really was due to the NA_character_.

I have two thoughts:

If both the x and value options to set_units are NA, could NA be returned? (That could easily be a bad idea for several reasons, and I don't have a strong opinion if the answer is a resounding no.)
Could the Error: is.language(x) is not TRUE error be clarified? I thought I was making it clear that I wasn't sending in a language object when I gave NA_character_, so the error didn't make sense to me. (Admittedly, I wasn't intentionally sending in NA_character_.)

A simple fix for option 2 above could possibly be changing the stopifnot(is.character(x), length(x) == 1) at the top of as_units.character to stopifnot(is.character(x), length(x) == 1, is.na(x)).

billdenney commented 6 years ago

As I was fuzzing the interface after the above error, I found that this error could also be clarified:

set_units(NA_real_, NA, mode="standard")
Error in if (any(unclass(value) != 1)) warning(paste("numeric value",  : 
  missing value where TRUE/FALSE needed

Enchufa2 commented 6 years ago

IMO, a missing value should be valid; a missing unit... has no meaning. So I would like to get an error, but definitively a more informative one. NAs and NULLs are tricky when it comes to if statements. Thanks for spotting it!

edzer commented 6 years ago

(sorry, committed this in a branch; will soon be merged)

billdenney commented 6 years ago

As I'm working with more real data, I'm finding that this is a common issue for me. I will relatively often have a vector where there are instances of combined NA as both the magnitude and the units.

What do you think about defining a new constant, NA_units_ (similar to what is done in lubridate for dates).

In effect, NA_units_ would allow for NA as the value for the unit part while normal units would not allow that. It would require that the magnitude part is NA. All mathematical operations (+, -, *, /, and the % relatives) with NA_units_ would return NA_units_.

Enchufa2 commented 6 years ago

Could you provide an example? Are there any instances of a non-NA number with a NA unit? Because, otherwise, we could simply support NA in Ops.units by returning NA if any operand is NA.

billdenney commented 6 years ago

What I'm working with right now is clinical laboratory data in a clinical study. When a test isn't done for some reason (for example if the patient didn't come in for a scheduled visit), then the value and units for the test are explicitly missing (like d_missing below).

Immediately, I'm just trying to setup the data, and it looks like the following:

d_no_missing <-
  data.frame(Patient_ID=1:3,
             Lab_Name=c("Glucose", "Cholesterol", "Cholesterol"),
             Lab_Value=c(5, 100, 200),
             Lab_Units=c("mmol/L", "mg/dL", "mg/dL"),
             stringsAsFactors=FALSE)
d_missing <- d_no_missing
d_missing[2, c("Lab_Value", "Lab_Units")] <- NA

d_no_missing$Value_with_Units <-
  mixed_units(d_no_missing$Lab_Value,
              d_no_missing$Lab_Units,
              mode="standard")
d_missing$Value_with_Units <-
  mixed_units(d_missing$Lab_Value,
              d_missing$Lab_Units,
              mode="standard")

Enchufa2 commented 6 years ago

I see. Then one path would be to enable missing units. It requires some rework of the functions that manage the units attribute, but everything else would work seamlessly. Another path would be to return plain NAs and make units work with them. Let's see what @edzer thinks.

edzer commented 6 years ago

Great thought. I see the convenience for your use case, but I'm not convinced we should require for NA_units_ units that the magnitude is only allowed to be NA. I see numeric values in units computations as actual values with implicitly missing units. If we allow for NA_units_,

> set_units(1, km/h) * 2
2 [km/h]

should actually return 2 [NA].

Enchufa2 commented 6 years ago

I'm not convinced we should require for NA_units_ units that the magnitude is only allowed to be NA.

Me neither. If NA_units_ is allowed, we should go for the whole pack, and allow this to be just another (missing) unit. But

set_units(1, km/h) * 2 should actually return 2 [NA].

I'm not convinced about this. It is quite a change. Where is the boundary between unitless and missing units? It is not clear to me. I wouldn't change the behaviour based on a fuzzy boundary just to support a corner case. Moreover, unitless is a units concept, but missing is an R concept; so unitless is more general for this topic. Therefore, I would default to unitless (current behaviour) unless NA is explicitly specified.

In other words, for me, 2 (isolated, by default) has no units. Unless I have more information; unless I know that there is some unit, but I don't know which one it is, so I explicitly should have to specify set_units(2, NA).

edzer commented 6 years ago

I agree that would be quite a change, and I was thinking about making this optional. This

> set_units(1, km/h) * 1
1 [km/h]
> set_units(1, km/h) + 1
Error in Ops.units(set_units(1, km/h), 1) : 
  both operands of the expression should be "units" objects

however annoys me. unitless is a unit; no unit implies a missing unit. Having the option to catch (and break on) the case where plain numbers are elevated to unitless would be good. See here.

Enchufa2 commented 6 years ago

unitless is a programming construct that is needed to deal with a programming issue: the case of losing units as a result of operations with units that cancel out. So it is not a unit, conceptually; it literally means no units.

Missing implies different things. Missing means that you know, actually, that there is some unit associated with that data, but it is not known or went missing for some reason. It is quite different to me.

Enchufa2 commented 6 years ago

Oh, and I think that the error from set_units(1, km/h) + 1 should be the same as what you get from set_units(1, km/h) + set_units(1, 1), i.e.: Error: cannot convert 1 into km/h.

billdenney commented 6 years ago

The conversation morphed a bit. I think that there are a few different items here:

Values that have no units (like the number 2 as the example, 2 * 1 km = 2 km).
Values with an unknown magnitude (of NA_real_) and known units, like NA km.
Values with an unknown magnitude and unknown units.

The first is supported in the current library as is, and it should not change in my opinion.

The second is also already supported in the current library, too, and in my opinion, it should not change.

The third is not supported in the current library, and it supports several operations— mainly around the combination of mixed units. The example I gave above is my typical use case. Another example use case is c() with a mixed_units object and NA_real_ (or probably any NA).

I don’t think that it’s sensible to have a non-missing magnitude and an NA unit. For example, set_units(2, NA, mode=“standard”) seems like it should be an error, as it currently is. (There could be a use case that I’m not thinking of, but to me, that is what numeric classes are for.)

Enchufa2 commented 6 years ago

Automatically closed by 8c6bda8, which doesn't allow missing units, so I'm reopening this.

paulponcet commented 5 years ago

I would like to add the following use case to support the implementation of something like NA_unit_. If I read data in a table in a database, the table may (unfortunately, but that is real life) not be well documented, hence I may know the physical meaning of each column and not know the effective unit of each column. In that case, I would like to specify a missing unit for these columns, while waiting for more information from the database holder.

paulponcet commented 4 years ago

Dear package maintainers, do you see any chance that this proposal be accepted sooner or later? I would see this notion of missing unit NA_unit as a valuable feature for this package, and it seems to me that the implementation would not be too difficult.

In my view it should first be clear that:

"unitless" and "missing unit" are two different things, as highlighted in some comments above.
a vector with a missing unit can actually be unitless.

Now, say we have a vector x (unitless, or with a non-missing unit, or with a missing unit) and a vector y with missing unit. Then:

x * y, y * x, x/y, y/x, x^y, y^x should have a missing unit;
x + y and x - y should not be allowed, since without a known unit we don't know whether x and y have the same physical dimension.

Enchufa2 commented 4 years ago

This is still open for two main reasons:

The implementation is not so straightforward because udunits doesn't support missing units, so they need to be handled as a separate case in R, before pushing stuff to udunits. So it requires some care and additional testing to avoid breaking current behaviour.
Re-reading the thread, I'd say we didn't reach consensus.

However, it's easier to gather consensus around something that works, so we could use a PR to refine the proposal. :)

billdenney commented 3 years ago

I'm coming back to this issue as I'm working to develop some additional unit conversion tools for laboratory data. I ahve a thought about implementation. What if we define a unit during package load called "NAunits" as a new base unit:

While that would not require a specific magnitude, it would be a simple way to accomplish the goal of a unit value that is NA. Would that be an acceptable method? The downside I see to the above is that it would be confusing because there would be a unit named "NAunits" and a constant defined named "NAunits", but it would still work as expected, as far as I can see:

The next question would be: How should we handle calls to is.na()? Currently, only the magnitude part appears to be examined. Should we define is.na.units() to look for the units to be NA_units_ in addition to an NA magnitude? I'd suggest that we simply keep using the magnitude check and ignore the units part for is.na().

library(units)
#> udunits database from C:/Users/Bill Denney/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml
install_unit(symbol="NA_units_")
NA_units_ <- set_units(NA_real_, "NA_units_", mode="standard")

set_units(1, "NA_units_")
#> 1 [NA_units_]
set_units(2, NA_units_)
#> 2 [NA_units_]

^{Created on 2021-10-13 by the reprex package (v2.0.1)}

r-quantities / units

Support for NA_units_ #163