r-quantities / units

Measurement units for R
https://r-quantities.github.io/units
173 stars 27 forks source link

Invalid unit name with install_unit() does not give an error #289

Open billdenney opened 2 years ago

billdenney commented 2 years ago

I'm working with clinical laboratory data where a unit is defined for hemoglobin called "g%" (gram-percent) which is equal to 1 g/dL. While trying to create a new unit for it, I guessed that the unit name would not work because of the percent sign, but it didn't give an error when trying to install the unit; it only gave an error when trying to use the unit.

It would be helpful if install_unit() did a check to confirm that the unit was successfully installed before returning.

library(units)
#> udunits database from C:/Users/Bill Denney/Documents/R/win-library/4.1/units/share/udunits/udunits2.xml
install_unit(symbol="g_hemoglobin")
install_unit(symbol="mol_hemoglobin", def="100/0.006206 g_hemoglobin")
install_unit(symbol="g%_hemoglobin", def="1 g_hemoglobin/dL")
foo <- set_units(10, "g_hemoglobin/dL", mode="standard")
set_units(foo, "mmol_hemoglobin/L", mode="standard")
#> 6.206 [mmol_hemoglobin/L]
set_units(foo, "g%_hemoglobin", mode="standard")
#> Error: In 'g%_hemoglobin', 'g%_hemoglobin' is not recognized by udunits.
#> 
#> See a table of valid unit symbols and names with valid_udunits().
#> Custom user-defined units can be added with install_unit().
#> 
#> See a table of valid unit prefixes with valid_udunits_prefixes().
#> Prefixes will automatically work with any user-defined unit.

Created on 2021-10-13 by the reprex package (v2.0.1)

billdenney commented 2 years ago

To be clear, I don't think that the units library should necessarily support "g%_hemoglobin". The request is just that an error is generated.

Enchufa2 commented 2 years ago

Agree. The thing is... the udunits2 call succeeds, but then the unit is not available. I don't know what happens internally, but I suppose it has something to do with the fact that the percentage symbol is part of the grammar. So this issue should be raised upstream.

Enchufa2 commented 2 years ago

Actually, it succeeds because it is permitted and it works:

#include <stdio.h>
#include <udunits2/udunits2.h>

int main() {
    ut_set_error_message_handler((ut_error_message_handler) ut_ignore);
    ut_system *sys = ut_read_xml(NULL);
    ut_encoding enc = UT_UTF8;
    ut_set_error_message_handler((ut_error_message_handler) vprintf);

    // install g_hemoglobin
    ut_unit *g_hemoglobin = ut_new_base_unit(sys);
    ut_map_symbol_to_unit("g_hemoglobin", enc, g_hemoglobin);
    ut_map_unit_to_symbol(g_hemoglobin, "g_hemoglobin", enc);

    // install g%_hemoglobin
    ut_unit *gpc_hemoglobin = ut_parse(sys, "1 g_hemoglobin/dL", enc);
    ut_map_symbol_to_unit("g%_hemoglobin", enc, gpc_hemoglobin);
    ut_map_unit_to_symbol(gpc_hemoglobin, "g%_hemoglobin", enc);

    // conversion
    char *from = "g_hemoglobin/dL", to[128];
    ut_unit *from_u = ut_parse(sys, from, enc);
    cv_converter *cv = ut_get_converter(from_u, gpc_hemoglobin);
    ut_format(gpc_hemoglobin, to, sizeof(to), enc);
    double x = 10;
    printf("From: %f %s\n", x, from);
    printf("To  : %f %s\n", cv_convert_double(cv, x), to);

    cv_free(cv);
    ut_free(from_u);
    ut_free(g_hemoglobin);
    ut_free(gpc_hemoglobin);
    ut_free_system(sys);
    return 0;
}

Save this as test.c and then

$ gcc test.c -l udunits2 && ./a.out
From: 10.000000 g_hemoglobin/dL
To  : 10.000000 g%_hemoglobin

So this means that we are doing something wrong with that percentage symbol somewhere.

Enchufa2 commented 2 years ago

However:

#include <stdio.h>
#include <udunits2/udunits2.h>

int main() {
    ut_set_error_message_handler((ut_error_message_handler) ut_ignore);
    ut_system *sys = ut_read_xml(NULL);
    ut_encoding enc = UT_UTF8;
    ut_set_error_message_handler((ut_error_message_handler) vprintf);

    // install g_hemoglobin
    ut_unit *g_hemoglobin = ut_new_base_unit(sys);
    ut_map_symbol_to_unit("g_hemoglobin", enc, g_hemoglobin);
    ut_map_unit_to_symbol(g_hemoglobin, "g_hemoglobin", enc);

    // install g%_hemoglobin
    ut_unit *gpc_hemoglobin = ut_parse(sys, "1 g_hemoglobin/dL", enc);
    ut_map_symbol_to_unit("g%_hemoglobin", enc, gpc_hemoglobin);
    ut_map_unit_to_symbol(gpc_hemoglobin, "g%_hemoglobin", enc);

    // conversion works
    char *from = "g_hemoglobin/dL", to[128];
    ut_unit *from_u = ut_parse(sys, from, enc);
    cv_converter *cv = ut_get_converter(from_u, gpc_hemoglobin);
    ut_format(gpc_hemoglobin, to, sizeof(to), enc);
    double x = 10;
    printf("From: %f %s\n", x, from);
    printf("To  : %f %s\n", cv_convert_double(cv, x), to);

    // parsing doesn't work
    ut_unit *u = ut_parse(sys, to, enc);
    if (!u) printf("NULL unit!\n");

    cv_free(cv);
    ut_free(from_u);
    ut_free(g_hemoglobin);
    ut_free(gpc_hemoglobin);
    ut_free_system(sys);
    return 0;
}
$ gcc test.c -l udunits2 && ./a.out
From: 10.000000 g_hemoglobin/dL
To  : 10.000000 g%_hemoglobin
NULL unit!
billdenney commented 2 years ago

Thanks for reporting it upstream!

billdenney commented 2 years ago

Since there doesn't appear to be movement upstream, would it be reasonable to add some form of test here to see if it works? Perhaps within install_units() add a test like set_units(1, new_unit_name), see if it succeeds, and if not raise an informative error to the user?

Enchufa2 commented 2 years ago

The thing is that, as shown in the reprex, installation and even conversion works. It's parsing after the unit was installed what has a bug and doesn't work, and it's only for these cases with a percentage in the name. So not a fan of adding this test just for a bug in an edge case which in turn succeeded (and the database of units is already "polluted" with an unparseable identifier).

We will just not allow percentage characters in symbols/names until this issue is properly solved upstream (implemented in c9e958fe32ea6e15338925f16bd24f1f9cd8fb99):

units::install_unit(symbol="g%_hemoglobin", def="1 g_hemoglobin/dL")
#> Error in units::install_unit(symbol = "g%_hemoglobin", def = "1 g_hemoglobin/dL"): 
#>   Symbols/names with a percentage character '%' are not allowed (see https://github.com/r-quantities/units/issues/289)
billdenney commented 2 years ago

Unfortunately, the UDUNITS2 grammar is more complex. As I read it, it allows a percent sign by itself but not a percent sign as part of a longer units string (https://www.unidata.ucar.edu/software/udunits/udunits-2.1.24/udunits2lib.html#Grammar). So, this would break the percent ("%") unit.

I'm guessing that it does not make sense for you to do a full grammar check before installing the unit. So, this may be a better thing to wait until fixed upstream.

Enchufa2 commented 2 years ago

No, this doesn't break anything. We are just disabling any new user-provided symbol or name with a percentage sign in it. The existing percent unit is safely loaded from the system XML.