Warning during unit conversions

ehwenk commented 1 year ago

This ongoing warning when building AusTraits is continuing to baffle us:

Warning in View :
  There was 1 warning in `dplyr::mutate()`.
ℹ In argument: `value = ifelse(...)`.
ℹ In group 2: `ucn = "a-mo"`, `to_convert = TRUE`.
Caused by warning in `unit_conversion_functions[[name]]()`:
! NAs introduced by coercion

@yangsophieee and I assumed it was because of measurements where value_type = bin that required unit conversions. But that is now solved and the error persists.

These lines of code in process.R cause the problem:

 data %>%
    dplyr::group_by(.data$ucn, .data$to_convert) %>%
    dplyr::mutate(
      value = ifelse(.data$to_convert & !.data$value_type %in% c("bin", "range") & !is.na(.data$value), 
                     f(.data$value, .data$ucn[1]), .data$value), #exclude bins and ranges, or get a warning
      split_values_1 = ifelse(.data$to_convert & .data$value_type %in% c("bin", "range") & !is.na(.data$split_values_1), 
                              f(.data$split_values_1, .data$ucn[1]), .data$split_values_1),
      split_values_2 = ifelse(.data$to_convert & .data$value_type %in% c("bin", "range") & !is.na(.data$split_values_2), 
                              f(.data$split_values_2, .data$ucn[1]), .data$split_values_2),
      unit = ifelse(.data$to_convert, .data$to, .data$unit),
      value = ifelse(!is.na(.data$split_values_1) & !is.na(.data$split_values_2), paste0(.data$split_values_1, "--", .data$split_values_2), .data$value),
      value = ifelse(!is.na(.data$split_values_1) & is.na(.data$split_values_2), paste0(.data$split_values_1, "--"), .data$value)
    ) %>%
    dplyr::ungroup()  %>%
    dplyr::select(dplyr::any_of(vars))

I've looked in detail at building "vanderMoezel_1987", because it requires complex combinations of conversions. The trait "fire_time_from_fire_to_flowering" is input in years (a) and needs to be converted to months (mo). There are a mix of value_type = bin and value_type = mean. This dataset also has the trait reproductive_maturity, also with a mix of value_type = bin and value_type = mean, but in this case the input and output units are years (a).

I assumed there was a special character (e.g. <) causing the NA's or some problem with data from the wrong value_type being converted by the wrong line of code, but using browser and looking at the output after each line of code and filtering to different subsections of data, there are:

1) no problems with the output, ever; no unwanted NA's in the value or units column; all values convert as expected 2) for this dataset it is the first line value = ifelse(.data$to_convert ... that creates the problem.

ehwenk commented 1 year ago

If you filter to a single value_type (either mean or bin) before mutating the warning goes away - even though any variant of filtering to a single value_type within the ifelse statement has no effect. OR If you comment out the offending line, the problem goes away. I've tried changing:

value = ifelse(.data$to_convert & !.data$value_type %in% c("bin", "range") & !is.na(.data$value), 
                     f(.data$value, .data$ucn[1]), .data$value)

to

value = ifelse(.data$to_convert & .data$value_type %in% c("mean") & !is.na(.data$value), 
                     f(.data$value, .data$ucn[1]), .data$value)

and get the same error.

Playing with the metadata.yml file for the study:

it is a single row of data or a single trait that causes the error
if no unit conversions are required the error, not surprisingly, vanishes
I've tried retyping the single value (a 1) many times; there are no funny characters if you look at the data.csv file in a text editor.

ehwenk commented 1 year ago

@yangsophieee @dfalster I really need another set of eyes here!

I've isolated it to a single offending row of data in vanderMoezel_1987. No other dataset in AusTraits gives this error - but probably no other dataset has this combination of unit conversions with bins and reading in value_type from a column.

Diuris longifolia,R.Br.,Northern Sandplain shrublands,Laterite,Sprout,1,mean,nnnnnnynnnnn,,

If I change 1,mean to 1--1, bin there is no warning; if I simply delete the values there is no warning. I just don't know if the problem is the conversion from years to months or some interaction of reading in value_type from a column with different value types for different rows of data within the same trait.

dfalster commented 1 year ago

For future, if you have a warning that you need to track down, you can run

options(warn=2)            # converts warning into an error
options(error=recover)  # trigger browser on any error/warning

You'll need to rebuild all of austraits to discover the dataset causing it.

the above couch also helps debugging the stuff above

yangsophieee commented 1 year ago

I remade AusTraits on the update-taxon-list branch and did not get any warnings. I've tried again on develop and got the warning on Pirralho_2014:

Error: (converted from warning) There was 1 warning in `dplyr::mutate()`.
i In argument: `value = ifelse(.data$to_convert, f(.data$value, .data$ucn[1]), .data$
i In group 1: `ucn = "%-mm2/mm2"`, `to_convert = TRUE`.
Caused by warning in `unit_conversion_functions[[name]]()`:
! NAs introduced by coercion

Looking into it with Dan's suggested options() code..

yangsophieee commented 1 year ago

I have a feeling this error comes from the as.character or as.numeric functions on this line as I've gotten the same warning before from these functions.

As you can see, the warning message “NAs introduced by coercion” is returned and some output values are NA (i.e. missing data or not available data). The reason for this is that some of the character strings are not properly formatted numbers and hence cannot be converted to the numeric class.

Looking at Pirralho_2014 where the warning came up, I could see some values like "79±11" that would trigger this warning.

Although I'm not sure if this explains all of the scenarios Lizzy mentioned..

I removed the '≤', '<' and '>' from vanderMoezel_1987 data and warning seems to have disappeared for it too. I also noticed that the values with '≤', etc. aren't included in the final traits table for fire_time_from_fire_to_flowering, probably because it requires a unit conversion from years to months. Whereas '≤', etc. are fine for reproductive_maturity because units don't need to be converted.

This probably means we need to add another test to dataset_test to look for special characters in numeric variables.

ehwenk commented 1 year ago

@yangsophieee That is one of my ideas too - but substitutions in vanderMoezel_1987 and custom_R_code in Pirralho_2014 (see metadata_check_custom_R_code("Pirralho_2014")) should have taken care off all the special characters. But maybe the problem is that this removes the special characters, but then they aren't properly converted to numeric. And the as.numeric() doesn't work within the unit conversions?

yangsophieee commented 1 year ago

When I ran browser() the special characters were still there during unit conversions so it could be that substitutions happen later. I will check.

ehwenk commented 1 year ago

@yangsophieee Confirming you're running off the read-in-units-from-a-column branch?

yangsophieee commented 1 year ago

Oh I wasn't, sorry! I'll try again on that branch.

yangsophieee commented 1 year ago

It seems that the special characters can't be detected during the substitutions process either, nor with custom_R_code. It's something that has to be manually edited in the data.csv file potentially...

yangsophieee commented 1 year ago

'±' is in the exceptions for disallowed characters but I think it shouldn't be allowed because it can't be detected with custom_R_code or substitutions, hence causing problems for Pirralho_2014

dfalster commented 1 year ago

Can you elaborate why we can't handle with a substitution or custom code?

Don't you think '±' is kind of handy when writing metadata? So would be good to keep?

We currently just have a generic tests for allowed characters that applies equally to all metadata and the data. We could have a stricter test for the data.

ehwenk commented 1 year ago

With the '≤' and '≥' it seems to be a universal problem (described https://community.rstudio.com/t/unicode-characters-and-are-getting-replaced-by/93419/8 as well), and I think because (for now) the substitutions are only working on finding/replacing the entire string, maybe it isn't working?

With the '±' it seems to be machine specific. On Pirralho_2014, if I run metadata_check_custom_code the '±' are vanishing as specified, but not for @yangsophieee . It does seem that we need to use the unicode codes to do the substitutions, although I suspect if we make the suggested changes to substitutions to allow either "entire string" or "word" replacements, maybe we can work through substitutions. But I think having a solution that works with custom_R_code is good enough for now - certainly with both of these studies, we want mass replacements of a pattern that is not a "word" and therefore the proposed changes to substitutions wouldn't work easily anyway.

yangsophieee commented 1 year ago

I was wrong, Lizzy figured out you can use the unicode codes to replace these in custom_R_code and substitutions so all good. But I think we do need a test to detect whether these make it into the data or are properly substituted.

dfalster commented 1 year ago

So, one approach would be to cause the build to fail at the point that we convert the data to numeric (i.e. in unit conversions). That way you can't proceed. And at that point spit out a message saying why/where it failed.

yangsophieee commented 1 year ago

@dfalster It seems that the generic tests for allowed characters don't actually look within the data, only at the colnames of the data. But I'm not sure if we want it to look in the data if we've decided to use custom_R_code/substitutions to replace the special characters.

Could another approach for testing just be that we test if there are special characters in traits after process_parse_data? We don't want the build to fail at the point we convert the data to numeric (unit conversions) because it was only throwing a warning when unit conversions were needed (sometimes they aren't and the special characters still make it through).

yangsophieee commented 1 year ago

I've added a test in dataset_test to check whether there are special characters in the data after process_parse_data. Then I noticed that this conversion in the custom_R_code for Pirralho_2014 was working within dataset_test but not within dataset_process. I realised it's because the metadata is read in with read_yaml in dataset_test but read_metadata in dataset_process. read_metadata rewrites custom_R_code to include formatting but it can't recognise ± (reads in as Â±). My question is, how come we need to preserve formatting in custom_R_code?

Secondly, even though there's now a dataset_test test for this, I think there also needs to be a process_flag_... function which flags these special character values and adds them to excluded_data. Right now the values are being added to excluded_data only when unit conversions need to happen, and the error shows up as "Value does not convert to numeric" -- which is true but could be more specific.

ehwenk commented 1 year ago

closed with 1636c31

traitecoevo / traits.build

Warning during unit conversions #65