Closed ehwenk closed 1 year ago
If you filter to a single value_type
(either mean
or bin
) before mutating the warning goes away - even though any variant of filtering to a single value_type
within the ifelse statement has no effect. OR If you comment out the offending line, the problem goes away. I've tried changing:
value = ifelse(.data$to_convert & !.data$value_type %in% c("bin", "range") & !is.na(.data$value),
f(.data$value, .data$ucn[1]), .data$value)
to
value = ifelse(.data$to_convert & .data$value_type %in% c("mean") & !is.na(.data$value),
f(.data$value, .data$ucn[1]), .data$value)
and get the same error.
Playing with the metadata.yml
file for the study:
1
) many times; there are no funny characters if you look at the data.csv file in a text editor.@yangsophieee @dfalster I really need another set of eyes here!
I've isolated it to a single offending row of data in vanderMoezel_1987
. No other dataset in AusTraits gives this error - but probably no other dataset has this combination of unit conversions with bins and reading in value_type from a column.
Diuris longifolia,R.Br.,Northern Sandplain shrublands,Laterite,Sprout,1,mean,nnnnnnynnnnn,,
If I change 1,mean
to 1--1, bin
there is no warning; if I simply delete the values there is no warning. I just don't know if the problem is the conversion from years to months or some interaction of reading in value_type from a column with different value types for different rows of data within the same trait.
For future, if you have a warning that you need to track down, you can run
options(warn=2) # converts warning into an error
options(error=recover) # trigger browser on any error/warning
You'll need to rebuild all of austraits to discover the dataset causing it.
the above couch also helps debugging the stuff above
I remade AusTraits on the update-taxon-list
branch and did not get any warnings. I've tried again on develop
and got the warning on Pirralho_2014:
Error: (converted from warning) There was 1 warning in `dplyr::mutate()`.
i In argument: `value = ifelse(.data$to_convert, f(.data$value, .data$ucn[1]), .data$
i In group 1: `ucn = "%-mm2/mm2"`, `to_convert = TRUE`.
Caused by warning in `unit_conversion_functions[[name]]()`:
! NAs introduced by coercion
Looking into it with Dan's suggested options()
code..
I have a feeling this error comes from the as.character
or as.numeric
functions on this line as I've gotten the same warning before from these functions.
Looking at Pirralho_2014 where the warning came up, I could see some values like "79±11" that would trigger this warning.
Although I'm not sure if this explains all of the scenarios Lizzy mentioned..
I removed the '≤', '<' and '>' from vanderMoezel_1987 data and warning seems to have disappeared for it too. I also noticed that the values with '≤', etc. aren't included in the final traits
table for fire_time_from_fire_to_flowering
, probably because it requires a unit conversion from years to months. Whereas '≤', etc. are fine for reproductive_maturity
because units don't need to be converted.
This probably means we need to add another test to dataset_test
to look for special characters in numeric variables.
@yangsophieee That is one of my ideas too - but substitutions in vanderMoezel_1987 and custom_R_code in Pirralho_2014 (see metadata_check_custom_R_code("Pirralho_2014")) should have taken care off all the special characters. But maybe the problem is that this removes the special characters, but then they aren't properly converted to numeric. And the as.numeric()
doesn't work within the unit conversions?
When I ran browser() the special characters were still there during unit conversions so it could be that substitutions happen later. I will check.
@yangsophieee Confirming you're running off the read-in-units-from-a-column
branch?
Oh I wasn't, sorry! I'll try again on that branch.
It seems that the special characters can't be detected during the substitutions process either, nor with custom_R_code
. It's something that has to be manually edited in the data.csv
file potentially...
'±' is in the exceptions for disallowed characters but I think it shouldn't be allowed because it can't be detected with custom_R_code
or substitutions
, hence causing problems for Pirralho_2014
Can you elaborate why we can't handle with a substitution or custom code?
Don't you think '±' is kind of handy when writing metadata? So would be good to keep?
We currently just have a generic tests for allowed characters that applies equally to all metadata and the data. We could have a stricter test for the data.
With the '≤' and '≥' it seems to be a universal problem (described https://community.rstudio.com/t/unicode-characters-and-are-getting-replaced-by/93419/8 as well), and I think because (for now) the substitutions are only working on finding/replacing the entire string, maybe it isn't working?
With the '±' it seems to be machine specific. On Pirralho_2014, if I run metadata_check_custom_code
the '±' are vanishing as specified, but not for @yangsophieee . It does seem that we need to use the unicode codes to do the substitutions, although I suspect if we make the suggested changes to substitutions to allow either "entire string" or "word" replacements, maybe we can work through substitutions. But I think having a solution that works with custom_R_code is good enough for now - certainly with both of these studies, we want mass replacements of a pattern that is not a "word" and therefore the proposed changes to substitutions wouldn't work easily anyway.
I was wrong, Lizzy figured out you can use the unicode codes to replace these in custom_R_code
and substitutions
so all good. But I think we do need a test to detect whether these make it into the data or are properly substituted.
So, one approach would be to cause the build to fail at the point that we convert the data to numeric (i.e. in unit conversions). That way you can't proceed. And at that point spit out a message saying why/where it failed.
@dfalster It seems that the generic tests for allowed characters don't actually look within the data, only at the colnames
of the data. But I'm not sure if we want it to look in the data if we've decided to use custom_R_code
/substitutions
to replace the special characters.
Could another approach for testing just be that we test if there are special characters in traits
after process_parse_data
? We don't want the build to fail at the point we convert the data to numeric (unit conversions) because it was only throwing a warning when unit conversions were needed (sometimes they aren't and the special characters still make it through).
I've added a test in dataset_test
to check whether there are special characters in the data after process_parse_data
. Then I noticed that this conversion in the custom_R_code
for Pirralho_2014 was working within dataset_test
but not within dataset_process
. I realised it's because the metadata is read in with read_yaml
in dataset_test
but read_metadata
in dataset_process
. read_metadata
rewrites custom_R_code
to include formatting but it can't recognise ±
(reads in as ±
). My question is, how come we need to preserve formatting in custom_R_code
?
Secondly, even though there's now a dataset_test
test for this, I think there also needs to be a process_flag_...
function which flags these special character values and adds them to excluded_data
. Right now the values are being added to excluded_data
only when unit conversions need to happen, and the error shows up as "Value does not convert to numeric" -- which is true but could be more specific.
This ongoing warning when building AusTraits is continuing to baffle us:
@yangsophieee and I assumed it was because of measurements where
value_type = bin
that required unit conversions. But that is now solved and the error persists.These lines of code in process.R cause the problem:
I've looked in detail at building "vanderMoezel_1987", because it requires complex combinations of conversions. The trait "fire_time_from_fire_to_flowering" is input in years (a) and needs to be converted to months (mo). There are a mix of
value_type = bin
andvalue_type = mean
. This dataset also has the traitreproductive_maturity
, also with a mix ofvalue_type = bin
andvalue_type = mean
, but in this case the input and output units are years (a).I assumed there was a special character (e.g. <) causing the NA's or some problem with data from the wrong
value_type
being converted by the wrong line of code, but using browser and looking at the output after each line of code and filtering to different subsections of data, there are:1) no problems with the output, ever; no unwanted NA's in the value or units column; all values convert as expected 2) for this dataset it is the first line
value = ifelse(.data$to_convert ...
that creates the problem.