maelstrom-research / Rmonize

3 stars 0 forks source link

harmo_process() error(s) using Rmonize_examples() objects #97

Open twey2 opened 4 weeks ago

twey2 commented 4 weeks ago

Running harmo_process() with Rmonize_examples() objects produces an ERROR in an algorithm that is written correctly in the DPE and produces some other issues. (Package versions: Rmonize_1.1.0.9106, madshapR_1.1.0.9111, fabR_2.1.0.9103)

Running the following script:

harmonized_dossier <- harmo_process(
  dossier,
  dataschema,
  data_proc_elem,
  harmonized_col_dataset = 'adm_study_id')

Produces the following processing error in dataset_study5: {080FDC5A-2C93-4A56-B064-AA209D50FC06} {88AB16AD-1C58-4477-9C42-434EA16518A0}

The Mlstr_harmo::algorithm appears to be written correctly. The input variable appears to be formatted as a mixed variable (categorical and non-categorical values present). {D5A8478A-128C-43DB-ACBF-59C1E5ADA41D}

The error is not reproduced if the algorithm script is run on the input variable directly in R. {59AEE055-B6C5-4FD6-BC39-A697F312DE26}

In addition, the harmonization process incorrectly prints a message about undetermined statuses (there are none). {91E6DB3C-83C1-4AC3-92AC-22CB11594129}

There are no warning messages about the processing error in dataset_study5. In the harmonized dossier, dataset_study5 is (correctly) empty.

GuiFabre commented 2 weeks ago

Hello @twey2 the bug has been corrected. The problem came from madshapR data_dict_zap_dataset() function which aims at separating the data dictionary from its associated dataset. An interesting case come with a factor variable (or haven_labelled variable in haven). If the the variable remains factor, then (technically) the data dictionary for its particular variable remains (the categories are levels/labels). If not, the column is a regular column, coerced with its associated valueType (no more category/level). Both of the scenarios has its pros and cons.

Originally the default behavior was to drop the levels, then it adressed some issues in madshapR (grouping variables were cast to factor, then unfactored somewhere else in the function summarize()). So, to avoid that, the default behavior was to keep them (a factor is intrinsic to a variable, whether there is a dictionary or not).

Then it consequently adressed other issues in Rmonize (variable to be harmonize kept their factor status, so it is harder to work with it, such as with floor(var) function). But the user shouldn't be bothered with such problems (a factor is intrinsic to R objects and not the values, and the harmonization should work whether the column is a factor or not. Only the value matters)

So to make peace with both scenarios, a parameter zap_factor has been added (working similarly with stringAsFactor = TRUE/FALSE) which preserves (by defaut) the factor, make it work everywhere in both packages ;

and allows to drop them (internally) in the function harmo process.
twey2 commented 1 week ago

This error has been fixed in my most recent check. (Package versions Rmonize_1.1.0.9107, madshapR_1.1.0.9115, fabR_2.1.0.9103) But, lubridate must be loaded otherwise there is an error in study5 sdc_age_m algorithm. Can be fixed by changing algorithm to use "lubridate::years".

GuiFabre commented 1 week ago

Yes, lubridate must be loaded to make the dpe work. This is why it appears in the Suggests, and in 'library(lubridate)' is explicitly called in the example.

thanks !

twey2 commented 1 week ago

I think it would be better to change "years" to "lubridate::years" in the algorithm anyway? The algorithm does already use "lubridate::interval".

GuiFabre commented 1 week ago

yes, sure !