maelstrom-research / Rmonize

3 stars 0 forks source link

Error in group_by in harmonized_dossier_visualize if dataschema provided #34

Closed twey2 closed 8 months ago

twey2 commented 8 months ago

Put this on hold for now. Testing something that might resolve the problem.

If I try to run harmonized_dossier_visualize on a pooled harmonized dataset while providing the DataSchema, I get the error "Error: Your grouping variable must be a categorical variable." The parameters I used were:

harmonized_dossier_visualize(
  pooled_harmonized_dataset = data_pooled,
  dataschema = ds,
  group_by = "adm_study_id",
  bookdown_path = "pooled_visual_report")

The variable adm_study_id is categorical, but I also tried with it as a factor. If I remove the 'dataschema = ' line, the function runs ok.

GuiFabre commented 8 months ago

Hi @twey2

Concerning the dataschema ds, did ensure the variable adm_study_id is declared in Categorical sheet ?

image

If yes, there is indeed a bug.

If not, you need to either re-write the data dictionary and add these lines manually in the dataschema, or use a quick fixer upper :

When you update a dataset (here the pool harmonized dataset), the data dictionary do not self update automatically. Indeed the data dictionary is declarative by design. So any change in the dataset does not affect the data dictionary, unless the user says so.

If you turn a variable into a factor, You need to re-apply the data dictionary to the dataset. The function data_dict_apply() does it for if no data_dict is provided (call it a 'data_dict self update' side effect of the function data_dict_apply()).

dataset = dataset %>% mutate(*modification here*) %>% data_dict_apply()

is a short cut for

dataset = dataset %>% mutate(*modification here*)
data_dict = data_dict_extract(dataset)
dataset = data_dict_apply(dataset, data_dict)

image

Can you try this and tell me if it works for you ?

twey2 commented 8 months ago

Thanks @GuiFabre ! It was exactly the issue that the groups had to be added to the Categories. @a-trottier helped point this out. I will close this issue, but a possible enhancement to add a suggestion to check this in the error message and good to keep in mind for best practices as some projects tend to use text for the study IDs.