maelstrom-research / Rmonize

3 stars 0 forks source link

Variable names with special characters not recognized during harmo_process #27

Closed zchenmr closed 8 months ago

zchenmr commented 8 months ago

R CITF: Rmonize v1.0.0.9004, madshapR 1.0.2.9004, fabR 2.0.0

In the past, it was possible to run harmo_process() using variables with non-standard names (ex: containing spaces, hyphens) as long as the variable name was surrounded by backticks. However, this now leads to an error. Is there any way to run harmo_process() while keeping the original variable names or would it be recommended to change the names prior to harmonization instead?

image

image

image

zchenmr commented 8 months ago

Actually, I'm not sure if this issue is being caused by the backticks - I got the same error when running harmo_process() on R CanPath with variables that have standard naming. With the exact same harmonization inputs (DataSchema, harmo rules, dataset), harmonizR::harmo_process() runs without any errors but Rmonize::harmo_process() leads to errors:

image

GuiFabre commented 8 months ago

Good issue here. The reason was because (originally) the dataset names must be opal-compatible. As time goes by, this restriction is less and less mandatory (the functions have been changed accordingly.

I will try to find a solution for that.

GuiFabre commented 8 months ago

@zchenmr done to test

zchenmr commented 8 months ago

R CITF/CanPath: Rmonize v1.0.0.9005, madshapR v1.0.2.9007, fabR 2.0.0

One of the errors (for dis_covid_sr_otsp) no longer appears, but there are still errors for the other two:

image image
GuiFabre commented 8 months ago

Hi @zchenmr, and thank you for your comment. unfortunately (or fortunately hehe), it does not come from the package, but from your data processing elements

image

The function harmo_process (to spare some useless steps) creates a subset of your dataset with variables only used in the DPE. The declaration of them is in input_variables . You forgot to put C3_PM_WEIGHT_CA in the list.

Today there is no other way to understand the error from what is provided in show_harmo_error(), but that'll be in the future with the function data_proc_elem_evaluate. Hopefully...

For the other one, I will keep investingating.

GuiFabre commented 8 months ago

Should be ok now !

for your information : in your DPE, in the column input_variables, each variable is actually the name of the variable, so (technically) you must not add the backtick ` around your variable. (the same if for the column name in the data dictionary).

image

While in the column Mlstr_harmo::algorithm, each variable used is the column itself. That is why you need the backtick ` around them. That is why COVID-self_report generated the error, it was double backticked internally. It does not through an error anymore, since, to the user point of you, the might know the difference between the name of a variable and a variable itself.

zchenmr commented 8 months ago

When the backtick is removed, recode now works but direct_mapping doesn't - it still requires the backtick around the variable in "input_variables".

image

Also, thanks for the tip about the missing variable names for the other error! That issue has been resolved.

GuiFabre commented 8 months ago

hoping this time it is working ;)

zchenmr commented 8 months ago

Yes, direct_mapping is working now. Thanks a lot!