dataset_visualize() report bar plot for open-text variables shows duplicate values differing by case

zchenmr commented 9 months ago

The dataset_visualize() report generates bar plots for open-text variables that sometimes show the same value (with differing capitalization) multiple times.

zchenmr commented 7 months ago

R Maelstrom: madshapR v1.0.4.1005

Values are now grouped together regardless of capitalization, but there are still duplicate values due to special symbols. I'm not sure if these should be differentiated or not - I guess in some cases the accent could change the meaning of the word? Not sure how common that would be though.

GuiFabre commented 7 months ago

Thank you for your contribution! I would suggest that grouping upper and lower case together, while keeping accented forms separate, is the expected behaviour. Indeed, in French, many words have different meanings with or without accents, regardless of case. Let's say, in a hospital:

"un interne tue" (an intern kills), "un interné tue" (a committed patient kills), "un interne tué" (a killed intern), "un interné tué" (a killed committed patient)

are four completely different stories :)
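
As a minimal base-R sketch of that behaviour (not madshapR's actual implementation): `tolower()` merges case variants but leaves accented characters untouched, so the four sentences above stay distinct. The `answers` vector is made-up illustration data.

```r
# Made-up open-text answers: case variants of the four sentences above
answers <- c("Un interne tue", "un interne tue", "UN INTERNE TUE",
             "un interné tue", "un interne tué", "Un interné tué")

# Lowering the case merges "Un interne tue", "un interne tue" and
# "UN INTERNE TUE", while the accented forms remain separate values
counts <- table(tolower(answers))
print(counts)
```

Stripping accents as well (for example by transliterating to ASCII) would collapse all six answers into one value, which is exactly the loss of meaning described above.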

GuiFabre commented 7 months ago

Hello @zchenmr. I made a small change related to this topic:

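  library(dplyr)     # provides mutate(), %>% and as_tibble()
  library(madshapR)

  # iris has 150 rows; both columns below mix the case of the same value
  dataset <- as_tibble(iris) %>%
    mutate(
      Species = c(rep("Setosa", 50), rep("SETOSA", 50), rep("setosa", 50)),
      var     = c(rep("Aa", 25), rep("aA", 85), rep("aa", 40)))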

When a variable is declared as a group, its case is kept, in line with its data dictionary declaration ("Setosa" and "SETOSA" are different).

When a variable is declared as a category, its case is kept, in line with its data dictionary declaration ("aa" and "AA" are different).

When a variable is declared as text (in any case, not as a category), its values are converted to lower case to avoid duplicated entries.

In a nutshell: a category keeps its case, a text variable does not.
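
This rule can be sketched in base R as follows. It only illustrates the described behaviour and is not how madshapR implements it; `count_values()` and its `is_category` argument are hypothetical names introduced here.

```r
# Hypothetical helper (not part of madshapR): count the distinct values
# of a character vector, lowering the case only for open-text variables.
count_values <- function(x, is_category = FALSE) {
  if (!is_category) x <- tolower(x)  # text: merge case variants
  table(x)                           # category: values counted as declared
}

# Same values as the `var` column in the example dataset
var <- c(rep("Aa", 25), rep("aA", 85), rep("aa", 40))

count_values(var)                      # text: everything collapses to "aa"
count_values(var, is_category = TRUE)  # category: "Aa", "aA" and "aa" stay separate
```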

Based on that,

  variable_visualize(
    dataset = dataset,  # var is a text
    col = "var",
    group_by = "Species")

  variable_visualize(
    dataset = dataset %>% mutate(var = as_category(var)),  # var is a category
    col = "var",
    group_by = "Species")

I hope this meets the need you highlighted without changing the data dictionary declaration, if one is declared.

zchenmr commented 7 months ago

That sounds great, thanks!

maelstrom-research / madshapR

dataset_visualize() report bar plot for open-text variables shows duplicate values differing by case #63