dataset_summarize() output > Dataset assessment 'value' shows incomplete/incorrect information.

twey2 commented 4 months ago

I'm not sure what the 'value' column in the dataset_summarize() output > Dataset assessment is supposed to show, but I think there is an error. For categorical variables, it sometimes shows all of the category values, separated by semicolons. Frequently, though, it shows only an incomplete list of the category values, even when all values appear in the variable summaries ('Variables summary (all)' and 'Categorical variable summary'). For example, in a dataset I just summarized, variable_3 has the values "Male" and "Female". Both show up correctly in 'Categorical variable summary', but in 'Dataset assessment' only "Female" shows up. For other variables, all or only some of the category values appear in 'Dataset assessment'. In at least one case, the value is modified/incorrect ("BModerna45O" is changed to "bmoderna45o"). It's unclear why the errors occur for some variables and not for others.

More generally, the 'value' column is currently confusing. It shows a mix of types of information (category values, the word "text" for text variables, etc.). Instead of 'value', maybe the column should be called "Description of content" or something else. In any case, I think the objective and presentation of this column should be re-evaluated.

GuiFabre commented 4 weeks ago

Hello @twey2, when you get to this issue, can you tell me whether the information has been corrected as expected? Thank you!

twey2 commented 3 weeks ago

I think this is the test/message mismatch we were just discussing today, so the question still applies to the current summary reports. There are two possible related tests/messages: one for the whole variable, another for specific values within the variable. There is already another test/message for "[INFO] - Categorical values present in dataset that do not match categorical values in data dictionary", so this message should correspond to the check of whether a variable is categorical in both the data dictionary (has categories in the Categories sheet) and the dataset (is a factor with defined levels). So it looks like the test script needs to be modified to match the message, but this remains to be verified.
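
To make the distinction between the two checks concrete, here is a minimal, language-neutral sketch in Python. It is not madshapR code: the function names, the status strings, and the plain-dict stand-ins for the Categories sheet and for factor levels are all hypothetical, and the value comparison is assumed to be exact and case-sensitive.

```python
def categorical_status(variable, dictionary_categories, dataset_levels):
    """Variable-level check: where is the variable declared categorical?

    dictionary_categories: hypothetical dict, variable name -> list of
        category values (stand-in for the Categories sheet).
    dataset_levels: hypothetical dict, variable name -> list of levels,
        or None when the column is not a factor.
    """
    in_dictionary = bool(dictionary_categories.get(variable))
    in_dataset = dataset_levels.get(variable) is not None
    if in_dictionary and in_dataset:
        return "categorical in both"
    if in_dictionary:
        return "categorical in data dictionary only"
    if in_dataset:
        return "categorical in dataset only"
    return "not categorical"


def unmatched_values(variable, dictionary_categories, dataset_levels):
    """Value-level check: dataset levels missing from the data dictionary.

    Assumption: matching is exact and case-sensitive, so no value is
    lowercased or otherwise altered before comparison.
    """
    declared = set(dictionary_categories.get(variable, []))
    levels = dataset_levels.get(variable) or []
    return [value for value in levels if value not in declared]


# Illustrative inputs only; "age" is a made-up non-categorical variable.
dictionary_categories = {"variable_3": ["Male", "Female"]}
dataset_levels = {"variable_3": ["Male", "Female"], "age": None}

for name in ("variable_3", "age"):
    status = categorical_status(name, dictionary_categories, dataset_levels)
    extra = unmatched_values(name, dictionary_categories, dataset_levels)
    print(name, status, extra)
```

Keeping the two functions separate mirrors the idea that each message should be backed by exactly one test: the first reports on the variable as a whole, the second lists specific values.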

maelstrom-research / madshapR

dataset_summarize() output > Dataset assessment 'value' shows incomplete/incorrect information. #80