Candy Thoughts - Githubissues

I've now mostly cleaned this dataset and want to share a few of the thoughts and questions I had along the way. Hopefully others can weigh in here and suggest the best way to proceed:

Candy Names: after converting to a long format, candy name becomes a character variable and therefore the candy names could be anything. However, some are long or contain odd characters (e.g. smart quotes) and I think some cleaning is in order. To what extent should they be cleaned? Can it be done in an automated fashion or should we just do it by hand for the problem cases? For the time being I've removed all non-alphanumeric characters and converted to lowercase.
Degrees of Separation: I've stored this as a separate table in long format (id, person, degrees). Degrees can be 1, 2, or >3. Should degrees be a factor with three levels or an integer with values 1-3? The latter requires grouping >3 into just 3. For the time being I've kept 2 variables, one factor and one integer.
Age: what ages should be considered unreasonable? There are several outliers over 80. Are they from real older people or are they just junk data? I'm inclined to think the latter, and I'd rather have no data than bad data, so I've set ages over 80 to NA. Thoughts?
# of Mints: lots of non-numeric data that I've set to NA.
Logical vs. 2-level Factor: there are several questions that have two possible answers: Friday/Sunday, Betty/Veronica, Blue&Black/White&Gold. They could be left as character/factor or converted to logical T/F values. Which is better?
Intelligent Design: this field allows two options, plus a third "other" field entered as free text. For the time being, I'm ignoring the text in the other field and storing this as a 3-level factor: interior design, bullshit, and other. I don't see any value in keeping the free text. Thoughts?
Tears of Sadness: this field comes from a set of 5 check boxes and is stored as the values of the chosen boxes separated by commas. I see two ways to store these data: as a separate 3 column table (id, thing, caused_tears (T/F)) or within the table of respondent level data as 5 T/F columns.
There are 4 totally free text fields that I'm pretty much ignoring for the time being: other joy, other despair, comments, and fonts.

Candy Names: after converting to a long format, candy name becomes a character variable and therefore the candy names could be anything. However, some are long or contain odd characters (e.g. smart quotes) and I think some cleaning is in order. To what extent should they be cleaned? Can it be done in an automated fashion or should we just do it by hand for the problem cases? For the time being I've removed all non-alphanumeric characters and converted to lowercase.

In my yet-to-be-exposed cleaning thus far, I have only replaced the Smart Quotes (and stripped the leading and trailing []). As for further cleaning, it's hard to get a good result with automation here. Since there are only ~100 candies, I thought about hand curating a table of "original values" and "clean values". Quite possibly you'd also want a truly short form for certain tables and figures?
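To make the two approaches concrete, here is a minimal sketch in Python, for illustration only. It shows the light cleaning (smart quotes replaced, brackets stripped), the aggressive automated cleaning (non-alphanumerics removed, lowercased), and a hand-curated lookup for short forms. The function names, the `CURATED` table, and the candy labels in it are hypothetical and are not taken from the actual cleaning code or data.

```python
import re

# Map typographic ("smart") quotes to their plain ASCII equivalents.
SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})

# Hypothetical hand-curated table: lightly cleaned name -> short form.
# The entry is a made-up illustration, not a row from the real survey.
CURATED = {
    "Reese's Peanut Butter Cups": "reeses_cups",
}


def light_clean(raw):
    """Replace smart quotes and strip one leading '[' and one trailing ']'."""
    text = raw.translate(SMART_QUOTES).strip()
    if text.startswith("["):
        text = text[1:]
    if text.endswith("]"):
        text = text[:-1]
    return text.strip()


def auto_clean(raw):
    """Fully automated variant: drop non-alphanumerics, then lowercase."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).lower()


def short_name(raw):
    """Use the curated short form if one exists, else the lightly cleaned name."""
    clean = light_clean(raw)
    return CURATED.get(clean, clean)
```

With only about 100 candies, the curated table stays small, and any name missing from it falls back to the lightly cleaned form rather than failing.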

Degrees of Separation: I've stored this as a separate table in long format (id, person, degrees). Degrees can be 1, 2, or >3. Should degrees be a factor with three levels or an integer with values 1-3? The latter requires grouping >3 into just 3. For the time being I've kept 2 variables, one factor and one integer.

I've been ignoring these variables.

Age: what ages should be considered unreasonable? There are several outliers over 80. Are they from real older people or are they just junk data? I'm inclined to think the latter, and I'd rather have no data than bad data, so I've set ages over 80 to NA. Thoughts?

I'd probably keep anything compatible with typical human life span. Why not?
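As a rough sketch of that rule in Python: parse the raw answer, and keep it only if it falls inside a plausible human range. The upper bound of 120 is an assumption picked for illustration, not a value agreed on in this thread, and `parse_age` is a hypothetical helper name. `None` stands in for NA.

```python
# Assumed upper bound for a plausible human age; adjust as the group decides.
MAX_PLAUSIBLE_AGE = 120


def parse_age(raw, max_age=MAX_PLAUSIBLE_AGE):
    """Return the age as a float, or None (standing in for NA).

    Non-numeric answers, non-positive values, NaN, and values above
    max_age all become None.
    """
    try:
        age = float(raw.strip())
    except (ValueError, AttributeError):
        # ValueError: not a number; AttributeError: raw is not a string.
        return None
    # NaN fails every comparison, so it is rejected here as well.
    if not (0 < age <= max_age):
        return None
    return age
```

Under this sketch an answer of "85" is kept, while "350" or free text becomes NA. The same parse-or-NA pattern would also cover the non-numeric mint counts.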

# of Mints: lots of non-numeric data that I've set to NA.

Agree.

Logical vs. 2-level Factor: there are several questions that have two possible answers: Friday/Sunday, Betty/Veronica, Blue&Black/White&Gold. They could be left as character/factor or converted to logical T/F values. Which is better?

Not sure it matters much but these I'd probably leave as two-level factors because of the good things you get for free in figures, etc.

Intelligent Design: this field allows two options, plus a third "other" field entered as free text. For the time being, I'm ignoring the text in the other field and storing this as a 3-level factor: interior design, bullshit, and other. I don't see any value in keeping the free text. Thoughts?

Agree. The free text stuff is a quagmire.

Tears of Sadness: this field comes from a set of 5 check boxes and is stored as the values of the chosen boxes separated by commas. I see two ways to store these data: as a separate 3 column table (id, thing, caused_tears (T/F)) or within the table of respondent level data as 5 T/F columns.

Agree. You could imagine analyses that are easier with either form.
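For reference, both layouts can be derived from the same comma-separated field, so the choice does not have to be final. A minimal Python sketch, assuming the raw data is a mapping from respondent id to the stored string; the function names are hypothetical, and the checkbox labels are passed in as an argument because the real ones are not listed here.

```python
def _chosen(raw):
    """Split the stored string into the set of ticked checkbox labels."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def to_long(responses, things):
    """Long form: one (id, thing, caused_tears) row per respondent and checkbox."""
    rows = []
    for rid, raw in responses.items():
        chosen = _chosen(raw)
        for thing in things:
            rows.append((rid, thing, thing in chosen))
    return rows


def to_wide(responses, things):
    """Wide form: one True/False column per checkbox for each respondent."""
    return {
        rid: {thing: thing in _chosen(raw) for thing in things}
        for rid, raw in responses.items()
    }
```

One caveat: splitting on commas breaks if any checkbox label itself contains a comma, in which case the labels would need to be matched as whole strings instead.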

There are 4 totally free text fields that I'm pretty much ignoring for the time being: other joy, other despair, comments, and fonts.

Yes, a mess! I think the fonts are the closest to being useful and interesting to analyze.

jennybc / candy

Candy Thoughts #1