sdv-dev / CTGAN

Conditional GAN for generating synthetic tabular data.

Should a 5-point Likert scale be treated as continuous or discrete? #289

Closed gihunjin closed 1 year ago

gihunjin commented 1 year ago

I'm going to create synthetic data with CTGAN from a survey that has 42 questions on a 5-point Likert scale and 8 categorical questions. For this use case, should the Likert-scale columns be treated as discrete or continuous?

npatki commented 1 year ago

Hi @gihunjin, this is a great question. Categorical variables can be especially tricky. I would recommend using the SDV library's CTGANSynthesizer, which wraps the CTGAN library and adds some convenient preprocessing features.

For a Likert scale, I think this depends on how your data is stored:

  1. As ordered numbers (1, 2, 3 ...): I'd recommend treating these as numerical.
  2. As strings (e.g. "Strongly Agree", "Agree", etc.): In this case, you'll have to mark it as categorical. But if you use the SDV, you can customize the preprocessing for this column. I'd recommend the OrderedLabelEncoder.

Both of the above options would ultimately end up with the same data quality.
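To make the idea behind the two options concrete, here is a minimal plain-Python sketch of an ordered label encoding. It is only an illustration, not the SDV or RDT implementation: the five labels, their order, and the helper names `encode` and `decode` are assumptions chosen for the example.

```python
# Hypothetical label order, lowest to highest; adjust to the actual survey wording.
LIKERT_ORDER = [
    "Strongly Disagree",
    "Disagree",
    "Neutral",
    "Agree",
    "Strongly Agree",
]

# Map each label to its rank (1-5) and back again.
LABEL_TO_RANK = {label: rank for rank, label in enumerate(LIKERT_ORDER, start=1)}
RANK_TO_LABEL = {rank: label for label, rank in LABEL_TO_RANK.items()}


def encode(responses):
    """Turn Likert labels into ordered integers (the option 1 representation)."""
    return [LABEL_TO_RANK[response] for response in responses]


def decode(values):
    """Turn numeric values back into labels, rounding and clipping to the 1-5 range."""
    lowest, highest = 1, len(LIKERT_ORDER)
    return [RANK_TO_LABEL[min(highest, max(lowest, round(value)))] for value in values]


encoded = encode(["Agree", "Strongly Agree", "Neutral"])
print(encoded)                    # [4, 5, 3]
print(decode([3.6, 0.2, 5.9]))    # ['Agree', 'Strongly Disagree', 'Strongly Agree']
```

The point of the sketch is that once the labels carry an explicit order, string responses and numeric responses hold the same information, which is consistent with both options giving comparable results.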

PS. We also recently wrote a blog post about it. Check it out here and let us know what you think.

npatki commented 1 year ago

Hello, I'm closing this issue since our last discussion was a month ago. I think it is resolved since we answered the question.

Please feel free to reply if there is more to discuss. We can always reopen the issue for investigation.