Closed gihunjin closed 1 year ago
Hi @gihunjin this is a great question. Categorical variables can be especially tricky. I would recommend you use the SDV library's CTGANSynthesizer. This wraps around the CTGAN
library and adds some convenient features for preprocessing.
For a Likert scale, I think this depends on how your data is stored:
- As numbers (1, 2, 3 ...): I'd recommend treating these as numerical.
- As strings ("Strongly Agree", "Agree", etc.): In this case, you'll have to mark it as categorical. But if you use the SDV, you can customize the preprocessing for this column. I'd recommend the OrderedLabelEncoder.

Both of the above options would ultimately end up with the same data quality.
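To illustrate why the two options end up equivalent, here is a rough sketch of what an ordered label encoding does conceptually: each label is mapped to its position in a known order, and numeric values are mapped back by rounding and clipping. This is plain Python, not the SDV or RDT implementation, and the label wording in `LIKERT_ORDER` plus the helper names `encode_ordered` and `decode_ordered` are assumptions made up for this example.

```python
# Hypothetical label order for a 5-point Likert item; adjust to your survey's wording.
LIKERT_ORDER = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]


def encode_ordered(values, order):
    """Map each label to its 0-based position in `order`."""
    index = {label: i for i, label in enumerate(order)}
    return [index[v] for v in values]


def decode_ordered(codes, order):
    """Map numeric codes back to labels, rounding and clipping to the valid range."""
    last = len(order) - 1
    return [order[min(max(int(round(c)), 0), last)] for c in codes]


responses = ["Agree", "Neutral", "Strongly Agree", "Disagree"]
codes = encode_ordered(responses, LIKERT_ORDER)   # numeric view of the answers
restored = decode_ordered(codes, LIKERT_ORDER)    # back to the original labels
```

Once the strings are encoded this way, the column carries the same information as a Likert item that was stored as numbers from the start, which is why either storage format should give comparable results.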
PS. We also recently wrote a blog post about it. Check it out here and let us know what you think.
Hello, I'm closing off this issue since our last discussion was a month ago. I think it is resolved since we answered the question?
Please feel free to reply if there is more to discuss. We can always reopen the issue for investigation.
I'm going to create synthetic data through CTGAN with 42 questions on the 5-point Likert scale and 8 questions of categorical variables. In my analysis, should the Likert scale be treated as either discrete or continuous?