tom-hc-park / STAT550-450-for-Seniorworkers-from-Korea


Data Pre-processing #8

Closed liuzhen529 closed 6 years ago

liuzhen529 commented 6 years ago

Hi,

I just uploaded my data pre-processing code to the 'code' folder. The main work I did was checking normality and missing values.

Based on our client's feedback, I created two new datasets: 1) data_priv: employees in the private sector; 2) data_both: employees in the private or public sector. Please feel free to use these two datasets for further analysis.

But we have a concern about the missing values, and it would be great if somebody could help! For 2), I deleted the rows that have a missing value (39 rows). We wonder whether we could use some method to handle these missing values instead of dropping them. The variable is categorical, so replacing the missing entries with the overall average would not be appropriate. Could we use an algorithm like KNN, or is there another method that could solve this?
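For reference, here is a minimal sketch of what KNN-style imputation of a categorical variable could look like: each missing value is filled with the majority category among the k nearest complete rows. The column names (`age`, `hours`, `sector`), the toy rows, and the helper `knn_impute_category` are all hypothetical and not taken from our data or the code in the 'code' folder.

```python
import math
from collections import Counter


def knn_impute_category(rows, target, features, k=3):
    """Return a copy of `rows` where missing (None) values of `target`
    are filled by majority vote among the k nearest complete rows."""
    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(dict(r))
            continue
        # Euclidean distance on the chosen numeric features.
        # In practice the features should be put on a common scale first.
        point = [r[f] for f in features]
        nearest = sorted(
            complete,
            key=lambda c: math.dist(point, [c[f] for f in features]),
        )[:k]
        vote = Counter(c[target] for c in nearest).most_common(1)[0][0]
        filled.append({**r, target: vote})
    return filled


# Hypothetical toy data: the last row has an unknown category.
workers = [
    {"age": 56, "hours": 40, "sector": "private"},
    {"age": 58, "hours": 42, "sector": "private"},
    {"age": 57, "hours": 41, "sector": "private"},
    {"age": 63, "hours": 30, "sector": "public"},
    {"age": 64, "hours": 28, "sector": "public"},
    {"age": 57, "hours": 41, "sector": None},
]

imputed = knn_impute_category(workers, "sector", ["age", "hours"], k=3)
```

A majority vote replaces the average that KNN would use for a numeric variable. Whether this is sensible depends on whether the other covariates actually predict the missing variable, which we have not checked.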

Looking forward to your reply. Thanks.

Best, Zhen

NSKrstic commented 6 years ago

I don't think you should delete those 39 observations from the data_both dataset, since doing so may introduce bias into your results. With data_both, we are drawing conclusions about senior workers in South Korea generally (rather than specifically public or private), so it shouldn't matter which sector the workers come from. This assumes the data were randomly sampled, or that the sample is representative of the population of senior workers. However, if you're working with the data_priv dataset, you should probably remove those observations, because we have no way of knowing whether they are individuals from the private sector.

Therefore, I don't think you really need to explore methods to handle the missing values for that variable. However, you should make note of this in the data overview within the final report.
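The handling suggested above can be sketched roughly as follows, assuming the variable with missing values is the sector indicator. The field name `sector`, its values, and the helper `split_datasets` are hypothetical and not taken from the actual data or code.

```python
def split_datasets(rows, sector_key="sector"):
    """Build the two analysis datasets from the raw rows.

    data_both keeps every row, including those with an unknown sector.
    data_priv keeps only rows confirmed to be in the private sector.
    """
    data_both = [dict(r) for r in rows]
    data_priv = [dict(r) for r in rows if r[sector_key] == "private"]
    # Count of unknown-sector rows, to report in the data overview.
    n_missing = sum(1 for r in rows if r[sector_key] is None)
    return data_both, data_priv, n_missing


# Hypothetical toy data with one unknown sector.
raw = [
    {"id": 1, "sector": "private"},
    {"id": 2, "sector": "public"},
    {"id": 3, "sector": None},
    {"id": 4, "sector": "private"},
]

data_both, data_priv, n_missing = split_datasets(raw)
print(f"data_both: {len(data_both)} rows, data_priv: {len(data_priv)} rows, "
      f"unknown sector: {n_missing}")
```

Returning the count of unknown-sector rows alongside the datasets makes it easy to state that number in the final report.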