Data Clarification - Githubissues

liuzhen529 commented 6 years ago

Hi, everyone. Hope you have a good reading break. I am now working on identifying significant predictors of the proficiency test scores. Before starting variable selection, I looked at the summary statistics first and found some issues:

1. Participation in Education: In our data set, there are five variables about participation in education. Three of them are participation in job-related adult education (FNFAET12JR), participation in non-job-related adult education (FNFAET12NJR), and participation in adult education (FNFAET12). I found that numerically FNFAET12 = FNFAET12JR + FNFAET12NJR in our data set. I think keeping all three might be an issue since they are highly correlated. Is it okay if we drop FNFAET12 and use FNFAET12JR and FNFAET12NJR as our independent variables?

2. Numeracy skill usage for work (num_use): The maximum of 'num_use' should be 5, but there is one observation whose 'num_use' is 7. I checked the other variables in this observation and they are all within their ranges, so I guess 7 may be a data-entry error. I am not sure whether we can delete this observation or not.

'num_use' is one of the response variables that we want to investigate. When investigating other responses like 'proficiency test score', I guess it would be okay if we keep this observation since we would not use 'num_use' in this case.

3. Education level (ED_Level): In the data description, there are only four levels (1, 2, 3, 4: primary, middle, high, college), but one ED_Level value is 8. I checked the codebook and it shows that 8 means “Tertiary – research degree (ISCED 6)”. Only one observation has the value 8 for this categorical variable, and I am not sure if we need to delete it.
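The three checks above can be sketched in base R roughly as follows. This is only an illustration on a made-up data frame (`toy` and all of its values are invented); the column names are the ones mentioned in this thread, and the valid ranges (num_use up to 5, ED_Level in 1 to 4) are taken from the description above.

```r
# Toy data standing in for the real data set (all values are invented).
toy <- data.frame(
  FNFAET12JR  = c(1, 0, 0, 1),
  FNFAET12NJR = c(0, 1, 0, 0),
  FNFAET12    = c(1, 1, 0, 1),
  num_use     = c(1, 3, 7, 5),
  ED_Level    = c(1, 4, 8, 2)
)

# 1. Does the overall indicator equal the sum of the two component indicators?
sum_holds <- all(toy$FNFAET12 == toy$FNFAET12JR + toy$FNFAET12NJR)

# 2. Rows where num_use exceeds the documented maximum of 5.
bad_num_use <- which(toy$num_use > 5)

# 3. Rows where ED_Level is not one of the four documented levels.
bad_ed_level <- which(!(toy$ED_Level %in% 1:4))

print(sum_holds)
print(toy[bad_num_use, ])
print(toy[bad_ed_level, ])
```

Printing the flagged rows, rather than dropping them right away, makes it easy to inspect the other variables of the same observation before deciding what to do.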

Looking forward to suggestions!! Thanks a lot:)

Best, Zhen

NSKrstic commented 6 years ago

Hi Zhen,

1. Yes, it is true that participation in adult education is the union of the other two variables. And yes, it's also true that these variables are likely highly correlated. That being said, although it would be "fine" to use FNFAET12, one of the other two variables may turn out to be a better predictor in your model. This would be best investigated in your exploratory analysis. Since these variables are binary, you can create boxplots and conduct t-tests to determine which variables to include/exclude. It could be that "num_use" is not significantly different between the categories of FNFAET12, but is significantly different between the categories of FNFAET12JR or FNFAET12NJR.
2. I recommend excluding the observation, since it's likely an artefact of the data. However, this should be noted in the methodology section of your report (and maybe the discussion as well). For the analysis that we'll conduct on proficiency test scores, then yes, it's likely fine to keep it.
3. That does sound a little suspicious... Also, I noticed in the codebook that 8 can represent one of two categories (EDCAT7 and EDCAT8):

“Tertiary – research degree (ISCED 6)”
“Tertiary – bachelor/master/research degree (ISCED 5A/6)”

I'm more reluctant to remove this observation, but the fact that you only have one observation for a single category means you'll end up with a pretty poor estimate of the corresponding effect. It may be best to classify this as category 4 since it represents post-secondary education.
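As a rough sketch of the two suggestions above (a t-test of "num_use" across a binary participation variable, and folding ED_Level 8 into category 4), something like the following could be used. The data frame `toy` and its values are invented for illustration, and `t.test` here runs R's default Welch two-sample test, which is one reasonable choice rather than the required one.

```r
# Toy data standing in for the real data set (all values are invented).
toy <- data.frame(
  FNFAET12JR = c(0, 0, 0, 0, 1, 1, 1, 1),
  num_use    = c(1, 2, 2, 3, 3, 4, 4, 5),
  ED_Level   = c(1, 2, 3, 4, 4, 3, 8, 2)
)

# Compare num_use between the two categories of a binary predictor.
# boxplot(num_use ~ FNFAET12JR, data = toy) gives the matching picture.
tt <- t.test(num_use ~ FNFAET12JR, data = toy)
print(tt$estimate)  # group means
print(tt$p.value)

# Fold the single ED_Level == 8 observation into category 4
# (post-secondary education) instead of deleting the row.
toy$ED_Level[toy$ED_Level == 8] <- 4
print(table(toy$ED_Level))
```

Repeating the t-test for FNFAET12 and FNFAET12NJR would show which of the three variables separates "num_use" best.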

Cheers, Nikolas

liuzhen529 commented 6 years ago

Thank you so much! Your words are really helpful!!

Also, I found that these two columns are exactly the same: FNFE12JR and FNFAET12JR (job-related education vs job-related adult education). Shall we just delete one of them?

Best, Zhen

tom-hc-park commented 6 years ago

If the two covariates are exactly the same, without a single difference, I think it is okay to delete one. I guess the two variables are identical because the client narrowed the data set down to contain only senior workers in Korea. Senior workers would almost surely be in adult education... I don't think they would be in childhood education...
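A minimal way to confirm that the two columns really are identical before dropping one, shown on an invented data frame (`toy` and its values are made up; only the column names come from this thread):

```r
# Toy data standing in for the real data set (all values are invented).
toy <- data.frame(
  FNFE12JR   = c(1, 0, 1, NA),
  FNFAET12JR = c(1, 0, 1, NA),
  num_use    = c(2, 3, 4, 1)
)

# identical() compares the whole vectors, including the positions of NAs.
same <- identical(toy$FNFE12JR, toy$FNFAET12JR)

# Only drop the duplicate if the columns match exactly.
if (same) {
  toy$FNFE12JR <- NULL
}
print(names(toy))
```

Using `identical()` instead of `all(a == b)` avoids getting `NA` back when either column contains missing values.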

KellyHu commented 6 years ago

I found the same issue... I already checked the document and there was no information about these two columns.

Lindaaaaaa commented 6 years ago

Another issue I found is with Years_wk (years of work): it has a strange value, 96.

unique(newdata$Years_wk)
 [1] 23 15 24 28 27 30 25 5 22 11 6 2 32 18 20 31 1 16 3 7 34 35 10 26 4 8 19 33 37 17 0 9 29
[34] 12 21 38 13 36 40 39 14 96 43 45 44 41 42 47 46

It's not really possible for someone to have worked for 96 years, because we only consider people aged 50 to 65. Shall we exclude those rows?

KellyHu commented 6 years ago

The pub_priv variable has 39 missing values among the 1247 observations, which is around 3% of the original data. I googled methods for dealing with missing values and found that we can guess the values, fill them in with the average, or ignore them. In our case, I think ignoring them is the most suitable option since only around 3% is missing. Let me know if you have any advice on dealing with the missing values in pub_priv.

NSKrstic commented 6 years ago

Well, pub_priv is just the indicator of whether an individual worked in the private or public sector, correct? Then if you decide to work on the pooled data (public and private), this shouldn't be a problem (and pub_priv shouldn't be a predictor in your model). Otherwise, if you are only working with the private sector data, don't include those 39 since we don't know which class they belong to. However, once again, make note of this in the methods or discussion of your report (preferably methods).
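The two options above could look roughly like this in base R. The data frame `toy`, the `score` column, and the coding of pub_priv as "private"/"public" are all assumptions made for illustration; the real data may code the sector differently.

```r
# Toy data standing in for the real data set (all values are invented;
# the "private"/"public" coding is an assumption).
toy <- data.frame(
  pub_priv = c("private", "public", NA, "private", NA),
  score    = c(250, 270, 260, 240, 255)
)

# How many values are missing, and what share of the data is that?
# (In the real data this would be 39 of 1247, roughly 3.1%.)
n_missing     <- sum(is.na(toy$pub_priv))
share_missing <- mean(is.na(toy$pub_priv))

# Pooled analysis: keep every row and leave pub_priv out of the model.
pooled <- toy

# Private-sector analysis: rows with unknown sector cannot be assigned,
# so they are left out along with the public-sector rows.
private_only <- toy[!is.na(toy$pub_priv) & toy$pub_priv == "private", ]

print(n_missing)
print(share_missing)
print(nrow(private_only))
```

The explicit `!is.na()` matters here: subsetting with `toy$pub_priv == "private"` alone would return all-NA rows for the missing entries.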

The 96 years of work is also an artefact. You can set this to NA but keep the observation. It may be that "Years_wk" is not a good predictor in your model, so you don't want to throw out the observation since it could still contribute information for your other predictors.
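Setting the artefact to NA while keeping the row might be sketched like this. The data frame `toy` and the `score` column are invented, and the cutoff of 65 is an assumption based on the maximum age in the sample (nobody aged 50 to 65 can have worked more than 65 years).

```r
# Toy data standing in for the real data set (all values are invented).
toy <- data.frame(
  Years_wk = c(23, 15, 96, 40),
  score    = c(250, 270, 260, 240)
)

# Assumed plausibility cutoff: the oldest respondents are 65.
max_plausible_years <- 65

# Replace implausible values with NA but keep the observation itself.
# which() skips any values that are already NA.
toy$Years_wk[which(toy$Years_wk > max_plausible_years)] <- NA

print(toy)
print(mean(toy$Years_wk, na.rm = TRUE))
```

The row stays in the data, so its other variables still contribute to any model that does not include Years_wk.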

The same goes for the "num_use" artefact. Keep the observation, since it shouldn't affect your models for the proficiency scores.

tom-hc-park / STAT550-450-for-Seniorworkers-from-Korea

Data Clarification #14