nestauk / skills-taxonomy-v2

new skills taxonomy using TextKernel data
MIT License

Re-run some analysis, and investigate and fix tk sample #78

Closed lizgzil closed 2 years ago

lizgzil commented 2 years ago

Addresses #72

Re-doing analysis pieces

Date issue

In rerunning the sample analysis I found that the sample size distribution of dates looked unusual. Green bars:

Screenshot 2021-12-24 at 00 43 03

I found this was because reduce_embeddings.py does not use repeated sentences: the first time a sentence is seen it is added to the data, but later occurrences are not, which skews the data towards the earlier years. A mapper is created in skills_taxonomy_v2/pipeline/tk_data_analysis/get_duplicate_sentence_skills.py, which for the analysis pieces means we can add the extra job ids whose sentences had already been seen elsewhere, and link them to the cluster that the repeated sentence was given.
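As a rough illustration of the idea (not the code in get_duplicate_sentence_skills.py), the toy sketch below keeps the first occurrence of each sentence, records the job ids of the repeats, and then gives those job ids the cluster of the first occurrence. The record layout, the variable names and the cluster numbers are all made up for the example.

```python
from collections import defaultdict

# Hypothetical toy data: (job_id, year, sentence). Not the real TextKernel schema.
records = [
    ("job_a", 2019, "experience with python"),
    ("job_b", 2020, "experience with python"),
    ("job_c", 2021, "experience with python"),
    ("job_d", 2021, "strong communication skills"),
]

# Keep only the first occurrence of each sentence (as a deduplicating step would)
# and remember every later job id that repeats it.
first_seen = {}
duplicates = defaultdict(list)
for job_id, year, sentence in records:
    if sentence not in first_seen:
        first_seen[sentence] = job_id
    else:
        duplicates[sentence].append(job_id)

# Hypothetical cluster assignments for the deduplicated sentences.
sentence_cluster = {"experience with python": 0, "strong communication skills": 1}

# Link every job id, including the repeats, to the cluster of the sentence.
job_clusters = defaultdict(set)
for sentence, cluster in sentence_cluster.items():
    for job_id in [first_seen[sentence], *duplicates[sentence]]:
        job_clusters[job_id].add(cluster)
```

Without the `duplicates` mapper only job_a and job_d would carry a cluster, so the 2020 and 2021 adverts that repeat the first sentence would be missing from any date analysis.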

After doing this we find a better distribution of dates:

Screenshot 2021-12-24 at 11 58 38

However, the post-COVID dates are still underrepresented. So in skills_taxonomy_v2/pipeline/tk_data_analysis/get_no_texts_tk_data.py we find all the job ids which have no full text data.
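A minimal sketch of this kind of check, assuming each advert is a dict with hypothetical `job_id` and `full_text` keys (the real field names in the TextKernel data may differ):

```python
def job_ids_without_text(adverts):
    """Return the ids of adverts whose full text is missing or blank.

    The 'job_id' and 'full_text' keys are assumptions made for this example.
    """
    return {
        advert["job_id"]
        for advert in adverts
        if not (advert.get("full_text") or "").strip()
    }


# Toy usage with made-up adverts.
toy_adverts = [
    {"job_id": "job_a", "full_text": "We are looking for a data scientist."},
    {"job_id": "job_b", "full_text": ""},
    {"job_id": "job_c"},
]
missing = job_ids_without_text(toy_adverts)
```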

After investigation, we found that expired job adverts made up some of the sample. These don't have full text fields, and therefore give no skill sentences. Since their job advert ids are often mentioned elsewhere in the data, they were still being linked with dates, so the sample looked representative. Thus, it was only when the skill sentences were analysed that it became clear there wasn't as much data from 2020/21.

Screenshot 2022-01-10 at 17 45 46

To mitigate this, we adapted get_tk_sample.py to replace the sampled job adverts from expired files with ones from the same folder (thus keeping the same date distribution), but from the non-expired file. This affects 14% of the sample. A nicer fix would be to resample the entire 5 million job adverts using only job adverts not in the expired files. However, since finding skill sentences and embeddings is a costly process and 86% of the sample is fine, we chose to correct only the 14%. Thus, in predict_sentence_class_inc_replacements.py we predict the skill sentences from only the replaced job adverts and append them to a copy of the original 86% of job advert skill sentences.
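The replacement step can be sketched roughly as below. This is an illustration under stated assumptions rather than the code in get_tk_sample.py: the function name, the `(folder, job_id, is_expired)` tuples and the `non_expired_by_folder` lookup are all hypothetical.

```python
import random


def replace_expired(sample, non_expired_by_folder, seed=0):
    """Swap each expired advert for a non-expired one from the same folder.

    sample: list of (folder, job_id, is_expired) tuples (hypothetical layout).
    non_expired_by_folder: dict of folder -> list of non-expired job ids.
    Returns (kept, replacements), each a list of (folder, job_id).
    """
    rng = random.Random(seed)
    used_ids = {job_id for _, job_id, _ in sample}
    kept, replacements = [], []
    for folder, job_id, is_expired in sample:
        if not is_expired:
            kept.append((folder, job_id))
            continue
        # Only consider adverts that are not already in the sample.
        candidates = [j for j in non_expired_by_folder.get(folder, []) if j not in used_ids]
        if not candidates:
            continue  # nothing available to swap in for this folder
        new_id = rng.choice(candidates)
        used_ids.add(new_id)
        replacements.append((folder, new_id))
    return kept, replacements


# Toy usage with made-up folders and ids.
toy_sample = [("2020", "a", False), ("2020", "b", True), ("2021", "c", True)]
toy_pool = {"2020": ["a", "x"], "2021": ["y"]}
kept, replacements = replace_expired(toy_sample, toy_pool)
```

Because each replacement is drawn from the same folder as the advert it replaces, the per-folder counts (and so the date distribution) stay the same whenever a candidate is available. Only `replacements` would then need skill sentences predicted, with the results appended to those already computed for `kept`.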

Because of this problem, this PR changed from one rerunning the analysis to one diagnosing and fixing the bug.

Checklist: