## Re-doing analysis pieces

### Date issue

In rerunning the sample analysis, I found that the sample's distribution of dates looked unusual. Green bars:
I found this was because in `reduce_embeddings.py` we don't use repeated sentences. The first time a sentence is seen it is added to the data, but on later occurrences it is not, which skews the data towards the earlier years. A mapper is created in `skills_taxonomy_v2/pipeline/tk_data_analysis/get_duplicate_sentence_skills.py`, which, for the analysis pieces, lets us add back the extra job ids whose sentences had already been seen elsewhere. We can link these with the cluster that the repeats were given.
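The dedup-and-repair logic can be sketched as follows (illustrative only; the function and variable names are hypothetical, not the actual pipeline code):

```python
from collections import defaultdict


def build_duplicate_sentence_map(records):
    """Map each kept job id to the job ids of its repeated sentences.

    `records` is an iterable of (job_id, sentence) pairs. Deduplication
    keeps only the first job id per sentence, so later (often more
    recent) adverts are dropped; this mapper lets us add them back.
    """
    first_seen = {}                  # sentence -> job id kept after dedup
    duplicates = defaultdict(list)   # kept job id -> job ids of repeats
    for job_id, sentence in records:
        if sentence not in first_seen:
            first_seen[sentence] = job_id
        else:
            duplicates[first_seen[sentence]].append(job_id)
    return duplicates


def expand_clusters(cluster_per_job, duplicates):
    """Give each repeated job id the cluster its kept sentence was assigned."""
    expanded = dict(cluster_per_job)
    for kept_id, repeat_ids in duplicates.items():
        for rid in repeat_ids:
            expanded[rid] = cluster_per_job[kept_id]
    return expanded
```

With this, a job advert whose sentence was deduplicated away still contributes its (later) date to the analysis, via the cluster of the first occurrence.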
After doing this we find a better distribution of dates:
However, the post-COVID dates are still underrepresented. So in `skills_taxonomy_v2/pipeline/tk_data_analysis/get_no_texts_tk_data.py` we find all the job ids which are linked with no full-text data.
After investigation, we found that expired job adverts made up part of the sample. These don't have full-text fields and therefore yield no skill sentences. Since their job advert ids are often mentioned elsewhere in the data, they were still being linked with dates, so the sample looked representative. It was only when the skill sentences were analysed that it became clear there wasn't as much data from 2020/21.
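The check for adverts with no full text amounts to something like this (a minimal sketch with assumed field names `job_id` and `full_text`; the real data schema may differ):

```python
def job_ids_without_full_text(adverts):
    """Return ids of job adverts whose full-text field is missing or empty.

    `adverts` is an iterable of dicts. Expired adverts lack the
    full-text field, so they yield no skill sentences even though
    their ids (and hence dates) appear elsewhere in the data.
    """
    return {
        ad["job_id"]
        for ad in adverts
        if not ad.get("full_text")  # missing key or empty string
    }
```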
To mitigate this, we adapted `get_tk_sample.py` to replace the sampled job adverts from expired files with ones from the same folder (thus keeping the same date distribution) but from the non-expired file. This affects 14% of the sample. A nicer fix would be to resample the entire 5 million job adverts using only job adverts not in the expired files; however, since finding skill sentences and embeddings is a costly process and 86% of them are fine, we chose to correct the 14%. Thus, in `predict_sentence_class_inc_replacements.py` we predict the skill sentences from only the replaced job adverts and append them to a copy of the original 86% of job advert skill sentences.
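The replacement step can be sketched like this (hypothetical helper and argument names; the real `get_tk_sample.py` will differ). Drawing the replacement from the same folder preserves the date distribution, since files are organised by date:

```python
import random


def replace_expired_sample(sample, expired_ids, non_expired_by_folder,
                           folder_of, seed=42):
    """Swap expired job adverts for non-expired ones from the same folder.

    sample: list of sampled job ids.
    expired_ids: set of job ids that come from expired files.
    non_expired_by_folder: folder name -> job ids in its non-expired file.
    folder_of: job id -> folder name.
    All names here are assumptions for illustration.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_set = set(sample)
    replaced = []
    for job_id in sample:
        if job_id in expired_ids:
            # Replacement candidates: same folder, not already sampled
            candidates = [
                j for j in non_expired_by_folder[folder_of[job_id]]
                if j not in sample_set
            ]
            replaced.append(rng.choice(candidates))
        else:
            replaced.append(job_id)
    return replaced
```

Only the replaced ids then need the (costly) skill-sentence and embedding steps re-run; the other 86% are reused as-is.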
Because of this problem, this PR changed from one rerunning the analysis to one diagnosing and fixing the bug.
Checklist:

- [ ] I have refactored my code out from `notebooks/`
- [ ] I have checked the code runs
- [ ] I have tested the code
- [ ] I have run `pre-commit` and addressed any issues not automatically fixed
- [ ] I have merged any new changes from `dev`
- [ ] I have documented the code
  - [ ] Major functions have docstrings
  - [ ] Appropriate information has been added to `README`s
- [ ] I have explained the feature in this PR or (better) in `output/reports/`
Addresses #72