Adapting build_taxonomy.py for use with new skills.
New config: config/skills_taxonomy/2021.11.30.yaml
skills_taxonomy_v2/utils/2021.12.03_level_a_mapper_dict.json added to manually group level B skill groups into level A groups (it also holds the names of the level A groups).
pipeline/skills_extraction/skills_naming_embeddings.py script added to output the mean non-reduced embedding for each skill. Needed for the skills naming pipeline.
Slight refactor of build_taxonomy.py: data loading is moved into dedicated functions. The new data is in a slightly different format (because it is larger), so there are additional if statements to handle this.
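For reviewers, the idea behind the mean non-reduced embedding can be sketched as below. This is a minimal illustration and not the code in skills_naming_embeddings.py: the function name `mean_embedding_per_skill`, the input layout (one embedding and one skill id per sentence), and the toy values are all assumptions.

```python
from collections import defaultdict


def mean_embedding_per_skill(embeddings, skill_ids):
    """Average the full-dimensional sentence embeddings assigned to each skill.

    Hypothetical helper: assumes `embeddings` is a list of equal-length
    vectors and `skill_ids` gives the skill each sentence belongs to.
    """
    grouped = defaultdict(list)
    for vector, skill_id in zip(embeddings, skill_ids):
        grouped[skill_id].append(vector)
    # zip(*vectors) iterates over dimensions, so each entry is a per-dimension mean
    return {
        skill_id: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for skill_id, vectors in grouped.items()
    }


# Toy data: three sentences in two dimensions, two skills
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
skill_ids = [0, 0, 1]
skill_means = mean_embedding_per_skill(embeddings, skill_ids)
```

In practice the embeddings would most likely be held in an array library, but the grouping and averaging logic would be the same.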
New results:
Building hierarchy with 1,465,639 sentences and 6783 skills
Lowest level hierarchy has 250 sections
Mid level hierarchy has 68 sections
Top level hierarchy has 12 sections
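The top level comes from the manual level B to level A grouping. A rough sketch of how such a mapper could be applied is shown here. The dictionary layout (a level A id mapping to a name and a list of level B ids), the group names, and the helper `invert_mapper` are hypothetical; the real structure of the JSON file may differ.

```python
# Hypothetical layout for the level A mapper: level A id -> name and level B members
level_a_mapper = {
    "0": {"name": "Example group one", "level_b": [0, 1]},
    "1": {"name": "Example group two", "level_b": [2]},
}


def invert_mapper(mapper):
    """Build a level B id -> level A id lookup from the manual grouping."""
    level_b_to_a = {}
    for level_a_id, group in mapper.items():
        for level_b_id in group["level_b"]:
            # Each level B group should belong to exactly one level A group
            if level_b_id in level_b_to_a:
                raise ValueError(f"Level B group {level_b_id} is assigned twice")
            level_b_to_a[level_b_id] = level_a_id
    return level_b_to_a


level_b_to_a = invert_mapper(level_a_mapper)
```

Raising on a duplicate assignment is a design choice for the sketch: a manual mapping is easy to get wrong, and failing early is cheaper than debugging a skewed hierarchy.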
New names:
There are 3090 unique skill names. 789 names have more than one skill attached.
Adding the level C name as a suffix, plus a number where needed, gives every skill a unique name.
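One way to picture the suffixing is the sketch below. It is an illustration under assumptions, not the naming code in this PR: the function `make_unique_names`, the suffix format, and the choice to add suffixes only to duplicated names are all hypothetical.

```python
from collections import Counter


def make_unique_names(skills):
    """Return one unique name per skill.

    `skills` is a list of (skill_name, level_c_name) tuples, one per skill.
    """
    name_counts = Counter(name for name, _ in skills)
    # Step 1: add the level C name to any skill name shared by several skills
    candidates = [
        f"{name} ({level_c})" if name_counts[name] > 1 else name
        for name, level_c in skills
    ]
    # Step 2: number any candidates that are still duplicated
    candidate_counts = Counter(candidates)
    seen = Counter()
    unique_names = []
    for candidate in candidates:
        if candidate_counts[candidate] > 1:
            seen[candidate] += 1
            unique_names.append(f"{candidate} {seen[candidate]}")
        else:
            unique_names.append(candidate)
    return unique_names


# Toy example with made-up skill and level C names
skills = [
    ("python", "software"),
    ("python", "data"),
    ("excel", "office"),
    ("python", "data"),
]
unique_names = make_unique_names(skills)
```

The sketch does not guard against a numbered name colliding with an existing name that already ends in a number, so a final uniqueness check would still be worthwhile.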
Checklist:
[ ] I have refactored my code out from notebooks/
[ ] I have checked the code runs
[ ] I have tested the code
[ ] I have run pre-commit and addressed any issues not automatically fixed
[ ] I have merged any new changes from dev
[ ] I have documented the code
[ ] Major functions have docstrings
[ ] Appropriate information has been added to READMEs
[ ] I have explained the feature in this PR or (better) in output/reports/
Addressing #71