nestauk / ojd_daps_skills

Nesta's Skills Extractor Library
https://nestauk.github.io/ojd_daps_skills
MIT License
123 stars 20 forks

Find allowed camelcase words #25

Closed lizgzil closed 2 years ago

lizgzil commented 2 years ago

As part of the text cleaning, we remove camel-case words, since these usually come from parsing errors, e.g. "the jobPlease sign". However, some camel-case words are legitimate and should be kept.

The list I've found so far is ['JavaScript', 'WordPress', 'PowerPoint', 'CloudFormation', 'CommVault', 'InDesign', 'GitHub', 'GitLab', 'DevOps', 'QuickBooks'], but it should be updated as we find more legitimate camel-case words during labelling.
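A minimal sketch of how such an allow-list could be applied during cleaning. This is illustrative only: `ALLOWED_CAMEL_CASES` and `split_camel_case` are hypothetical names, not part of the library, and the sketch assumes the cleaning step separates the fused words with a space, which may differ from what the library actually does.

```python
import re

# Hypothetical allow-list, seeded with the words listed in this issue.
ALLOWED_CAMEL_CASES = {
    "JavaScript", "WordPress", "PowerPoint", "CloudFormation", "CommVault",
    "InDesign", "GitHub", "GitLab", "DevOps", "QuickBooks",
}

# Lowercase letter, then uppercase letter, then lowercase letter.
MISSING_SPACE_PATTERN = re.compile(r"([a-z])([A-Z])([a-z])")


def split_camel_case(text, allowed=ALLOWED_CAMEL_CASES):
    """Insert a space at camel-case boundaries, skipping allowed words.

    Note: splitting on whitespace and re-joining collapses repeated spaces.
    """
    cleaned = []
    for token in text.split():
        if any(word in token for word in allowed):
            # Token contains a known legitimate camel-case word: keep it.
            cleaned.append(token)
        else:
            # Assumed behaviour: treat the boundary as a missing space.
            cleaned.append(MISSING_SPACE_PATTERN.sub(r"\1 \2\3", token))
    return " ".join(cleaned)


print(split_camel_case("the jobPlease sign"))
print(split_camel_case("experience with JavaScript and GitHub"))
```

One limitation of this sketch: a token that fuses an allowed word with another word (e.g. an allowed name run straight into the next sentence) is kept unsplit, so it would need extra handling.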

I found this list by outputting every labelled skill span that contains a camel case:

import re

from ojd_daps_skills import bucket_name
from ojd_daps_skills.getters.data_getters import (
    get_s3_data_paths,
    get_s3_resource,
    load_s3_json,
)

s3 = get_s3_resource()

labelled_data_s3_folder = "escoe_extension/outputs/skill_span_labels/"

# List every labelled file, dropping the entry for the folder itself
file_names = get_s3_data_paths(s3, bucket_name, labelled_data_s3_folder, "*")
file_names.remove(labelled_data_s3_folder)

# Collect the raw text of every labelled span
all_raw_entities = []
for file_name in file_names:
    job_advert_labels = load_s3_json(s3, bucket_name, file_name)
    for ent in job_advert_labels["result"]:
        all_raw_entities.append(ent["value"]["text"])

# A lowercase letter followed by an uppercase and then a lowercase letter
compiled_missing_space_pattern = re.compile(r"([a-z])([A-Z])([a-z])")

# Keep the spans that contain at least one camel-case boundary
camel_cases = [
    text for text in all_raw_entities if compiled_missing_space_pattern.search(text)
]

lizgzil commented 2 years ago

New data gives the list: ['JavaScript', 'PowerPoint', 'DevOps', 'TypeScript', 'WordPress', 'CloudFormation', 'CommVault', 'InDesign', 'GitHub', 'GitLab', 'XenDesktop', 'DevSecOps', 'QuickBooks', 'CircleCi', 'LeDeR', 'CeMap', 'MavenAutomation']

Labelled spans containing camel cases in the newest labels, with counts:

[('JavaScript', 8), ('PowerPoint', 3), ('DevOps', 2), ('TypeScript', 2), (' WordPress Development ', 1), ('bespoke WordPress Development ', 1), (' JavaScript', 1), ('WordPress setup ', 1), ('Vanilla JavaScript', 1), ('OO JavaScript expert', 1), ('all-round JavaScript understanding', 1), ('CloudFormation', 1), ('CommVault backup product', 1), ('Experience of Microsoft Word, Excel and PowerPoint', 1), ('InDesign', 1), ('GitHub', 1), ('GitLab', 1), ('exposing automated services towards DevOps teams', 1), ('do the necessary steps to assist those DevOps teams in consuming those services', 1), ('AzureProven', 1), ('XenDesktop', 1), ('DevSecOps', 1), ('Goods in/OutLocation checks in the warehouse', 1), ('QuickBooks', 1), ('knowledge on systems such as; PACPaxtonTexecomGalaxyHikvisionAvigilonLenelIP Systems', 1), ('CircleCi', 1), ('worked in HedgeFunds/ Quantitative Trading platforms', 1), ('maintaining an up-to-date list of local LeDeR Reviewers', 1), ('CeMap qualified', 1), ('experience in Word, Excel and PowerPoint', 1), ('MavenAutomation', 1)]
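The spans above mix legitimate names with parsing errors such as 'AzureProven', so the output still needs a manual pass. As a rough sketch of how the spans could be reduced to candidate tokens and counted, assuming whitespace tokenisation is good enough (`camel_case_tokens` is a hypothetical helper, and the example spans are copied from the output above):

```python
import re
from collections import Counter

# Same lowercase-uppercase-lowercase test as in the first comment.
CAMEL_CASE_PATTERN = re.compile(r"[a-z][A-Z][a-z]")


def camel_case_tokens(spans):
    """Count the individual camel-case tokens found across labelled spans."""
    counts = Counter()
    for span in spans:
        for token in span.split():
            if CAMEL_CASE_PATTERN.search(token):
                # Drop surrounding punctuation so "PowerPoint," == "PowerPoint".
                counts[token.strip(".,;:()")] += 1
    return counts


example_spans = [
    "JavaScript",
    "Vanilla JavaScript",
    "CommVault backup product",
    "Experience of Microsoft Word, Excel and PowerPoint",
    "PowerPoint",
    "AzureProven",
]
token_counts = camel_case_tokens(example_spans)
print(token_counts.most_common())
```

Candidates such as 'AzureProven' are still counted here; deciding which tokens belong in the allowed list remains a manual step.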