nestauk / ojd_daps_skills

Nesta's Skills Extractor Library
https://nestauk.github.io/ojd_daps_skills/

Improve mapping #196

Open lizgzil opened 1 year ago

lizgzil commented 1 year ago

Some avenues to improve our mapping algorithm

lizgzil commented 1 year ago

I had a little go at experimenting with ReFinED. I'll write up my results here.

Using ReFinED

pip install https://github.com/amazon-science/ReFinED/archive/refs/tags/V1.zip

from refined.inference.processor import Refined
refined = Refined.from_pretrained(model_name='wikipedia_model_with_numbers',
                                  entity_set="wikipedia")

esco_skill_batch = ['manage musical staff', 'supervise correctional procedures',
       'apply anti-oppressive practices',
       'control compliance of railway vehicles regulations',
       'identify available services', 'perform toxicological studies',
       'ensure coquille uniformity', 'Haskell',
       'apply diplomatic principles', 'lead police investigations']
spans_batch = refined.process_text_batch(esco_skill_batch)
# Each returned document exposes its detected spans via `.spans`.
for esco_skill, doc in zip(esco_skill_batch, spans_batch):
    for span in doc.spans or []:
        print(esco_skill)
        print(span)
        # Only print link details when the span was linked to a Wikidata entity.
        if span.candidate_entities and span.predicted_entity.wikidata_entity_id:
            print((
                span.predicted_entity.wikidata_entity_id,
                span.predicted_entity.wikipedia_entity_title,
                span.entity_linking_model_confidence_score,
            ))

This gives:

manage musical staff
['manage', Entity(wikidata_entity_id=Q1320883, wikipedia_entity_title=Talent manager), None]
('Q1320883', 'Talent manager', 0.2776)
control compliance of railway vehicles regulations
['control', None, 'DATE']
ensure coquille uniformity
['coquille', Entity(wikidata_entity_id=Q1778928, wikipedia_entity_title=Permanent mold casting), None]
('Q1778928', 'Permanent mold casting', 0.2473)
Haskell
['Haskell', Entity not linked to a knowledge base, None]

So of the 10 input ESCO skills, only 2 actually linked to a Wikipedia entity, and for these the linking confidence scores are pretty low.

ESCO skills to wiki

First I mapped all the ESCO skill names to wikipedia entries.
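
A minimal sketch of how that mapping could be collected into rows, assuming the span attributes used in the snippet above (`predicted_entity.wikidata_entity_id` and `entity_linking_model_confidence_score`). The helper name `collect_links` is hypothetical, and the stand-in objects only mimic ReFinED's output so the example runs without loading the model:

```python
from types import SimpleNamespace


def collect_links(texts, docs):
    """Return one (text, wikidata_id, score) row per linked span.

    `docs` is assumed to be what refined.process_text_batch(texts) returned:
    one object per input text, each with a `.spans` list.
    """
    rows = []
    for text, doc in zip(texts, docs):
        for span in doc.spans or []:
            qid = getattr(span.predicted_entity, "wikidata_entity_id", None)
            if qid:  # skip spans that were not linked to the knowledge base
                rows.append((text, qid, span.entity_linking_model_confidence_score))
    return rows


def _fake_span(qid, score):
    # Stand-in for a ReFinED span (hypothetical, for illustration only).
    return SimpleNamespace(
        predicted_entity=SimpleNamespace(wikidata_entity_id=qid),
        entity_linking_model_confidence_score=score,
    )


# Mimics two of the outputs shown above: one linked span, one unlinked span.
texts = ["manage musical staff", "Haskell"]
docs = [
    SimpleNamespace(spans=[_fake_span("Q1320883", 0.2776)]),
    SimpleNamespace(spans=[_fake_span(None, None)]),
]
links = collect_links(texts, docs)
print(links)
```

With the real model, `docs` would be the result of `refined.process_text_batch(texts)` instead of the stand-ins.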

This is the distribution of linking confidence scores for these 12848 links:

Screenshot 2023-08-09 at 17 55 20

Some of the links when the score was < 0.9: Screenshot 2023-08-09 at 18 01 12

Some of the links when the score was > 0.9: Screenshot 2023-08-09 at 18 01 51

Linking a sample of job advert predicted entities to wiki

Using a sample of 1000 job adverts (from the mixed sample):

This is the distribution of linking confidence scores for these 2338 links: Screenshot 2023-08-09 at 17 56 20

Some of the links when the score was < 0.9: Screenshot 2023-08-09 at 18 02 54

Some of the links when the score was > 0.9: Screenshot 2023-08-09 at 18 02 39

Merging

I then mapped the predicted entities to ESCO skills via the linked wiki ids.

e.g. if an entity linked to the wiki id 'Q219416' with >0.9 confidence, then I would find which ESCO skills mapped to 'Q219416' with >0.9 confidence, and use these as the output.

This actually only yielded 295 entities which could be mapped to ESCO skills this way.
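
The merge step could look roughly like the sketch below. It assumes both link tables are lists of `(text, wikidata_id, score)` rows; the function name `map_entities_to_esco`, the placeholder ids `Q_A`/`Q_B` and all scores are invented for illustration and are not results from the experiment:

```python
from collections import defaultdict

THRESHOLD = 0.9  # confidence cut-off mentioned above


def map_entities_to_esco(entity_links, esco_links, threshold=THRESHOLD):
    """Join extracted entities to ESCO skills through shared Wikidata ids."""
    # Index the confident ESCO links by Wikidata id.
    esco_by_qid = defaultdict(set)
    for esco_skill, qid, score in esco_links:
        if score > threshold:
            esco_by_qid[qid].add(esco_skill)

    # Keep only entities whose confident link hits an indexed id.
    mapped = {}
    for entity, qid, score in entity_links:
        if score > threshold and qid in esco_by_qid:
            mapped.setdefault(entity, set()).update(esco_by_qid[qid])
    return mapped


# Invented rows: placeholder ids and scores, for illustration only.
esco_links = [
    ("customer service", "Q_A", 0.97),
    ("provide customer care", "Q_A", 0.92),
    ("Haskell", "Q_B", 0.50),  # below the threshold, so never used
]
entity_links = [
    ("providing excellent customer service", "Q_A", 0.95),
    ("Haskell experience", "Q_B", 0.99),  # confident, but no confident ESCO link
]
mapped = map_entities_to_esco(entity_links, esco_links)
print(mapped)
```

Because both sides must clear the threshold on the same id, the yield shrinks quickly, which is consistent with only a small number of entities being mappable this way.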

Some examples:

Screenshot 2023-08-09 at 18 06 36

Comparison with the original mapping method

I compared these results to the original way we mapped (using semantic similarity).

For 45% of the entities, there was some overlap between the ESCO skills they were mapped to by the two methods.

e.g.

The extracted entity "providing excellent customer service" was mapped to the ESCO skill "provide outstanding customer service" (7e5786f8-1174-4f75-97e3-cfecfd95d797) using our original method; via Wikipedia entity linking it was matched to several:

['customer service',
  'maintain customer service',
  'customer care',
  'provide customer care',
  'provide outstanding customer service',
  'provide training in customer service techniques',
  'provide training in approaches to customer service',
  'provide training in customer service methods',
  'pursue the highest possible quality of customer service',
  'work to achieve the highest possible level of customer service',
  'act with the goal of providing the highest possible level of customer service',
  'undertake communication with customer service department',
  'work in communication with customer services',
  'correspond with customer services',
  'provide excellence in customer service']

which had the unique IDs:

['15a33d76-4640-438d-ae64-fdc0c1d3eebc',
  '75dfe1ee-5935-42ce-b820-697f827825c3',
  '704fda1b-cd0a-40fe-99fc-0a24250a2010',
  '7e5786f8-1174-4f75-97e3-cfecfd95d797',
  '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd',
  'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca',
  'b215031a-dd21-48b1-a998-75d6373838d8',
  'e782f412-4cb5-45f1-b5bc-15be441171aa']

(which as you can see includes 7e5786f8-1174-4f75-97e3-cfecfd95d797)
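
The thread does not spell out how the overlap percentage was computed, so the following is only one plausible reading: count an entity as overlapping when the ESCO id from the original method appears among the ids found via the wiki route. The function name `share_with_overlap` is hypothetical; the two example rows reuse ids quoted elsewhere in this issue:

```python
def share_with_overlap(orig_ids, wiki_ids):
    """Fraction of entities whose original ESCO id is also in the wiki-mapped set.

    orig_ids: dict of entity -> ESCO id (or code) from the original method
    wiki_ids: dict of entity -> set of ESCO ids from the wiki method
    Only entities present in both dicts are compared.
    """
    common = [entity for entity in wiki_ids if entity in orig_ids]
    if not common:
        return 0.0
    hits = sum(1 for entity in common if orig_ids[entity] in wiki_ids[entity])
    return hits / len(common)


orig_ids = {
    "providing excellent customer service": "7e5786f8-1174-4f75-97e3-cfecfd95d797",
    "utilisation of project management aligned to the agreed delivery strategy": "K0413",
}
wiki_ids = {
    "providing excellent customer service": {
        "15a33d76-4640-438d-ae64-fdc0c1d3eebc",
        "7e5786f8-1174-4f75-97e3-cfecfd95d797",
    },
    "utilisation of project management aligned to the agreed delivery strategy": {
        "7111b95d-0ce3-441a-9d92-4c75d05c4388",
    },
}
share = share_with_overlap(orig_ids, wiki_ids)
print(share)
```

Note that an original match at a coarse level such as `K0413` can never overlap with skill-level ids under this definition, which would push the figure down.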

When the original way didn't find a match

There were 555 entities which couldn't be mapped to ESCO using the original method (this is when the match is at the least granular level e.g. S1). Of these, only 2 had matches via the wiki method. These were:

Entity: 'Change delivery Project management Business management Stakeholder management Line Management'
ESCO match the original way: 'management skills', 'S4'
ESCO matches the wiki way: ['imprinting visionary aspirations into the business management', 'incorporate visionary aspirations into the business management'], '272fddbb-917a-4720-8903-85ce51e1cbe5'

and

Entity: 'Fluent in written and spoken English Application deadline'
ESCO match the original way: 'self-management skills and competences', 'T3'
ESCO matches the wiki way: ['interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'], ['0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be']

Entity length

Here are 10 random entities which were over 60 characters in length:


skill | label | orig_mapped_esco_skill | orig_mapped_esco_id | wiki_mapped_esco_skills | unique wiki_mapped_esco_ids | do I think the wiki map is better?
-- | -- | -- | -- | -- | -- | --
relevant degree and relevant management experience or equivalent competency gained | EXPERIENCE | management skills | S4.0.0 | ['improve transportation processes through application of management concepts', 'improve transportation processes through application of management principles'] | ['7afd29d7-9c9f-4151-83f2-562b8c94a3af'] | no
Transfer all vacant and all-inclusive flats over to a green energy supplier | MULTISKILL | management and administration | K0413 | ['promoting sustainable energy', 'encouraging use of sustainable energy'] | ['1a6c7e0d-fc13-41d7-a5c0-8ca00606de89'] | yes
ability to interpret Mechanical drawings Keywords design, estimation, tender, AutoCAD, water, drainage, rainwater, wastewater, Greenford, London | MULTISKILL | interpreting technical documentation and diagrams | S2.1.3 | ['create AutoCAD drawings', 'creation of AutoCAD drawings', 'creating AutoCAD drawings', 'make AutoCAD drawings', 'making of AutoCAD drawings', 'making AutoCAD drawings', 'AutoCAD drawings creation', 'AutoCAD drawing creation', 'drawing with AutoCAD', 'AutoCAD drawing'] | ['76415d7f-0fde-4364-b45f-5c044580d2aa'] | no
Perform market research to establish target accounts and contacts | MULTISKILL | performing market research | fe39d4db-4cb5-4299-bb9f-896c8fd6ab13 | ['market research', 'market research performance', 'implement market research'] | ['b011c8b4-76e1-4bbc-8bb9-1d205e7b618a', 'fe39d4db-4cb5-4299-bb9f-896c8fd6ab13'] | same
2+ years of proven track record in account relationship management or customer service | EXPERIENCE | analysing and evaluating information and data | S2.7.0 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes
3-5 years of relevant experience in the planning and management of social development activities | EXPERIENCE | technical or academic writing | S1.13.3 | ['support social change'] | ['644209ac-8452-4e81-959a-2b10050023cc'] | yes
Ensure all technical and design information complies with Clients requirements, current Building Regulations | MULTISKILL | integrate building requirements of clients in the architecture designs | bd2102ea-c8d9-40f6-8327-211450120e96 | ['building standards'] | ['615cfc39-797f-4229-8e92-159fcf8f3030'] | no
utilisation of project management aligned to the agreed delivery strategy | SKILL | management and administration | K0413 | ['principles of project management'] | ['7111b95d-0ce3-441a-9d92-4c75d05c4388'] | yes
Responding to queries via the customer service department received via telephone | SKILL | providing information to the public and clients | S3.4.1 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes
Fluent in both German and English with exceptional verbal and written communication skill | MULTISKILL | be fluent in German | 2abb9db5-350c-444c-8292-0e0b2ce00f9a | ['understand spoken German', 'interact verbally in German', 'understand written German', 'understanding spoken German', 'comprehend spoken German', 'listen to German', 'make sense of spoken German', 'communicate verbally in German', 'verbally interact in German', 'interacting verbally in German', 'be fluent in German', 'comprehend written German', 'understanding written German', 'make sense of written German', 'correspond in written German', 'show competency in written German', 'interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'] | ['1d5526b3-f17b-46fc-ba7d-f4a32d908a7e', '2abb9db5-350c-444c-8292-0e0b2ce00f9a', '486e4f39-e968-41f4-955e-56e9eba96ef5', '52894650-9077-40f0-96d6-6f07d1a6cafa', '0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be'] | yes

So 6/10 were better (but not necessarily perfect), 1 was the same, and 3 were worse.

Thoughts

india-kerle commented 1 year ago

wow this is so interesting!

On filtering wiki entries, I'm not sure exactly how ReFinED works, but when we used Wikidata to extract entities from patents/abstracts, there was a way to filter for relevant entities (although the number of pages that had appropriate tags to filter with was very low)

india-kerle commented 1 year ago

It does feel like we could be adding more complexity - I think treating it as an entity disambiguation problem is still an interesting idea though. Could we treat ESCO as a knowledge base and train our own entity linker to match the extracted skill to an ESCO skill?

I semi looked into spacy's entity linker as part of a personal project familiarising myself with prodigy.

india-kerle commented 1 year ago

re: Big refactor/make the code more clear and mapping to multiple skills (and ignoring entity disambiguation) - we might want to explore the world of vector dbs/vector search because i think they have a lot of this baked in (i.e. surfacing similar data + speed)

india-kerle commented 1 year ago

ok truuuuly live spitballing - what would doing both look like? Is it overkill?

  1. extracted skill -> vectorise -> use faiss/elastic search to identify the top N ESCO skills based on semantic similarity/k nearest neighbours -> label which of the top N is the most appropriate and use that data to train an entity linker to disambiguate the skill

lizgzil commented 1 year ago

> ok truuuuly live spitballing - what would doing both look like? Is it overkill?
>
>   1. extracted skill -> vectorise -> use faiss/elastic search to identify the top N ESCO skills based on semantic similarity/k nearest neighbours -> label which of the top N is the most appropriate and use that data to train an entity linker to disambiguate the skill

oo I like it! It does make far more sense to train our own if it works ok. Once we train the EL model then I guess we don't need to apply all the vectorisation+faiss/elastic steps anymore, so is there a benefit to implementing these to just create the training data? (i.e. maybe our current mappings will do?) I can have a little look at spacy EL too.
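
For reference, the candidate-generation step (top N ESCO skills by semantic similarity) can be sketched without any vector database. This is a brute-force stand-in for what faiss or Elasticsearch would do at scale; the function names are hypothetical and the 3-dimensional vectors are invented toy values, not real embeddings:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_n_candidates(query_vec, esco_vecs, n=3):
    """Return the n ESCO skill names most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in esco_vecs.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:n]]


# Toy vectors standing in for sentence embeddings (invented values).
esco_vecs = {
    "provide outstanding customer service": [0.9, 0.1, 0.0],
    "customer service": [0.8, 0.2, 0.1],
    "be fluent in German": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.1, 0.0]  # pretend embedding of an extracted skill
top = top_n_candidates(query_vec, esco_vecs, n=2)
print(top)
```

The returned candidates are what a labeller would choose between when building entity-linker training data.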

india-kerle commented 1 year ago

I think if we go down the EL route, that's a good point - i don't really think there's additional benefit to implementing vector dbs beyond creating training data. I wonder how we can quickly assess that approach? what about training an EL model on an engineered training set with skills that our current approach consistently does not match very well on?

india-kerle commented 1 year ago

I've already written a custom entity linker recipe in prodigy as part of a personal project so probably could get the labelling side of things up and running relatively quickly -- https://github.com/india-kerle/viclit_food_linker/blob/main/src/cake_recipe.py

lizgzil commented 1 year ago

oo amazing! I was thinking we'd be training on our existing labelled data (rather than creating more)? We'd have to reconfigure the data into the correct format though which might be tricky (but surely easier than labelling more?). But - I'm not sure - do you think we'd need to relabel?

I've been following this and the training data is in the form

{"text":"Interestingly, Emerson is one of only five tennis players all-time to win multiple slam sets in two disciplines, only matched by Frank Sedgman, Margaret Court, Martina Navratilova and Serena Williams.","_input_hash":2024197919,"_task_hash":-1926469210,"spans":[{"start":15,"end":22,"text":"Emerson","rank":0,"label":"ORG","score":1,"source":"en_core_web_lg","input_hash":2024197919}],"meta":{"score":1},"options":[{"id":"Q48226","html":"<a href='https://www.wikidata.org/wiki/Q48226'>Q48226: American philosopher, essayist, and poet</a>"},{"id":"Q215952","html":"<a href='https://www.wikidata.org/wiki/Q215952'>Q215952: Brazilian footballer</a>"},{"id":"Q312545","html":"<a href='https://www.wikidata.org/wiki/Q312545'>Q312545: Australian tennis player</a>"},{"id":"NIL_otherLink","text":"Link not in options"},{"id":"NIL_ambiguous","text":"Need more context"}],"_session_id":null,"_view_id":"choice","accept":["Q312545"],"answer":"accept"}
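
A rough sketch of how an existing labelled entity could be reshaped into a task of that form, with ESCO ids in place of Wikidata ids. It only fills the fields visible in the example above (`text`, `spans`, `options`, `accept`, `answer`); the helper name `make_el_task` and the example sentence are invented, and hashes, `meta` and session fields are left out on the assumption that the annotation tool adds them:

```python
import json


def make_el_task(text, entity, label, candidates, accepted_id=None):
    """Build one choice-style entity-linking task as a plain dict.

    candidates: list of (esco_id, esco_label) pairs to offer as options.
    """
    start = text.index(entity)  # raises ValueError if the entity is not in the text
    options = [{"id": esco_id, "text": name} for esco_id, name in candidates]
    # Same two fallback options as in the example task above.
    options += [
        {"id": "NIL_otherLink", "text": "Link not in options"},
        {"id": "NIL_ambiguous", "text": "Need more context"},
    ]
    task = {
        "text": text,
        "spans": [{"start": start, "end": start + len(entity),
                   "text": entity, "label": label}],
        "options": options,
    }
    if accepted_id is not None:  # only set when we already have a label
        task["accept"] = [accepted_id]
        task["answer"] = "accept"
    return task


text = "Experience providing excellent customer service is essential."  # invented sentence
entity = "providing excellent customer service"
candidates = [
    ("7e5786f8-1174-4f75-97e3-cfecfd95d797", "provide outstanding customer service"),
    ("7111b95d-0ce3-441a-9d92-4c75d05c4388", "principles of project management"),
]
task = make_el_task(text, entity, "SKILL", candidates,
                    accepted_id="7e5786f8-1174-4f75-97e3-cfecfd95d797")
print(json.dumps(task))
```

Writing one such JSON object per line would give a JSONL file in the same shape as the example.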