Open lizgzil opened 1 year ago
I had a little go at experimenting with ReFinED. I'll write up my results here.
pip install https://github.com/amazon-science/ReFinED/archive/refs/tags/V1.zip
from refined.inference.processor import Refined

# Load the pretrained Wikipedia entity-linking model
refined = Refined.from_pretrained(model_name='wikipedia_model_with_numbers',
                                  entity_set="wikipedia")

esco_skill_batch = ['manage musical staff', 'supervise correctional procedures',
                    'apply anti-oppressive practices',
                    'control compliance of railway vehicles regulations',
                    'identify available services', 'perform toxicological studies',
                    'ensure coquille uniformity', 'Haskell',
                    'apply diplomatic principles', 'lead police investigations']

# One result per input string; each result exposes its detected spans
spans_batch = refined.process_text_batch(esco_skill_batch)

for esco_skill, doc in zip(esco_skill_batch, spans_batch):
    for span in doc.spans:
        print(esco_skill)
        print(span)
        # Only print link details when the span resolved to a Wikidata ID
        if (span.candidate_entities) and (span.predicted_entity.wikidata_entity_id):
            print((
                span.predicted_entity.wikidata_entity_id,
                span.predicted_entity.wikipedia_entity_title,
                span.entity_linking_model_confidence_score
            ))
This gives:
manage musical staff
['manage', Entity(wikidata_entity_id=Q1320883, wikipedia_entity_title=Talent manager), None]
('Q1320883', 'Talent manager', 0.2776)
control compliance of railway vehicles regulations
['control', None, 'DATE']
ensure coquille uniformity
['coquille', Entity(wikidata_entity_id=Q1778928, wikipedia_entity_title=Permanent mold casting), None]
('Q1778928', 'Permanent mold casting', 0.2473)
Haskell
['Haskell', Entity not linked to a knowledge base, None]
So of the 10 input ESCO skills, only 2 were actually linked to a Wikipedia entity, and both linking confidence scores are low (below 0.3).
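To make that count reproducible, here is a minimal sketch that tallies how many inputs got at least one linked span. It only assumes the attribute names used in the snippet above (`spans`, `predicted_entity.wikidata_entity_id`, `entity_linking_model_confidence_score`); the `fake_doc` and `fake_span` helpers are hypothetical stand-ins so the example runs without loading the model, and the five mock inputs only loosely mirror the printed output.

```python
from types import SimpleNamespace


def count_linked(docs, min_score=0.0):
    """Count docs that have at least one span linked to a Wikidata ID."""
    linked = 0
    for doc in docs:
        for span in doc.spans:
            qid = getattr(span.predicted_entity, "wikidata_entity_id", None)
            score = span.entity_linking_model_confidence_score
            if qid and score is not None and score >= min_score:
                linked += 1
                break  # one linked span is enough for this input
    return linked


# Hypothetical stand-ins for the objects returned by process_text_batch
def fake_span(qid, score):
    return SimpleNamespace(
        predicted_entity=SimpleNamespace(wikidata_entity_id=qid),
        entity_linking_model_confidence_score=score,
    )


def fake_doc(*spans):
    return SimpleNamespace(spans=list(spans))


mock_batch = [
    fake_doc(fake_span("Q1320883", 0.2776)),  # 'manage' -> Talent manager
    fake_doc(fake_span(None, None)),          # span found, no entity
    fake_doc(fake_span("Q1778928", 0.2473)),  # 'coquille' -> Permanent mold casting
    fake_doc(fake_span(None, None)),          # span found, not linked
    fake_doc(),                               # no spans at all
]

print(count_linked(mock_batch))                 # 2
print(count_linked(mock_batch, min_score=0.9))  # 0
```

With the real model, passing `spans_batch` in place of `mock_batch` should give the same kind of count.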
First I mapped all the ESCO skill names to Wikipedia entries.
This is the distribution of linking confidence scores for these 12848 links:
Some of the links when the score was < 0.9:
Some of the links when the score was > 0.9:
Using a sample of 1000 job adverts (from the mixed sample):
This is the distribution of linking confidence scores for these 2338 links:
Some of the links when the score was < 0.9:
Some of the links when the score was > 0.9:
I then mapped the predicted entities to ESCO skills via the linked wiki IDs.
e.g. if an entity linked to the wiki ID 'Q219416' with >0.9 confidence, I would find which ESCO skills mapped to 'Q219416' with >0.9 confidence and use these as the output.
This actually only yielded 295 entities which could be mapped to ESCO skills this way.
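A rough sketch of that two-sided lookup, assuming the links are available as plain tuples. The tuple layouts, the function names, and the `esco-*` IDs are all hypothetical; only 'Q219416' and the 0.9 threshold come from the description above.

```python
from collections import defaultdict

THRESHOLD = 0.9


def build_wiki_to_esco(esco_links, threshold=THRESHOLD):
    """Index ESCO skills by Wikidata ID, keeping only confident links.

    esco_links: iterable of (esco_id, wikidata_id, score) tuples.
    """
    index = defaultdict(set)
    for esco_id, wikidata_id, score in esco_links:
        if score > threshold:
            index[wikidata_id].add(esco_id)
    return index


def map_entity(entity_links, wiki_to_esco, threshold=THRESHOLD):
    """Return ESCO IDs reachable from one entity's confident wiki links.

    entity_links: iterable of (wikidata_id, score) tuples for the entity.
    """
    matched = set()
    for wikidata_id, score in entity_links:
        if score > threshold:
            matched |= wiki_to_esco.get(wikidata_id, set())
    return matched


# Made-up ESCO IDs and scores, for illustration only
esco_links = [
    ("esco-a", "Q219416", 0.95),
    ("esco-b", "Q219416", 0.97),
    ("esco-c", "Q219416", 0.50),  # dropped: below threshold
    ("esco-d", "Q1", 0.99),
]
wiki_to_esco = build_wiki_to_esco(esco_links)

print(sorted(map_entity([("Q219416", 0.93)], wiki_to_esco)))  # ['esco-a', 'esco-b']
print(sorted(map_entity([("Q219416", 0.60)], wiki_to_esco)))  # []
```

Because both sides have to clear the threshold, an entity is dropped whenever either its own link or every ESCO link to the same wiki ID is low confidence, which may partly explain the small yield.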
Some examples:
I compared these results to the original way we mapped (using semantic similarity).
For 45% of the entities, there was some overlap between the ESCO skills found by the two methods.
e.g.
the extracted entity "providing excellent customer service" was mapped to ESCO skill "provide outstanding customer service" (7e5786f8-1174-4f75-97e3-cfecfd95d797) using our original method, and via wikipedia entity linking it got matched to several:
['customer service',
'maintain customer service',
'customer care',
'provide customer care',
'provide outstanding customer service',
'provide training in customer service techniques',
'provide training in approaches to customer service',
'provide training in customer service methods',
'pursue the highest possible quality of customer service',
'work to achieve the highest possible level of customer service',
'act with the goal of providing the highest possible level of customer service',
'undertake communication with customer service department',
'work in communication with customer services',
'correspond with customer services',
'provide excellence in customer service']
which had the unique IDs:
['15a33d76-4640-438d-ae64-fdc0c1d3eebc',
'75dfe1ee-5935-42ce-b820-697f827825c3',
'704fda1b-cd0a-40fe-99fc-0a24250a2010',
'7e5786f8-1174-4f75-97e3-cfecfd95d797',
'8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd',
'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca',
'b215031a-dd21-48b1-a998-75d6373838d8',
'e782f412-4cb5-45f1-b5bc-15be441171aa']
(which as you can see includes 7e5786f8-1174-4f75-97e3-cfecfd95d797)
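One way such an overlap figure could be computed is sketched below. It assumes "overlap" means the originally mapped ESCO ID appears among the wiki-mapped ESCO IDs, and that the denominator is the set of entities mapped by both methods; the function name and dict layout are hypothetical. The two sample entities reuse IDs that appear in this write-up.

```python
def overlap_rate(orig_mapping, wiki_mapping):
    """Share of entities whose original ESCO ID is among the wiki-mapped IDs.

    orig_mapping: dict of entity text -> single ESCO ID (original method).
    wiki_mapping: dict of entity text -> set of ESCO IDs (wiki method).
    Only entities present in both dicts are counted.
    """
    shared = [entity for entity in orig_mapping if entity in wiki_mapping]
    if not shared:
        return 0.0
    hits = sum(1 for entity in shared if orig_mapping[entity] in wiki_mapping[entity])
    return hits / len(shared)


orig_mapping = {
    "providing excellent customer service": "7e5786f8-1174-4f75-97e3-cfecfd95d797",
    "Ensure all technical and design information complies with Clients requirements, current Building Regulations":
        "bd2102ea-c8d9-40f6-8327-211450120e96",
}
wiki_mapping = {
    "providing excellent customer service": {
        "15a33d76-4640-438d-ae64-fdc0c1d3eebc",
        "75dfe1ee-5935-42ce-b820-697f827825c3",
        "704fda1b-cd0a-40fe-99fc-0a24250a2010",
        "7e5786f8-1174-4f75-97e3-cfecfd95d797",
        "8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd",
        "a15dab55-f1da-4f85-ae3e-2b5c5b5333ca",
        "b215031a-dd21-48b1-a998-75d6373838d8",
        "e782f412-4cb5-45f1-b5bc-15be441171aa",
    },
    "Ensure all technical and design information complies with Clients requirements, current Building Regulations": {
        "615cfc39-797f-4229-8e92-159fcf8f3030",
    },
}

print(overlap_rate(orig_mapping, wiki_mapping))  # 0.5
```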
There were 555 entities which couldn't be mapped to ESCO using the original method (this is when the match is at the least granular level e.g. S1). Of these, only 2 had matches via the wiki method. These were:
Entity: 'Change delivery Project management Business management Stakeholder management Line Management'
ESCO match the original way: 'management skills', 'S4'
ESCO matches the wiki way: ['imprinting visionary aspirations into the business management', 'incorporate visionary aspirations into the business management'], '272fddbb-917a-4720-8903-85ce51e1cbe5'
and
Entity: 'Fluent in written and spoken English Application deadline'
ESCO match the original way: 'self-management skills and competences', 'T3'
ESCO matches the wiki way: ['interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'], ['0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be']
Here are 10 random entities which were over 60 characters in length:
skill | label | orig_mapped_esco_skill | orig_mapped_esco_id | wiki_mapped_esco_skills | unique wiki_mapped_esco_ids | do I think the wiki map is better?
-- | -- | -- | -- | -- | -- | --
relevant degree and relevant management experience or equivalent competency gained | EXPERIENCE | management skills | S4.0.0 | ['improve transportation processes through application of management concepts', 'improve transportation processes through application of management principles'] | ['7afd29d7-9c9f-4151-83f2-562b8c94a3af'] | no
Transfer all vacant and all-inclusive flats over to a green energy supplier | MULTISKILL | management and administration | K0413 | ['promoting sustainable energy', 'encouraging use of sustainable energy'] | ['1a6c7e0d-fc13-41d7-a5c0-8ca00606de89'] | yes
ability to interpret Mechanical drawings Keywords design, estimation, tender, AutoCAD, water, drainage, rainwater, wastewater, Greenford, London | MULTISKILL | interpreting technical documentation and diagrams | S2.1.3 | ['create AutoCAD drawings', 'creation of AutoCAD drawings', 'creating AutoCAD drawings', 'make AutoCAD drawings', 'making of AutoCAD drawings', 'making AutoCAD drawings', 'AutoCAD drawings creation', 'AutoCAD drawing creation', 'drawing with AutoCAD', 'AutoCAD drawing'] | ['76415d7f-0fde-4364-b45f-5c044580d2aa'] | no
Perform market research to establish target accounts and contacts | MULTISKILL | performing market research | fe39d4db-4cb5-4299-bb9f-896c8fd6ab13 | ['market research', 'market research performance', 'implement market research'] | ['b011c8b4-76e1-4bbc-8bb9-1d205e7b618a', 'fe39d4db-4cb5-4299-bb9f-896c8fd6ab13'] | same
2+ years of proven track record in account relationship management or customer service | EXPERIENCE | analysing and evaluating information and data | S2.7.0 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes
3-5 years of relevant experience in the planning and management of social development activities | EXPERIENCE | technical or academic writing | S1.13.3 | ['support social change'] | ['644209ac-8452-4e81-959a-2b10050023cc'] | yes
Ensure all technical and design information complies with Clients requirements, current Building Regulations | MULTISKILL | integrate building requirements of clients in the architecture designs | bd2102ea-c8d9-40f6-8327-211450120e96 | ['building standards'] | ['615cfc39-797f-4229-8e92-159fcf8f3030'] | no
utilisation of project management aligned to the agreed delivery strategy | SKILL | management and administration | K0413 | ['principles of project management'] | ['7111b95d-0ce3-441a-9d92-4c75d05c4388'] | yes
Responding to queries via the customer service department received via telephone | SKILL | providing information to the public and clients | S3.4.1 | ['customer service', 'maintain customer service', 'customer care', 'provide customer care', 'provide outstanding customer service', 'provide training in customer service techniques', 'provide training in approaches to customer service', 'provide training in customer service methods', 'pursue the highest possible quality of customer service', 'work to achieve the highest possible level of customer service', 'act with the goal of providing the highest possible level of customer service', 'undertake communication with customer service department', 'work in communication with customer services', 'correspond with customer services', 'provide excellence in customer service'] | ['15a33d76-4640-438d-ae64-fdc0c1d3eebc', '75dfe1ee-5935-42ce-b820-697f827825c3', '704fda1b-cd0a-40fe-99fc-0a24250a2010', '7e5786f8-1174-4f75-97e3-cfecfd95d797', '8d10ae08-3b0d-4bbb-86c5-25dd2c6858cd', 'a15dab55-f1da-4f85-ae3e-2b5c5b5333ca', 'b215031a-dd21-48b1-a998-75d6373838d8', 'e782f412-4cb5-45f1-b5bc-15be441171aa'] | yes
Fluent in both German and English with exceptional verbal and written communication skill | MULTISKILL | be fluent in German | 2abb9db5-350c-444c-8292-0e0b2ce00f9a | ['understand spoken German', 'interact verbally in German', 'understand written German', 'understanding spoken German', 'comprehend spoken German', 'listen to German', 'make sense of spoken German', 'communicate verbally in German', 'verbally interact in German', 'interacting verbally in German', 'be fluent in German', 'comprehend written German', 'understanding written German', 'make sense of written German', 'correspond in written German', 'show competency in written German', 'interact verbally in English', 'understand spoken English', 'understand written English', 'interacting verbally in English', 'be fluent in English', 'verbally interact in English', 'communicate verbally in English', 'show competency in written English', 'correspond in written English', 'listen to English', 'understanding spoken English', 'comprehend spoken English', 'understand English speech', 'make sense of spoken English', 'interpret spoken English', 'understanding written English', 'interpret written English', 'make sense of written English', 'comprehend written English'] | ['1d5526b3-f17b-46fc-ba7d-f4a32d908a7e', '2abb9db5-350c-444c-8292-0e0b2ce00f9a', '486e4f39-e968-41f4-955e-56e9eba96ef5', '52894650-9077-40f0-96d6-6f07d1a6cafa', '0ee9e985-0ee5-4a73-8a12-78b53b261bb2', '64ff8d5f-58a8-4efb-af5a-e161854b3e9a', '7ee20fe2-facd-4cc5-837b-927429e0e7ac', '3993c87c-7719-4186-811b-8ddfb40e76be'] | yes
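
As a quick summary of the last column of the table, the ten manual judgements can be tallied like this. The `verdicts` list is transcribed by hand from the rows in order; the variable names are only for illustration.

```python
from collections import Counter

# Final column of the table, one entry per row, in row order
verdicts = ["no", "yes", "no", "same", "yes", "yes", "no", "yes", "yes", "yes"]

tally = Counter(verdicts)
for verdict, count in tally.most_common():
    print(f"{verdict}: {count}/{len(verdicts)}")
# yes: 6/10
# no: 3/10
# same: 1/10
```

On this very small sample the wiki mapping was judged better in 6 of 10 cases, which is too few rows to draw a firm conclusion from.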
Some avenues to improve our mapping algorithm