How to improve the accuracy while classifying short text with less context

Hi, my usecase is to classify Job Title into Functional Areas. I finetuned all-mpnet-base-v2 with the help of setfit by providing some 10+ examples for each class (Functional Areas).

I got 82% accuracy on running the evaluation on my test set. I observed some of the simple & straightforward job titles are classified into wrong label with 0.6 score.

For example:

Query: SDET
Predicted Label: Big Data / DWH / ETL
Confidence Scores:
Label: Accounting / Finance, Confidence: 0.0111
Label: Backend Development, Confidence: 0.0140
Label: Big Data / DWH / ETL, Confidence: 0.6092

Here SDET should have labelled as QA / SDET but it is classified to Big Data / DWH / ETL with 0.62 score. Few shot examples used for both classes doesn't have anything in common which could confuse the model except one example whose title is Data Quality Engineer and it is under Big Data / DWH / ETL.

Few shot examples (added only for 2 here)

{    "QA / SDET": [
        "Quality Assurance Engineer",
        "Software Development Engineer in Test (SDET)",
        "QA Automation Engineer",
        "Test Engineer",
        "QA Analyst",
        "Manual Tester",
        "Automation Tester",
        "Performance Test Engineer",
        "Security Test Engineer",
        "Mobile QA Engineer",
        "API Tester",
        "Load & Stress Test Engineer",
        "Senior QA Engineer",
        "Test Automation Architect",
        "QA Lead",
        "QA Manager",
        "End-to-End Tester",
        "Game QA Tester",
        "UI/UX Tester",
        "Integration Test Engineer",
        "Quality Control Engineer",
        "Test Data Engineer",
        "DevOps QA Engineer",
        "Continuous Integration (CI) Tester",
        "Software Test Consultant"
    ],

    "Big Data / DWH / ETL": [
        "Big Data Engineer",
        "Data Warehouse Developer",
        "ETL Developer",
        "Hadoop Developer",
        "Spark Developer",
        "Data Engineer",
        "Data Integration Specialist",
        "Data Pipeline Engineer",
        "Data Architect",
        "Database Administrator",
        "ETL Architect",
        "Data Lake Engineer",
        "Informatica Developer",
        "DataOps Engineer",
        "BI Developer",
        "Data Migration Specialist",
        "Data Warehouse Architect",
        "ETL Tester",
        "Big Data Platform Engineer",
        "Apache Kafka Engineer",
        "Snowflake Developer",
        "Data Quality Engineer",
        "Data Ingestion Engineer",
        "Big Data Consultant",
        "ETL Manager"
    ]
}

TrainingArgs

args = TrainingArguments(
    batch_size=16,
    num_epochs=1,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

Here is the complete set of functional areas.

functional_areas = [
    "Accounting / Finance",
    "Backend Development",
    "Big Data / DWH / ETL",
    "Brand Management",
    "Content Writing",
    "Customer Service",
    "Data Analysis / Business Intelligence",
    "Data Science / Machine Learning",
    "Database Admin / Development",
    "DevOps / Cloud",
    "Embedded / Kernel Development",
    "Event Management",
    "Frontend Development",
    "Full-Stack Development",
    "Functional / Technical Consulting",
    "General Management / Strategy",
    "IT Management / IT Support",
    "IT Security",
    "Mobile Development",
    "Network Administration",
    "Online Marketing",
    "Operations Management",
    "PR / Communications",
    "QA / SDET",
    "SEO / SEM",
    "Sales / Business Development"
]

My guess is accuracy is low because of short text (which is just job title). Please suggest few things which I can try out to improve the accuracy of the model.

Not sure if you figured this out already, but let me drop a comment in case someone else has the same problem.

It's very important to think about what your model is working on - which is essentially a vector representation of your text (job title) that we then transform into a prediction using a prediction model (head). I had to look up SDET, and it's almost guaranteed that a non-specific model will probably have a poor representation of that title there. Therefore, you would need a lot of examples to classify it correctly, and even then another very specific term is as likely to occupy that space as not, so all your finetuning goes to waste. You can find the datasets this was trained on here https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (and you can see the vast majority is things that do not relate to your domain at all).

Things you can do to improve this relatively painlessly (as in, there are multiple tutorials on these):

Finetune your embedding model using software job descriptions that are likely to contain your specific terms - this is likely the best way forward, or use a model already pre-trained for your domain
Use a different embedding model (e.g. a count vectorizer - on words or even ngrams) - semantic embeddings are not always the best way forward for everything, especially if your names are quite structured. You can also combine both of these in various interesting ways.

Whether more labels actually helps you is an open question.

A more involved approach would be to restate the objective (i.e. change the model), including by adding more information than the job title.

Hopefully this helped a little

huggingface / setfit

How to improve the accuracy while classifying short text with less context #558