softwaresaved / web-classifier-issues

Secondary repository to capture issues with the job advert web classifier. Contact: @steve-crouch
https://github.com/softwaresaved/web-classifier-issues/issues

Dataset appears to be loosely populated with matches. #1

Closed by npch 7 years ago

npch commented 8 years ago

It looks like the dataset contains a large number of jobs that are not research or research-related (maintenance jobs, administrators), which means that many of the jobs I classified appeared to be neither research-related nor software-related.

Is there a way of sorting the dataset to present a better mix of jobs than just random picking?

mjsandells commented 8 years ago

Totally agree with this comment. Maybe you want an overall percentage of all jobs, but that percentage will be really, really low.

SimonHettrick commented 8 years ago

Olivier and I took a look at the jobs.ac.uk website and their classifications. It looks like around 30% of jobs fall into a non-research role category (admin, finance, etc.). There's no point, as Neil says, in classifying them, so let's knock 'em out, concentrate our dataset and increase our hit rate.

SimonHettrick commented 7 years ago

@Oliph is now working on restructuring the data in the "job type" field so that we can see how many different job families exist (e.g. research, admin, etc.) and count how many jobs fall into each one. We'll then select which job families should make up the study and include only those in the classifier.
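
A minimal sketch of that count-then-select step, for illustration only. It assumes each job is a dict with a `job_type` string; the field name, the sample records, the family labels and the helper names (`normalise_family`, `count_families`, `select_jobs`) are all hypothetical and are not taken from the real dataset or classifier code.

```python
from collections import Counter

# Hypothetical sample records; the real field names and values may differ.
jobs = [
    {"id": 1, "job_type": "Academic or Research"},
    {"id": 2, "job_type": "Professional or Managerial"},
    {"id": 3, "job_type": "academic or research "},
    {"id": 4, "job_type": "Clerical"},
    {"id": 5, "job_type": "Technical"},
]


def normalise_family(raw):
    """Collapse case and surrounding whitespace so variants count as one family."""
    return raw.strip().lower()


def count_families(records):
    """Return a Counter mapping each normalised job family to its number of jobs."""
    return Counter(normalise_family(r["job_type"]) for r in records)


def select_jobs(records, families):
    """Keep only the records whose normalised family is in the chosen set."""
    keep = {normalise_family(f) for f in families}
    return [r for r in records if normalise_family(r["job_type"]) in keep]


family_counts = count_families(jobs)
# Illustrative choice of families to keep for the study (an assumption).
selected = select_jobs(jobs, ["Academic or Research", "Technical"])
```

Inspecting `family_counts.most_common()` first shows how large each family is before deciding which ones to pass to `select_jobs`.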

steve-crouch commented 7 years ago

Done - closing.