soton-data-mining / job-salary-prediction

A regression problem: predicting job salaries in the UK based on various criteria

NLP: Generalize/Clean Job Titles #4

Open utkuozbulak opened 7 years ago

utkuozbulak commented 7 years ago

Job titles are a mess; we need someone to generalize them.

We can maybe have two different features, say: Main title - sub title:

  Engineering Systems Analyst - Mathematical Modeller ??
  Engineering Systems Analyst - Water Industry ??

Examples:

1- Engineering Systems Analyst
  Engineering Systems Analyst / Mathematical Modeller
  Engineering Systems Analyst Water Industry

2- Accounts Assistant/Credit Controller
  Accounts Assistant
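
A minimal sketch of the main title / sub title split, assuming the two parts are separated by a slash or by a dash with spaces around it. The function name `split_title` and the separator pattern are hypothetical, not something taken from the repository:

```python
import re

# Assumed separators between main title and sub title:
# a slash (with optional spaces) or a dash surrounded by spaces.
_SEPARATOR = re.compile(r"\s*/\s*|\s+-\s+")


def split_title(raw_title):
    """Return (main_title, sub_title); sub_title is '' when no separator is found."""
    parts = _SEPARATOR.split(raw_title.strip(), maxsplit=1)
    main = parts[0].strip()
    sub = parts[1].strip() if len(parts) > 1 else ""
    return main, sub


if __name__ == "__main__":
    for title in [
        "Engineering Systems Analyst / Mathematical Modeller",
        "Engineering Systems Analyst Water Industry",
        "Accounts Assistant/Credit Controller",
    ]:
        print(split_title(title))
```

Titles with no explicit separator, such as "Engineering Systems Analyst Water Industry", come back unsplit under this rule, so they would need something more than punctuation (for example a list of known main titles) to be separated.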

arahayrabedian commented 7 years ago

initially done as #19, but needs a lot of tuning still

  1. stemmers/lemmatizers were selected arbitrarily; I've heard good things about Snowball stemmers but didn't specifically evaluate them. The WordNet lemmatizer did not break words down far enough.
  2. the final form, a 'sorted lemmatized sentence' may need reconsideration - it's a first stopgap way to get things rolling, but is it the best option? does sentence order actually matter?
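
For reference, the 'sorted lemmatized sentence' idea could look roughly like the sketch below. This is not the code from #19: `naive_stem` is a toy suffix stripper standing in for whichever stemmer or lemmatizer is actually chosen, and `normalize_title` is a hypothetical name. Any callable that maps a word to its stem (a Snowball stemmer's stem method, for instance) could be passed in as `stem` instead.

```python
import re


def naive_stem(word):
    """Toy stand-in for a real stemmer: strip one common suffix, if any."""
    for suffix in ("ing", "ers", "er", "s"):
        # Keep at least 3 characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_title(raw_title, stem=naive_stem):
    """Lowercase, tokenize on letters, stem each token, then sort the stems."""
    tokens = re.findall(r"[a-z]+", raw_title.lower())
    return " ".join(sorted(stem(token) for token in tokens))


if __name__ == "__main__":
    print(normalize_title("Senior Project Engineer"))
    print(normalize_title("Project engineers senior"))
```

Because the stems are sorted, the original word order is discarded, which is exactly the property point 2 asks about.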
utkuozbulak commented 7 years ago

Question related to "does sentence order actually matter?": are you asking whether there is a difference between:

  Senior project engineer
  Project engineer senior
  Project senior engineer

If so, it doesn't matter as long as they are consistent. If you represent all of the above as snr engineer prj (or any combination of the 3 words), then it's super fine. But if the words are represented in a different order each time:

  snr engineer prj
  prj snr engineer

then we have a problem, because they are treated as two different entities even though in reality they are not.
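
One way to check this consistency, sketched under the assumption that a title is just a bag of whitespace-separated words: build an order-independent key and group the raw titles by it. The names `order_free_key` and `group_titles` are made up for illustration.

```python
from collections import defaultdict


def order_free_key(title):
    """Key that ignores word order and case: a sorted tuple of the words."""
    return tuple(sorted(title.lower().split()))


def group_titles(titles):
    """Map each order-independent key to the raw titles that share it."""
    groups = defaultdict(list)
    for title in titles:
        groups[order_free_key(title)].append(title)
    return dict(groups)


if __name__ == "__main__":
    titles = [
        "Senior project engineer",
        "Project engineer senior",
        "Project senior engineer",
    ]
    for key, members in group_titles(titles).items():
        print(key, "<-", members)
```

With a key like this, the three orderings above end up in a single group, so they count as one entity rather than three.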

utkuozbulak commented 7 years ago

Are we going to do anything else on this, or is what we have enough? @arahayrabedian @alexdy2007