soton-data-mining / job-salary-prediction

A regression problem, predicting salaries of jobs in UK based on various criteria

clean company names #37

Closed blanche closed 7 years ago

blanche commented 7 years ago

cleaning company names by 1) removing special characters, lowercasing, stripping, and so on 2) removing high-scoring IDF terms such as 'ltd', 'group', 'international' using a deterministic method

this reduces the number of unique companies from 18k to ~12k
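
For readers who want a concrete picture, the two cleaning steps described above could look roughly like the sketch below. This is not the code from this PR: the function name `clean_company_name`, the `STOP_TERMS` set, and the exact regular expression are assumptions made for illustration, and only the three terms 'ltd', 'group', 'international' come from the description.

```python
import re

# Hypothetical list of terms to drop. Only these three are named in the
# PR description; the real list used in the PR may be longer.
STOP_TERMS = {"ltd", "group", "international"}


def clean_company_name(name: str) -> str:
    """Normalise a raw company name (illustrative sketch, not the PR code)."""
    # Step 1: lowercase, then replace anything that is not a letter,
    # digit, or whitespace with a space.
    normalised = re.sub(r"[^a-z0-9\s]", " ", name.lower())
    # split() with no arguments also strips and collapses whitespace.
    tokens = normalised.split()
    # Step 2: drop the fixed (deterministic) list of uninformative terms.
    kept = [token for token in tokens if token not in STOP_TERMS]
    return " ".join(kept)


if __name__ == "__main__":
    for raw in ["  ACME Group Ltd. ", "Acme International", "Smith & Sons (UK) Ltd"]:
        print(repr(raw), "->", repr(clean_company_name(raw)))
```

Under these assumptions, variants such as "ACME Group Ltd." and "Acme International" collapse to the same key, which is the kind of effect that would reduce the number of unique companies.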

an open issue on this topic is that some company names should be merged e.g.

but doing that automatically while preserving companies that should NOT be merged is kind of hard and does more harm than good e.g.

utkuozbulak commented 7 years ago

Cool