soton-data-mining / job-salary-prediction

A regression problem, predicting salaries of jobs in UK based on various criteria

Cleaning: Location RAW - Location Normalised #6

Closed utkuozbulak closed 7 years ago

utkuozbulak commented 7 years ago

Locations are a mess too. Needs to be generalized and cleaned.

E.g.: "London", "City of London", "Central London" (should Central London be treated differently from the other two? Maybe we can have two features: one for the city, and one for central vs. not central?)
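A rough sketch of the two-feature idea, in case it helps the discussion. The helper name `split_location`, the `known_cities` list, and the rule that only the word "central" marks a central location are all assumptions for illustration, not something the dataset or this repo defines.

```python
import re

# Assumption: only the literal word "central" marks a central location.
CENTRAL_PATTERN = re.compile(r"\bcentral\b", re.IGNORECASE)


def split_location(raw, known_cities=("London",)):
    """Return (city, is_central) for a raw location string.

    `known_cities` is a hypothetical lookup list; a real one would have
    to be built from the data.
    """
    text = raw.strip()
    city = None
    for candidate in known_cities:
        # Match the city name as a whole word, ignoring case.
        if re.search(r"\b" + re.escape(candidate) + r"\b", text, re.IGNORECASE):
            city = candidate
            break
    # Only flag "central" when a known city was actually found.
    is_central = bool(city) and bool(CENTRAL_PATTERN.search(text))
    return city, is_central
```

With this rule "London" and "City of London" map to the same city with the central flag off, while "Central London" gets the flag on. Whether "City of London" should also count as central is exactly the open question above.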

utkuozbulak commented 7 years ago

Location normalized is kind of shit; I think it's better if we create our own normalized location. @arahayrabedian: I think they tried to automate it and failed.

charlienewey commented 7 years ago

Some of these normalised locations are absolutely miles off. I'm now using the Google Maps geocoder to try and come up with a more precise location. The new location data is split into 3 categories: postal town, administrative area, and country, like so:

[('Dorking', 'Surrey', 'United Kingdom'),
 ('Glasgow', 'Glasgow City', 'United Kingdom'),
 (None, 'Hampshire', 'United Kingdom'),
 (None, 'Surrey', 'United Kingdom'),
 (None, 'Surrey', 'United Kingdom'),
 ('Dorking', 'Surrey', 'United Kingdom'),
 (None, None, None),
 (None, 'Greater Manchester', 'United Kingdom'),
 (None, 'West Yorkshire', 'United Kingdom'),
 (None, 'Aberdeen City', 'United Kingdom'),
 ('Derby', 'Leicestershire', 'United Kingdom'),
 ('Witney', 'Oxfordshire', 'United Kingdom'),
 ('Bristol', 'South Gloucestershire', 'United Kingdom'),
 ('Bristol', 'South Gloucestershire', 'United Kingdom'),
 ('Derby', 'Derby', 'United Kingdom'),
 ('Gateshead', 'Tyne and Wear', 'United Kingdom'),
 (None, 'Kent', 'United Kingdom'),
 (None, 'Norfolk', 'United Kingdom'),
 ('Bristol', 'South Gloucestershire', 'United Kingdom'),
 (None, 'West Midlands', 'United Kingdom'),
 (None, 'City of Bristol', 'United Kingdom'),
 (None, 'Greater London', 'United Kingdom'),
 (None, None, 'United Kingdom'),
 (None, None, 'United Kingdom'),
 (None, 'Surrey', 'United Kingdom'),
 (None, None, 'United Kingdom'),
 (None, None, 'United Kingdom'),
 (None, 'North Yorkshire', 'United Kingdom'),
 (None, None, 'United Kingdom'),
 ('London', 'Greater London', 'United Kingdom')]
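For reference, tuples shaped like these could be pulled out of a single Geocoding API result roughly as follows. This is only a sketch, not the code used here: the function name `extract_location` is made up, and mapping "administrative area" to the `administrative_area_level_2` component type is an assumption (the thread does not say which level was used). The `address_components`, `long_name`, and `types` fields are part of the Geocoding API's JSON response.

```python
# Assumption: "administrative area" corresponds to administrative_area_level_2.
WANTED_TYPES = ("postal_town", "administrative_area_level_2", "country")


def extract_location(geocode_result):
    """Return (postal_town, administrative_area, country) from one result.

    `geocode_result` is one entry of the response's "results" list.
    Components that are absent come back as None.
    """
    found = dict.fromkeys(WANTED_TYPES)
    for component in geocode_result.get("address_components", []):
        types = component.get("types", [])
        for kind in WANTED_TYPES:
            # Keep the first component seen for each wanted type.
            if kind in types and found[kind] is None:
                found[kind] = component.get("long_name")
    return tuple(found[kind] for kind in WANTED_TYPES)
```

A lookup that returns no result at all can be represented by passing an empty dict, which yields a tuple of three `None` values.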

charlienewey commented 7 years ago

This is gonna take several days due to rate limits on the Google Geocoding API.
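One way to cut the number of requests, sketched under assumptions: geocode each distinct location string only once and reuse the answer for duplicates, pausing between real calls. The names `geocode_unique`, `geocode_fn`, and `min_interval` are hypothetical; `geocode_fn` stands in for whatever function actually calls the geocoder, and the right pause depends on the quota in effect.

```python
import time


def geocode_unique(locations, geocode_fn, min_interval=0.0, sleep=time.sleep):
    """Geocode each distinct location once and map results back in order.

    `geocode_fn` is a placeholder for the real geocoder call.
    `min_interval` is the pause in seconds between two real calls.
    """
    cache = {}
    for raw in locations:
        # Treat strings that differ only by case or outer whitespace as equal.
        key = raw.strip().lower()
        if key in cache:
            continue
        if cache:
            # Throttle every call after the first one.
            sleep(min_interval)
        cache[key] = geocode_fn(raw)
    return [cache[raw.strip().lower()] for raw in locations]
```

Since many ads share the same location text, the number of API calls drops from one per row to one per distinct string. Persisting `cache` to disk between runs would also let a multi-day job resume where it stopped.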

blanche commented 7 years ago

Closed by #14?