soton-data-mining / job-salary-prediction

A regression problem, predicting salaries of jobs in the UK based on various criteria

Add code to normalise location #14

Closed charlienewey closed 7 years ago

andreaseliasson commented 7 years ago

Looks good to me. I assume there is no need to run the Python file, as the API key has been revoked?

charlienewey commented 7 years ago

Yes, there's no need to run the Python file - I think the most recent version of the dataset is on Ara's MongoDB instance.

charlienewey commented 7 years ago

As promised, latest data. This is also on Ara's DB. We can merge now.

http://www.edshare.soton.ac.uk/18257/

utkuozbulak commented 7 years ago

http://www.edshare.soton.ac.uk/18257/ 'This resource is empty'

Removed Ara's review, feel free to merge. (And provide the CSV :smiling_imp:)

arahayrabedian commented 7 years ago

@charlienewey all the other branches are using src/some_package, so src/smg is a little weird. (unless of course github is showing me strange things, which it sometimes does)

i guess we could start a preprocessing package here and throw the stuff @alexdy2007 and i are working on into it, so everything ends up in the same package.

arahayrabedian commented 7 years ago

@utkuozbulak i didn't dismiss it because it's still src/smg

utkuozbulak commented 7 years ago

@arahayrabedian Shit, you are right, it's still src/smg/. I'm sorry :cry: :sob:

Block again! :rofl: :rofl:

charlienewey commented 7 years ago

Whoops, that was accidental - blame beer. Fixed. And @utkuozbulak check edshare again, it was still uploading when you checked last time.

Ninja edit: WAT IT DIDN'T UPLOAD

charlienewey commented 7 years ago

Try this. https://www.dropbox.com/s/0ig91yyvyoda56x/jobs_norm_loc.tar.gz?dl=0

Ninja re-edit: EdShare does work now, after I said some rude words at it and re-uploaded the archive

arahayrabedian commented 7 years ago

silly edshare, thinking itself useful.

LGTM.

charlienewey commented 7 years ago

Yeah, there are quite a few errors and things that don't make sense, but there are 250k records in the dataset. If Google's geocoder doesn't like what's in the "LocationRaw" field (I'm not entirely sure that it's deterministic...), then we'll get some missing bits. Sometimes the output from the geocoder doesn't tag the parts of the location correctly (e.g. it might come back with an administrative area but not a town), but that's just something we've gotta learn to live with, I think. It's noisy data, after all.
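To make the "missing bits" concrete, here's a rough sketch of pulling fields out of one geocoder result. It assumes the result follows the Google Geocoding API JSON shape (an `address_components` list whose entries carry `long_name` and `types`). The function name, the output field names (`town`, `county`, `region`, `country`) and the sample record are hypothetical, for illustration only, and this is not necessarily how the script in this PR does it.

```python
from typing import Dict, Optional

# Hypothetical mapping: output field -> geocoder component types, in priority order.
FIELD_TYPES = {
    "town": ("postal_town", "locality"),
    "county": ("administrative_area_level_2",),
    "region": ("administrative_area_level_1",),
    "country": ("country",),
}


def normalise_location(result: dict) -> Dict[str, Optional[str]]:
    """Pick one value per field from a geocoder result; None when it is untagged."""
    components = result.get("address_components", [])
    out: Dict[str, Optional[str]] = {}
    for field, wanted_types in FIELD_TYPES.items():
        out[field] = None
        for wanted in wanted_types:
            # First component tagged with the wanted type, if any.
            match = next(
                (c["long_name"] for c in components if wanted in c.get("types", [])),
                None,
            )
            if match is not None:
                out[field] = match
                break
    return out


# Made-up result: administrative areas are tagged, but no town came back.
sample = {
    "address_components": [
        {"long_name": "Hampshire", "types": ["administrative_area_level_2", "political"]},
        {"long_name": "England", "types": ["administrative_area_level_1", "political"]},
        {"long_name": "United Kingdom", "types": ["country", "political"]},
    ]
}

print(normalise_location(sample))
```

Keeping the gaps as explicit `None` values, rather than guessing, means the noisy rows can at least be counted or filtered later.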

utkuozbulak commented 7 years ago

I won't learn to live with it, I will search for ways to fix it.