soton-data-mining / job-salary-prediction

A regression problem, predicting salaries of jobs in UK based on various criteria

Titles #12

Closed alexdy2007 closed 7 years ago

alexdy2007 commented 7 years ago

Began analysing titles and abstracted data_extractors into a separate class.

Note: changed the list of cities to include all those with a population above 1000, rather than 50000, as a lot of town names still persisted in the titles.
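A minimal sketch of what lowering the population threshold does, assuming the city list is a set of (name, population) pairs. The names `CITY_POPULATIONS`, `cities_above` and `strip_cities` are hypothetical and not taken from the repository, and the population figures are made up for illustration.

```python
import re

# Hypothetical sample data: (place name, population).
# The figures are illustrative only, not real census numbers.
CITY_POPULATIONS = [
    ("London", 8_000_000),
    ("Southampton", 250_000),
    ("Romsey", 15_000),
    ("Tinyham", 400),
]

def cities_above(populations, min_population):
    """Return lower-cased names of places with a population above the threshold."""
    return {name.lower() for name, pop in populations if pop > min_population}

def strip_cities(title, cities):
    """Drop every single-word token of the title that matches a known place name."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return " ".join(t for t in tokens if t not in cities)

title = "Software Engineer Romsey"
with_old_threshold = strip_cities(title, cities_above(CITY_POPULATIONS, 50000))
with_new_threshold = strip_cities(title, cities_above(CITY_POPULATIONS, 1000))
```

With the 50000 threshold the small town stays in the title, while the 1000 threshold removes it. This sketch only matches single-word place names; multi-word names would need phrase matching.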

utkuozbulak commented 7 years ago

I'm guessing you will update this with @arahayrabedian when the issues with titles are done?

alexdy2007 commented 7 years ago

Yeah, changed most of it now with @arahayrabedian. Should hopefully finish by tomorrow.

General steps

step 1) remove stop words
step 2) remove other stuff
step 3) map title to list of jobs in archive
step 4) map extra info to modifier vector list
step 5) make it binary classification
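The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the stop-word, job and modifier lists are tiny hypothetical stand-ins for the project's real lists, "other stuff" is assumed to mean punctuation and digits, and step 5 is read here as encoding each title as a vector of 0/1 indicators. The names `tokenize` and `encode_title` are not from the repository.

```python
import re

# Assumption: small illustrative lists; the project's real lists are not shown in this thread.
STOP_WORDS = {"a", "an", "and", "the", "for", "in", "of", "to"}
KNOWN_JOBS = ["engineer", "nurse", "teacher"]          # hypothetical job archive
MODIFIERS = ["senior", "junior", "graduate", "lead"]   # hypothetical modifier list

def tokenize(title):
    """Steps 1-2: lower-case, keep alphabetic tokens only, then remove stop words."""
    tokens = re.findall(r"[a-z]+", title.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def encode_title(title):
    """Steps 3-5: one 0/1 indicator per known job, followed by one per modifier."""
    tokens = set(tokenize(title))
    job_vector = [1 if job in tokens else 0 for job in KNOWN_JOBS]
    modifier_vector = [1 if mod in tokens else 0 for mod in MODIFIERS]
    return job_vector + modifier_vector

vector = encode_title("Senior Engineer for the NHS (London)!")
```

Here `vector` has one slot per entry in `KNOWN_JOBS` followed by one slot per entry in `MODIFIERS`, so it can be appended to the other features as extra columns.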

@utkuozbulak: I'm not loading it into pandas, since I'm reading it into a list to remove stop words and do other preprocessing. Planning on then sticking it into a vector.

utkuozbulak commented 7 years ago

"@utkuozbulak : not loading into pandas as reading into a list to remove stop words and other preprocessing things. planning on then sticking it into a vector"

    data = pd.read_csv('file')  # Already in main
    specific_feature = data[['column_name']]
    feature_as_list = pandas_vector_to_list(specific_feature)

    def pandas_vector_to_list(pandas_df):  # Already in cleaning functions
        py_list = [item[0] for item in pandas_df.values.tolist()]
        return py_list

This is much simpler than reading the file manually, no?
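For reference, a small self-contained sketch of the pandas route, assuming a column called `Title` and an in-memory CSV with made-up rows standing in for the real file. Selecting the column with single brackets gives a Series, which converts to a list directly and should match what the helper returns for a one-column DataFrame.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the real data file; the rows are made up.
csv_text = "Title,Salary\nSenior Engineer,50000\nStaff Nurse,30000\n"
data = pd.read_csv(io.StringIO(csv_text))

# Single brackets select a Series, which has its own tolist().
titles = data["Title"].tolist()

# Same logic as the helper in this thread, applied to a one-column DataFrame.
titles_via_helper = [item[0] for item in data[["Title"]].values.tolist()]
```

Either way the result is a plain Python list, so the stop-word removal and other preprocessing can run on it unchanged.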

alexdy2007 commented 7 years ago

I don't mind either way, manual or not.

arahayrabedian commented 7 years ago

This is all contained as part of #19, so closing. Should have just used this issue, to be honest. My bad.