soton-data-mining / job-salary-prediction

A regression problem: predicting salaries of jobs in the UK based on various criteria

Update Encoding Functions #38

Open utkuozbulak opened 7 years ago

utkuozbulak commented 7 years ago

Currently the encoding functions return only the data. We need to update them to return both the data and the column names, so that column names are easier to handle. E.g.: one_hot_encoded_company_name, company_name_columns = one_hot_encode(company_name_feature)

Here, company_name_columns should be something like ['comp_name_1', 'comp_name_2', 'comp_name_3', ...], with as many entries as needed.

The main driver for this is that right now we don't know which column is which after encoding happens. How do we do feature selection without knowing?
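A minimal, dependency-free sketch of what such an encoder could look like. The `prefix` parameter, the plain-list representation of the data, and the choice to name columns after the category value (rather than numbering them as in the example above) are all assumptions for illustration, not the project's actual implementation.

```python
def one_hot_encode(values, prefix):
    """Return (rows, column_names) for a list of categorical values.

    `prefix` is a hypothetical parameter used to build readable column names.
    """
    # Sorted unique categories give a stable, reproducible column order.
    categories = sorted(set(values))
    column_names = [f"{prefix}_{category}" for category in categories]
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # mark the column matching this value
        rows.append(row)
    return rows, column_names


# Made-up sample data, only to show the two return values.
company_name_feature = ["acme", "globex", "acme"]
one_hot_encoded_company_name, company_name_columns = one_hot_encode(
    company_name_feature, prefix="comp_name"
)
```

Returning the names alongside the data keeps the two in the same order, so column `i` of the encoded data can always be traced back to `company_name_columns[i]`.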

blanche commented 7 years ago

For my understanding: we will want to remove entire features, not just a single binary-encoded column of one. I.e. we will remove all of comp_name_1, comp_name_2, comp_name_3 and so on, not just comp_name_253.

utkuozbulak commented 7 years ago

Of course we will want to remove entire features. We will need a function that takes a list of strings as input and removes all the features whose names contain any of those strings.

feature_to_remove = ['company', 'title']

def remove_feature(feature_list): ... returns the dataset with all company and title features removed (company_1, company_2, company_3, company_4, ...)
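One possible way to flesh this out, as a sketch only. It assumes the dataset is a list of rows plus a parallel list of column names; the function name `remove_features` and its three-argument signature are hypothetical and differ from the one-argument outline above.

```python
def remove_features(rows, column_names, features_to_remove):
    """Drop every column whose name contains any of the given substrings."""
    # Indices of the columns that match none of the substrings.
    keep = [
        i for i, name in enumerate(column_names)
        if not any(feature in name for feature in features_to_remove)
    ]
    kept_names = [column_names[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_rows, kept_names


# Made-up sample data, only to show the behaviour.
column_names = ["company_1", "company_2", "title_1", "location_1"]
rows = [[1, 0, 1, 1], [0, 1, 1, 0]]
feature_to_remove = ["company", "title"]
filtered_rows, filtered_names = remove_features(rows, column_names, feature_to_remove)
```

Note that plain substring matching can remove more than intended (for example, 'title' would also match a column named 'subtitle_1'), so matching on a prefix such as 'title_' may be safer.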