Open utkuozbulak opened 7 years ago
For my understanding: we will want to remove entire features, not just a single binary-encoded column of a feature, i.e. we will remove all of comp_name_1, comp_name_2, comp_name_3 and so on, not just comp_name_253.
Of course we will want to remove entire features. We will need a function that takes a list of strings as input and removes every feature whose name contains one of those strings.
feature_to_remove = ['company','title']
def remove_feature(feature_list): ... ... ... returns the dataset with all company and title features removed (company_1, company_2, company_3, company_4, ...)
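A minimal sketch of what such a function could look like. It assumes the dataset is held as a plain dict mapping column names to lists of values, which this thread does not confirm; the name `remove_features`, the `dataset` argument and the sample columns are hypothetical and only there for illustration.

```python
def remove_features(dataset, features_to_remove):
    """Return a copy of `dataset` without any column whose name
    contains one of the strings in `features_to_remove`.

    Assumption: `dataset` is a dict of column name -> list of values.
    """
    return {
        name: values
        for name, values in dataset.items()
        if not any(feature in name for feature in features_to_remove)
    }


# Hypothetical sample data, for illustration only.
dataset = {
    "company_1": [1, 0],
    "company_2": [0, 1],
    "title_1": [1, 1],
    "salary": [50, 60],
}

reduced = remove_features(dataset, ["company", "title"])
print(list(reduced))  # ['salary']
```

Note that plain substring matching would also drop a column such as `subtitle_1` when asked to remove `title`. Matching on a prefix (for example with `name.startswith(feature + "_")`) may be the safer choice if column names can overlap like that.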
Currently the encoding functions return only the data. They need to be updated to return both the data and the column names, so that column names are easier to handle. E.g.: one_hot_encoded_company_name, company_name_columns = one_hot_encode(company_name_feature)
Here, company_name_columns should be something like ['comp_name_1', 'comp_name_2', 'comp_name_3', ...], with as many entries as needed.
The main driver for this is that, right now, we don't know which column is which after encoding happens. How do we do feature selection without knowing?
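As a rough illustration of an encoder that returns both the data and the column names, here is a pure-Python sketch. The `prefix` parameter, the sorted category order and the sample values are assumptions made for this example; the project's real `one_hot_encode` may be built differently.

```python
def one_hot_encode(values, prefix):
    """One-hot encode `values`; return (encoded_rows, column_names).

    Assumption: categories are numbered from 1 in sorted order, and
    `prefix` (hypothetical parameter) is used to build the column names.
    """
    categories = sorted(set(values))
    column_names = [f"{prefix}_{i}" for i in range(1, len(categories) + 1)]
    index_of = {category: i for i, category in enumerate(categories)}

    encoded_rows = []
    for value in values:
        row = [0] * len(categories)
        row[index_of[value]] = 1  # mark the column for this category
        encoded_rows.append(row)
    return encoded_rows, column_names


# Hypothetical sample feature, for illustration only.
company_name_feature = ["acme", "globex", "acme", "initech"]

one_hot_encoded_company_name, company_name_columns = one_hot_encode(
    company_name_feature, "comp_name"
)
print(company_name_columns)  # ['comp_name_1', 'comp_name_2', 'comp_name_3']
print(one_hot_encoded_company_name[0])  # [1, 0, 0]
```

Returning the names alongside the data keeps the mapping from each encoded column back to its source feature, which is what feature selection needs later on.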