
soton-data-mining / job-salary-prediction

A regression problem: predicting salaries of jobs in the UK based on various criteria


Cleaning/category #27

Closed andreaseliasson closed 7 years ago

andreaseliasson commented 7 years ago

@arahayrabedian: fixed pep8 import issue.

andreaseliasson commented 7 years ago

@arahayrabedian: branch has been rebased on top of master and latest commit reflects the Oldmainmethod.py setup.

arahayrabedian commented 7 years ago

still looks good, haven't run it myself but eyeballing it looks nice.

utkuozbulak commented 7 years ago

Umm, can you help me understand something: Note from the first commit:

For now this function is used to clean the category strings by removing the 'Jobs' substring. The 'Jobs' is superfluous data which appears in each category so we can remove

If there are N unique categories before this conversion, there are still N different categories after it (we just removed 'Jobs' from each one). What's the point? When we encode it, having 'Jobs' in the string doesn't matter, since everything will be encoded anyway.

Or were there some categories such as 'Engineering' and 'Engineering Jobs', so that this cleaning reduced the number of unique categories? Is this the case?
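A quick way to check which of the two cases applies is to count unique categories before and after the cleaning. The sketch below is only an illustration: `clean_category` and `unique_counts` are hypothetical helpers (not the function from this branch), the sample labels are made up, and it assumes 'Jobs' appears as a trailing " Jobs" token.

```python
def clean_category(category: str) -> str:
    """Strip a trailing ' Jobs' token from a category label (hypothetical helper)."""
    suffix = " Jobs"
    if category.endswith(suffix):
        return category[: -len(suffix)]
    return category


def unique_counts(categories):
    """Return (unique count before cleaning, unique count after cleaning)."""
    before = set(categories)
    after = {clean_category(c) for c in before}
    return len(before), len(after)


# Made-up sample: every label carries the suffix, so nothing merges.
all_suffixed = ["Engineering Jobs", "IT Jobs", "Teaching Jobs", "IT Jobs"]

# Made-up sample: 'Engineering' and 'Engineering Jobs' collapse into one label.
mixed = ["Engineering", "Engineering Jobs", "IT Jobs"]

print(unique_counts(all_suffixed))
print(unique_counts(mixed))
```

If the two counts are equal on the real category column, the cleaning only shortens the strings; if the second count is smaller, some labels differed only by the 'Jobs' suffix and have been merged.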

andreaseliasson commented 7 years ago

@utkuozbulak: I see your point. Removing 'Jobs' will not reduce the number of unique categories.

My reasoning for doing this was to reduce the amount of encoding that had to be done, which would in turn reduce the number of bits used and thus save space and possibly run-time.

But if the encoding process doesn't take into account the length of the categories, then we can discard the changes. If so, I will update the branch to only include the minor fixes (we have duplicate 'ContractTime' features).

utkuozbulak commented 7 years ago

Encoding (the way we do it now) doesn't take the number of bits into account, so from that perspective it won't change anything, but there is no need to discard the change. Let it stay like this; maybe we will end up changing the way we encode, and then it might help. Who knows?
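To illustrate why string length does not matter here, below is a minimal sketch that assumes the encoding is a plain label encoding (each distinct category mapped to an integer). The thread does not say which encoder the project uses, so `label_encode` is a hypothetical stand-in and the sample labels are made up.

```python
def label_encode(values):
    """Map each distinct string to an integer code (hypothetical stand-in encoder)."""
    # Codes are assigned in sorted order of the distinct labels.
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]


# Made-up category column, with and without the ' Jobs' suffix.
raw = ["Engineering Jobs", "IT Jobs", "Teaching Jobs", "IT Jobs"]
cleaned = [c.replace(" Jobs", "") for c in raw]

raw_codes = label_encode(raw)
cleaned_codes = label_encode(cleaned)

# For this sample both columns encode to the same integer codes.
print(raw_codes == cleaned_codes)
```

Under that assumption, each category ends up as a single integer however long the original label was, so shortening the strings does not change the encoded feature for this sample.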