novik2713 / data-circle-team-C


Future Engineering #18

Open udithac opened 5 months ago

udithac commented 5 months ago

Working on the following Feature Engineering tasks:

  1. Years of Professional Coding Experience: Create a new feature by converting YearsCodePro (number of years coding professionally) into a numerical format suitable for analysis.
  2. Years of Overall Coding Experience: Similarly, convert YearsCode (number of years coding in total, including non-professional experience) into a numerical format.
  3. Education Level Encoding: Encode EdLevel (highest education level) into numerical categories, e.g., associate's degree = 1, bachelor's degree = 2, master's degree = 3, etc.
  4. Remote Work Encoding: Encode RemoteWork (remote work preference) into binary categories (0 or 1) indicating whether the respondent prefers remote work or not.
  5. Development Role One-Hot Encoding: Use DevType (development role) to create binary features indicating the presence or absence of specific development roles (e.g., backend developer, frontend developer, full-stack developer, etc.).
  6. Language Experience Count: Count the number of programming languages (LanguageHaveWorkedWith) the respondent has experience with.
  7. Database Experience Count: Count the number of databases (DatabaseHaveWorkedWith) the respondent has experience with.
  8. Platform Experience Count: Count the number of platforms (PlatformHaveWorkedWith) the respondent has experience with.
  9. Miscellaneous Technology Count: Count the number of miscellaneous technologies (MiscTechHaveWorkedWith) the respondent has experience with.
  10. Office Stack Asynchronous Tools Count: Count the number of asynchronous office stack tools (OfficeStackAsyncHaveWorkedWith) the respondent has experience with.
  11. Stack Overflow Usage Frequency Encoding: Encode SOVisitFreq (frequency of visiting Stack Overflow) into numerical categories (e.g., daily = 3, weekly = 2, monthly = 1, rarely = 0).
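A minimal sketch of how several of the tasks above could be implemented as plain Python helpers. The special text answers for the years columns, the `;` separator for multi-select answers, the keyword matching for EdLevel, and the short SOVisitFreq labels are all assumptions about the raw data and should be checked against the actual survey file. The function names are hypothetical.

```python
# Assumed text answers that appear instead of a number in YearsCode / YearsCodePro.
YEARS_SPECIAL = {"Less than 1 year": 0.5, "More than 50 years": 51.0}

# Codes taken from the task list; levels not listed there stay unmapped (None).
ED_LEVEL_KEYWORDS = [("associate", 1), ("bachelor", 2), ("master", 3)]

# Labels as written in the task list; the raw survey strings may be longer.
SO_VISIT_FREQ = {"rarely": 0, "monthly": 1, "weekly": 2, "daily": 3}


def _is_missing(value):
    """True for None and for NaN (NaN is the only value not equal to itself)."""
    return value is None or value != value


def years_to_number(value):
    """Tasks 1 and 2: convert a years answer to a float, or None if unusable."""
    if _is_missing(value):
        return None
    text = str(value).strip()
    if text in YEARS_SPECIAL:
        return YEARS_SPECIAL[text]
    try:
        return float(text)
    except ValueError:
        return None


def encode_ed_level(value):
    """Task 3: ordinal code for EdLevel based on a keyword in the answer."""
    if _is_missing(value):
        return None
    text = str(value).lower()
    for keyword, code in ED_LEVEL_KEYWORDS:
        if keyword in text:
            return code
    return None


def count_selected(value, sep=";"):
    """Tasks 6 to 10: number of non-empty entries in a multi-select answer."""
    if _is_missing(value):
        return 0
    return len([item for item in str(value).split(sep) if item.strip()])


def encode_so_visit_freq(value):
    """Task 11: ordinal code for SOVisitFreq, or None for unknown labels."""
    if _is_missing(value):
        return None
    return SO_VISIT_FREQ.get(str(value).strip().lower())
```

If the data is loaded with pandas, each helper can be applied per column, for example `df["YearsCodePro"].map(years_to_number)` or `df["LanguageHaveWorkedWith"].map(count_selected)`.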

jnehring commented 5 months ago

So much input. I just have an idea for the first one, how to encode "Years of Professional Coding Experience": think about one-hot encoding. Create a table such as

sample   1-2 years   3-4 years   5-6 years   ...
1        0           1           0           0
2        1           0           0           0
3        0           0           1           0

So you encode each feature as a sparse vector that is 0 in every place except for a 1 at the option the user selected in the survey. This works really well with linear regression and other models.
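A small sketch of the binned one-hot encoding described above, assuming whole-number years and only the three bins named in the table (the further bins behind "..." are not specified, so they are left out). The names `BINS` and `one_hot_years` are hypothetical.

```python
# Inclusive (low, high) year ranges, matching the named table columns.
BINS = [(1, 2), (3, 4), (5, 6)]


def one_hot_years(years, bins=BINS):
    """Return a list with a 1 in the bin that contains `years` and 0 elsewhere.

    Values outside every bin produce an all-zero vector.
    """
    return [1 if low <= years <= high else 0 for low, high in bins]


# The three sample rows from the table: 3-4 years, 1-2 years, 5-6 years.
rows = [one_hot_years(y) for y in (3, 1, 5)]
```

Each position in the returned list becomes its own column in the feature matrix.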

novik2713 commented 5 months ago

Questions 1 and 2 are similar to https://github.com/novik2713/data-circle-team-C/issues/10. IMXO we should use just the one of the three that has the strongest relation/correlation to Salary.
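One possible way to compare the candidates, sketched under the assumption that "strongest relation" means the largest absolute Pearson correlation with Salary and that the columns are already numeric with no missing values. The function names are hypothetical and the numbers in the usage lines are made up for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_feature(features, target):
    """Name of the feature with the largest absolute correlation to target."""
    return max(features, key=lambda name: abs(pearson(features[name], target)))


# Made-up illustration data, not survey results.
candidates = {"YearsCodePro": [1, 2, 3, 4], "YearsCode": [1, 3, 2, 4]}
salary = [10, 20, 30, 40]
best = strongest_feature(candidates, salary)
```

Pearson correlation only captures linear relationships, so a rank-based measure may be worth comparing as well.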