twitte01 / 232R_GroupProject

UCSD Spring 2024 232R Big Data Analytics Using Spark Group Project

Determine scaling and transformation needed #13

Closed twitte01 closed 6 months ago

twitte01 commented 6 months ago

Add detailed issues once they are determined.

twitte01 commented 6 months ago

One-hot encoding

One way to handle categorical variables is to use one-hot encoding. One-hot encoding transforms a categorical variable into a set of binary features, where each feature represents a distinct category. For example, suppose we have a categorical variable “color” that can take on the values red, blue, or yellow. We can transform this variable into three binary features, “color-red,” “color-blue,” and “color-yellow,” each of which takes the value 1 or 0. This increases the dimensionality of the feature space, but it lets us use clustering algorithms that expect numeric input.

One-hot encoding is only suitable for nominal data, which has no inherent order. For ordinal data, such as “bad,” “average,” and “good,” it may be more appropriate to use a numerical encoding, such as 0, 1, and 2, respectively.
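A minimal plain-Python sketch of the two encodings described above, using the color and rating examples. The helper names `one_hot` and `ordinal_encode` are hypothetical and are not taken from the project's preprocessing code. A Spark pipeline would more likely use the encoders in its ML library, so treat this only as an illustration of the idea.

```python
def one_hot(values, prefix):
    """Turn a list of nominal values into binary indicator columns."""
    categories = sorted(set(values))  # fixed, reproducible column order
    names = [f"{prefix}-{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


def ordinal_encode(values, ordered_levels):
    """Map ordinal values to integers that preserve the given order."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    return [rank[v] for v in values]


# Nominal variable: no inherent order, so use one indicator per category.
colors = ["red", "blue", "yellow", "blue"]
names, rows = one_hot(colors, "color")

# Ordinal variable: the order bad < average < good is kept as 0 < 1 < 2.
ratings = ordinal_encode(["good", "bad", "average"], ["bad", "average", "good"])

print(names)
print(rows)
print(ratings)
```

Each row of the one-hot output contains exactly one 1, so a variable with k categories adds k columns, which is the dimensionality cost mentioned above.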

twitte01 commented 6 months ago

Individual Census

Technical Variables

Household Variables
Demographic Variables
Education Variables
Health Variables
Employment & Income Variables

twitte01 commented 6 months ago

Scaling Functions

MinMax Scaler (CHOSEN)

MaxAbs Scaler

Robust Scaler

Standard Scaler
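For reference, here is a rough plain-Python sketch of what each of the four scalers listed above computes on a single numeric column. The function names and the sample `incomes` column are made up for illustration. This is not the project's Spark code, and library implementations may differ in details such as how quantiles are estimated or whether the sample or population standard deviation is used.

```python
import statistics


def min_max_scale(xs):
    """Rescale to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in xs]


def max_abs_scale(xs):
    """Rescale to [-1, 1] by dividing by the largest absolute value."""
    m = max(abs(x) for x in xs)
    return [x / m if m else 0.0 for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - med) / iqr if iqr else 0.0 for x in xs]


def standard_scale(xs):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mean) / sd if sd else 0.0 for x in xs]


# Hypothetical column with one large outlier.
incomes = [10.0, 20.0, 30.0, 40.0, 1000.0]

mm = min_max_scale(incomes)
ma = max_abs_scale(incomes)
rb = robust_scale(incomes)
st = standard_scale(incomes)

for name, scaled in [("min-max", mm), ("max-abs", ma), ("robust", rb), ("standard", st)]:
    print(name, [round(v, 3) for v in scaled])
```

With an outlier like the last value, min-max and max-abs scaling squeeze the remaining values close to 0, while robust scaling keeps them spread out. That trade-off is worth keeping in mind when comparing the options.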

twitte01 commented 6 months ago

Household Census Variables

Technical Variables

Geographic Variables

Economic Characteristics

Appliance, Mechanical, Other Variables

Household Composition Variables

twitte01 commented 6 months ago

@CanIGetAnAman I noted some variables from each dataset that I removed b/c they would need some feature engineering to be useful.

twitte01 commented 5 months ago

@vitush99 Here is the preprocessing I did!!