twitte01 closed 6 months ago
One way to handle categorical variables is one-hot encoding, which transforms a categorical variable into a set of binary features, one per distinct category. For example, suppose we have a categorical variable “color” that can take on the values red, blue, or yellow. We can transform this variable into three binary features, “color-red,” “color-blue,” and “color-yellow,” each of which takes only the values 1 or 0. This increases the dimensionality of the space, but it lets us use clustering algorithms that expect numeric input.
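As a minimal sketch of the color example, assuming the data lives in a pandas DataFrame (the toy rows are made up for illustration), `pandas.get_dummies` produces one binary column per category:

```python
import pandas as pd

# Hypothetical toy data; the "color" column mirrors the example in the text.
df = pd.DataFrame({"color": ["red", "blue", "yellow", "blue"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(
    df, columns=["color"], prefix="color", prefix_sep="-", dtype=int
)
print(encoded)
```

Each row has exactly one 1 across the three new columns, so the original category can always be recovered.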
Note that one-hot encoding is only suitable for nominal data, which has no inherent order. For ordinal data, such as “bad,” “average,” and “good,” it may be more appropriate to use a numerical encoding, such as 0, 1, and 2, respectively.
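For the ordinal case, one possible sketch is an explicit mapping, so the numeric codes follow the intended order rather than alphabetical order. The `ratings` series and its values are hypothetical:

```python
import pandas as pd

# Hypothetical ratings column using the ordinal levels named in the text.
ratings = pd.Series(["good", "bad", "average", "good"], name="rating")

# Explicit mapping so the codes respect the order bad < average < good.
order = {"bad": 0, "average": 1, "good": 2}
encoded_ratings = ratings.map(order)
print(encoded_ratings.tolist())  # [2, 0, 1, 2]
```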
EMPSTAT: (Employment status [general version]) Normalize
EMPSTATD: (Employment status [detailed version]) Next Steps
CLASSWKR: (Class of worker [general version]) Class of worker describes whether an individual is self-employed or employed by a corporation, reported as a numeric key.
CLASSWKRD: (Class of worker [detailed version]) Next Step
UHRSWORK: (Usual hours worked per week) Normalize
LOOKING: (Looking for work) Combine N/A and not reported; normalize
INCTOT: (Total personal income) Normalize
FTOTINC: (Total family income) Normalize
INCWELFR: (Welfare (public assistance) income) Normalize
INCINVST: (Interest, dividend, and rental income) Normalize
POVERTY: (Poverty status) Normalize
CHOSEN
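from sklearn.preprocessing import MinMaxScaler
# Four samples, two features
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()  # scales each feature to [0, 1] by default
print(scaler.fit(data))  # learns the per-feature min and max
print(scaler.data_max_)  # [ 1. 18.]
print(scaler.transform(data))  # each column rescaled to [0, 1]
# Values outside the fitted range map outside [0, 1]: [[1.5 0. ]]
print(scaler.transform([[2, 2]]))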
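Applied to the variables marked Normalize, this might look like the sketch below. It assumes the data is loaded into a pandas DataFrame. Only the column names come from the list; the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up values; only the column names come from the variable list.
df = pd.DataFrame({
    "UHRSWORK": [0, 20, 40, 60],
    "INCTOT": [0, 15000, 42000, 90000],
})

# Scale the selected columns in place; other columns would be left untouched.
cols = ["UHRSWORK", "INCTOT"]
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])
print(df)
```

Keeping the fitted `scaler` around makes it possible to apply the same min and max to new rows later with `scaler.transform`.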
YEAR: Normalize
SAMPLE: (IPUMS sample identifier) Remove
SERIAL: (Household serial number) Remove
CBSERIAL: (Original Census Bureau household serial number) Remove
HHWT: (Household weight) Remove
HHTYPE: (Household Type) Combine 9 & 0; normalize
CLUSTER: (Household cluster for variance estimation) Remove
CPI99: (CPI-U adjustment factor to 1999 dollars) Remove
STRATA: (Household strata for variance estimation) Remove
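The household-level steps above could be sketched roughly as follows, assuming a pandas DataFrame. The rows are made up, and folding code 9 into code 0 (rather than the reverse) is an assumption, since the list only says to combine them.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up rows; column names come from the list above, values are illustrative.
df = pd.DataFrame({
    "YEAR": [2019, 2020, 2021],
    "SERIAL": [101, 102, 103],
    "HHWT": [55.0, 60.0, 48.0],
    "HHTYPE": [1, 9, 0],
})

# Drop the columns marked Remove; errors="ignore" skips any that are absent.
to_remove = ["SAMPLE", "SERIAL", "CBSERIAL", "HHWT", "CLUSTER", "CPI99", "STRATA"]
df = df.drop(columns=to_remove, errors="ignore")

# Combine codes 9 and 0 (assumption: 0 is the code that is kept).
df["HHTYPE"] = df["HHTYPE"].replace({9: 0})

# Normalize the remaining columns to [0, 1].
keep = ["YEAR", "HHTYPE"]
df[keep] = MinMaxScaler().fit_transform(df[keep])
print(df)
```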
@CanIGetAnAman I noted some variables from each dataset that I removed b/c they would need some feature engineering to be useful
@vitush99 Here is the preprocessing I did!!
add detailed issues when determined