twitte01 commented 6 months ago

add detailed issues when determined

twitte01 commented 6 months ago

One-hot encoding

One way to handle categorical variables is to use one-hot encoding. One-hot encoding transforms categorical variables into a set of binary features, where each feature represents a distinct category. For example, suppose we have a categorical variable “color” that can take on the values red, blue, or yellow. We can transform this variable into three binary features, “color-red,” “color-blue,” and “color-yellow,” which can only take on the values 1 or 0. This increases the dimensionality of the space, but it allows us to use any clustering algorithm we like.
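As a rough illustration of the color example above, here is a minimal sketch using pandas `get_dummies`. The toy DataFrame and the `-` separator in the column names are assumptions made for the example, not part of the project data.

```python
import pandas as pd

# Toy data: the "color" variable from the example above (values are made up).
df = pd.DataFrame({"color": ["red", "blue", "yellow", "blue"]})

# One binary column per category; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(
    df, columns=["color"], prefix="color", prefix_sep="-", dtype=int
)
print(encoded)
```

Each row has exactly one 1 across the three new columns, which is what lets a distance-based clustering algorithm treat the categories as unordered.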

It is important to note that one-hot encoding is only suitable for nominal data, which does not have an inherent order. For ordinal data, such as “bad,” “average,” and “good,” it may be more appropriate to use a numerical encoding, such as 0, 1, and 2, respectively.
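For the ordinal case, a small sketch of the numerical encoding described above. The Series and the `order` dictionary are illustrative assumptions; the 0, 1, 2 values come from the example in the text.

```python
import pandas as pd

# Toy ordinal variable (values are made up).
ratings = pd.Series(["good", "bad", "average", "good"])

# Explicit mapping that preserves the order bad < average < good.
order = {"bad": 0, "average": 1, "good": 2}
encoded = ratings.map(order)
print(encoded.tolist())
```

Any label missing from the mapping would come out as NaN, so it is worth checking for unexpected categories before mapping.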

twitte01 commented 6 months ago

Individual Census

Technical Variables

YEAR: Normalize
SAMPLE: (IPUMS sample identifier) REMOVE
SERIAL: (Household serial number) REMOVE
CBSERIAL: (Original Census Bureau household serial number) REMOVE
HHWT: (Household weight) REMOVE
PERNUM: (Person number in sample unit) Next Steps
CBPERNUM: (Original Census Bureau Person number in sample unit) REMOVE
CLUSTER: (Household cluster for variance estimation) REMOVE
CPI99: (CPI-U adjustment factor to 1999 dollars) REMOVE
STRATA: (Household strata for variance estimation) REMOVE
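The REMOVE decisions above could be applied in one step, roughly like this. The DataFrame and its values are placeholders invented for the sketch; only the column names come from the list.

```python
import pandas as pd

# Placeholder rows; real values would come from the IPUMS extract.
df = pd.DataFrame({
    "YEAR": [2021, 2022], "SAMPLE": [202101, 202201], "SERIAL": [1, 2],
    "CBSERIAL": [11, 12], "HHWT": [50.0, 61.0], "PERNUM": [1, 1],
    "CBPERNUM": [1, 1], "CLUSTER": [101, 102], "CPI99": [0.6, 0.57],
    "STRATA": [7, 9], "AGE": [34, 58],
})

# Technical variables marked REMOVE in the list above.
to_remove = ["SAMPLE", "SERIAL", "CBSERIAL", "HHWT",
             "CBPERNUM", "CLUSTER", "CPI99", "STRATA"]
df = df.drop(columns=to_remove)
print(list(df.columns))
```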

Household Variables

PERWT: (Person weight) Next Steps
FAMSIZE: (Number of own family members in household) Binary
GQ: (Group quarters status) One-hot encoding --> Normalize; combine into Households and Group Quarters

Demographic Variables

SEX: (Sex) [Current Encoding: 1: Male, 2: Female, 3: Missing] [New Encoding: 0:Male, 1: Female] No missing
AGE: (Age) Normalize
MARST: (Marital status) No missing; Normalize; current order sufficient
1. Married, spouse present
2. Married, spouse absent
3. Separated
4. Divorced
5. Widowed
6. Never married/single
7. Blank/missing
RACE: (Race [general version]) Normalize
1. White
2. Black/African American
3. American Indian or Alaskan Native
4. Chinese
5. Japanese
6. Pacific Islander or other Asian
7. Other race
8. Two major races
9. Three or more major races
RACED: (Race [detailed version]) Next Steps
CITIZEN: (Citizenship status) Normalize
1. N/a
2. Born abroad American parents
3. Naturalized
4. Not a citizen
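The SEX recoding noted above (1: Male, 2: Female becoming 0: Male, 1: Female) can be sketched as a simple mapping. The Series here is toy data, and the sketch assumes the missing code does not occur, as the note says.

```python
import pandas as pd

# Toy SEX column using the current encoding (1 = Male, 2 = Female).
sex = pd.Series([1, 2, 2, 1], name="SEX")

# New encoding: 0 = Male, 1 = Female. Unmapped codes would become NaN.
sex_binary = sex.map({1: 0, 2: 1})
print(sex_binary.tolist())
```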

Education Variables

SCHOOL: (School attendance) Normalize
1. N/a
2. Not in school
3. Yes in school
4. Unknown - 0 values
5. Missing - 0 values
EDUC: (Educational attainment [general version]) Normalize
1. N/a or no schooling
2. Nursery - Grade 4
3. Grades 5 - 8
4. Grade 9
5. Grade 10
6. Grade 11
7. Grade 12
8. 1 year of college
9. 2 years of college
10. 3 years of college
11. 4 years of college
12. 5+ years of college
13. Missing
EDUCD: (Educational attainment [detailed version]) Next Steps
SCHLTYPE: (Public or private school) Normalize
1. N/a
2. Not enrolled in school
3. Public school
4. Private school
5. Church-related - 0 cases
6. Parochial - 0 cases
7. Other private (1980) - 0 cases
8. Other private (1970) - 0 cases

Health Variables

HCOVANY: (Any health insurance coverage) Normalize
1. No health insurance
2. Has health insurance

Employment & Income Variables

EMPSTAT: (Employment status [general version]) Normalize
1. N/a
2. Employed
3. Unemployed
4. Not in labor force
5. Unknown/illegible - 0 cases
EMPSTATD: (Employment status [detailed version]) Next Steps
1. N/a
2. At work
3. At work, public emergency
4. Has job, not working
5. Armed forces
6. Armed forces, at work
7. Armed forces, not at work but with job
8. Unemployed
9. Unemployed, experienced worker
10. Unemployed, new worker
11. Not in Labor Force
12. Not in Labor Force, housework
13. Not in Labor Force, unable to work
14. Not in Labor Force, school
15. Not in Labor Force, other
16. Unknown/illegible
CLASSWKR: (Class of worker [general version]) Class of worker describes whether an individual is self-employed or employed by a corporation, reported as a numeric key.
1. N/a
2. Self-employed
3. Works for wages
4. Unknown - 0 cases
CLASSWKRD: (Class of worker [detailed version]) Next Steps
1. N/a
2. Self-employed
3. Employer
4. Working on own account
5. Self-employed, not incorporated
6. Self-employed, incorporated
7. Works for wages
8. Works for salary
9. Wage/salary, private
10. Wage/salary at non-profit
11. Wage/salary, government
12. Armed forces
13. State government employee
14. Local government employee
15. Unpaid family worker
16. Illegible
17. Unknown
UHRSWORK: (Usual hours worked per week) Normalize
LOOKING: (Looking for work) combine N/a, not reported; normalize
1. N/a
2. No
3. Yes
4. Not reported
INCTOT: (Total personal income) Normalize
FTOTINC: (Total family income) Normalize
INCWELFR: (Welfare (public assistance) income) Normalize
INCINVST: (Interest, dividend, and rental income) Normalize
POVERTY: (Poverty status) Normalize
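The "combine, then normalize" step noted for LOOKING, together with normalizing an income column, might look like the sketch below. Assumptions: the codes 1 to 4 follow the numbering of the list above (the raw IPUMS codes may differ), the rows are toy data, and MinMaxScaler stands in for "normalize".

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data; LOOKING codes follow the list numbering above (assumption):
# 1 = N/a, 2 = No, 3 = Yes, 4 = Not reported
df = pd.DataFrame({
    "LOOKING": [1, 2, 3, 4, 2],
    "INCTOT": [0, 25000, 40000, 100000, 60000],
})

# Fold "Not reported" (4) into "N/a" (1) so both share one code.
df["LOOKING"] = df["LOOKING"].replace({4: 1})

# Rescale both columns to the range 0 to 1.
cols = ["LOOKING", "INCTOT"]
df[cols] = MinMaxScaler().fit_transform(df[cols])
print(df)
```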

twitte01 commented 6 months ago

Scaling Functions

MinMax Scaler

CHOSEN

It preserves the shape of the original distribution.
The default range for the feature returned by MinMaxScaler is 0 to 1.

Outliers keep their relative distance from the rest of the data after scaling, so the scaled values can still be used with outlier detection algorithms.

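from sklearn.preprocessing import MinMaxScaler
data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
# fit() learns the per-feature min and max and returns the scaler itself
print(scaler.fit(data))
# Per-feature maximum seen during fit
print(scaler.data_max_)
# Training data rescaled to the range 0 to 1
print(scaler.transform(data))
# New data uses the fitted min and max, so values outside the fitted range land outside 0 to 1
print(scaler.transform([[2, 2]]))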

Max ABS Scaler

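The data is scaled to the range -1 to 1 (0 to 1 when all values are non-negative) by dividing each feature by its maximum absolute value.
Features don’t affect each other: each one is scaled independently, so if your dataset has more than one type of value, such as length and weight, they will not corrupt each other.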
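A quick sketch of MaxAbsScaler on the same toy data used in the MinMax example, to show the per-feature scaling. The variable names are just for illustration.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])

scaler = MaxAbsScaler()
scaled = scaler.fit_transform(data)

# Per-feature maximum absolute value learned during fit.
print(scaler.max_abs_)
# Each column divided by its own maximum absolute value.
print(scaled)
```

The first column contains a negative value, so its scaled values span -1 to 1, while the all-positive second column stays between 0 and 1.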

Robust Scaler

Unlike MinMax Scaler, there is no fixed output range.
It doesn’t work well for outlier detection.
It reduces the effect of outliers on the scaling.
Values are centered on the median and scaled by the interquartile range (Quartile 1 to Quartile 3).
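To make the median and interquartile range behaviour concrete, here is a small sketch with RobustScaler on a made-up feature containing one outlier. The data is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a single large outlier (toy values).
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()
scaled = scaler.fit_transform(data)

print(scaler.center_)  # median of the feature
print(scaler.scale_)   # interquartile range (Q3 - Q1)
print(scaled.ravel())
```

The outlier barely influences the median and the interquartile range, so the four ordinary values keep a sensible spread after scaling while the outlier stays far away from them.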

Standard Scaler

It works well at outlier detection.
Features need to be the same type. If one feature represents count and one represents length, that will cause a negative effect on data.
If a feature has outliers, standardizing will squeeze most of its values into a small interval.
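The effect of an outlier on StandardScaler can be seen in a short sketch. The data is the same made-up single feature with one outlier; nothing here comes from the census data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature with a single large outlier (toy values).
data = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

print(scaler.mean_)    # mean, pulled upward by the outlier
print(scaler.scale_)   # standard deviation, inflated by the outlier
print(scaled.ravel())

# Spread of the four ordinary values after scaling.
span = scaled[:4].max() - scaled[:4].min()
print(span)
```

Because the outlier inflates both the mean and the standard deviation, the four ordinary values end up packed into a narrow band.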

twitte01 commented 6 months ago

Household Census Variables

Technical Variables

YEAR: Normalize
SAMPLE: (IPUMS sample identifier) Remove
SERIAL: (Household serial number) Remove
CBSERIAL: (Original Census Bureau household serial number) Remove
HHWT: (Household weight) Remove
HHTYPE: (Household type) Combine 9 & 0; normalize
1. N/A
2. Married-couple family household
3. Male household, no wife present
4. Female household, no husband present
5. Male household, living alone
6. Male household, not living alone
7. Female household, living alone
8. Female household, not living alone
9. Could not be determined
CLUSTER: (Household cluster for variance estimation) REMOVE
CPI99: (CPI-U adjustment factor to 1999 dollars) REMOVE
STRATA: (Household strata for variance estimation) REMOVE

Geographic Variables

STATEICP: (State (ICPSR code)) Normalize (should potentially be one-hot encoded instead)
MET2023: (Metropolitan area (2023 delineations, identifiable areas only)) Remove: too many nulls (9 million out of 10 million)
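The "too many nulls" check could be done by computing the share of missing values per column, as sketched below. Assumptions: missing values are represented as NaN, the rows are toy data, and the 0.5 threshold is a hypothetical cut-off rather than one chosen in this issue.

```python
import numpy as np
import pandas as pd

# Toy rows; MET2023 is mostly missing, as in the note above.
df = pd.DataFrame({
    "STATEICP": [71, 71, 13, 49, 13],
    "MET2023": [np.nan, np.nan, np.nan, 35620, np.nan],
})

# Fraction of missing values in each column.
null_share = df.isna().mean()

# Hypothetical cut-off: drop columns that are more than half missing.
threshold = 0.5
to_drop = null_share[null_share > threshold].index.tolist()
df = df.drop(columns=to_drop)
print(to_drop, list(df.columns))
```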

Economic Characteristics

MOBLHOME: (Annual mobile home costs) Normalize
TAXINCL: (Mortgage payment includes property taxes) feature engineering
INSINCL: (Mortgage payment includes property insurance) feature engineering
RENTGRS: (Monthly gross rent) normalize
CONDOFEE: (Monthly condominium fee) Normalize
HHINCOME: (Total household income) Normalize
FOODSTMP: (Food stamp recipiency) Normalize
1. N/a
2. No
3. Yes
VALUEH: (House value) Normalize

Appliance, Mechanical, Other Variables

COSTELEC: (Annual electricity cost) Normalize
COSTGAS: (Annual gas cost) Normalize
COSTWATR: (Annual water cost) Normalize
COSTFUEL: (Annual home heating fuel cost) Normalize
CINETHH: (Access to internet) Normalize
1. N/a
2. Yes with subscription
3. Yes without subscription
4. No
VEHICLES: (Vehicles available) Normalize

Household Composition Variables

GQ: (Group quarters status) Combine household & group quarters then normalize
1. Vacant unit
2. Household under 1970 definition
3. Additional households under 1990 definition
4. Group quarters
5. Other group quarters
6. Additional households under 2000 definition
7. Fragment
FARM: (Farm status) Normalize
1. n/a
2. Non-Farm
3. Farm
4. Blank/missing
OWNERSHP: (Ownership of dwelling (tenure) [general version]) Normalize
1. n/a
2. Owned or being bought (loan)
3. Rented
OWNERSHPD: (Ownership of dwelling (tenure) [detailed version]) Normalize
1. N/a
2. Owned or being bought
3. Check mark (owns?)
4. Owned free and clear
5. Owned with mortgage or loan
6. Rented
7. No cash rent
8. With cash rent
COUPLETYPE: (Householder couple type) REMOVE too many nulls
1. n/a
2. Heterosexual married couple
3. Homosexual married couple
4. Heterosexual unmarried couple
5. Homosexual unmarried couple
NFAMS: (Number of families in household) Normalize
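For the GQ note above (combine households and group quarters), one possible sketch collapses the category labels into a binary flag. Assumptions: the column holds text labels like those in the GQ list, the helper `collapse_gq` and the 0/1 coding are hypothetical, and since the issue does not say how "Vacant unit" and "Fragment" should be handled they are left unmapped here.

```python
import pandas as pd

# Toy GQ column using labels from the list above.
gq = pd.Series([
    "Household under 1970 definition",
    "Group quarters",
    "Additional households under 1990 definition",
    "Other group quarters",
    "Additional households under 2000 definition",
], name="GQ")

household_labels = {
    "Household under 1970 definition",
    "Additional households under 1990 definition",
    "Additional households under 2000 definition",
}
group_quarters_labels = {"Group quarters", "Other group quarters"}

def collapse_gq(label):
    """Hypothetical helper: 0 = household, 1 = group quarters, None = unmapped."""
    if label in household_labels:
        return 0
    if label in group_quarters_labels:
        return 1
    return None

gq_binary = gq.map(collapse_gq)
print(gq_binary.tolist())
```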

twitte01 commented 6 months ago

@CanIGetAnAman I noted some variables from each dataset that I removed b/c they would need some feature engineering to be useful

twitte01 commented 5 months ago

@vitush99 Here is the preprocessing I did!!

twitte01 / 232R_GroupProject

Determine scaling and transformation needed #13
