jschulberg / Federal-Sentencing


Figure out which Columns to Retain #3

Open jschulberg opened 2 years ago

jschulberg commented 2 years ago

So we have a serious issue: the dataset for a given year can sometimes have as many as 30k variables, and unfortunately many of them are fully or mostly blank. So we either have to:

jschulberg commented 2 years ago

Backward Selection

Here's a breakdown of the 322 best populated columns in the FY2010 dataset:

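| Variable | Percentage Null |
| --- | --- |
| AGE | 0 |
| ALTDUM | 0 |
| ALTMO | 0 |
| AMTFINEC | 0 |
| AMTREST | 0 |
| AMTTOTAL | 0 |
| CH5G13YN | 0 |
| CIRCDIST | 0 |
| COMDUM | 0 |
| COSTSDUM | 0 |
| COSTSUP | 0 |
| DAYSDUM | 0 |
| DISPOSIT | 0 |
| DISTRICT | 0 |
| DOBMON | 0 |
| DOBYR | 0 |
| DRUGMIN | 0 |
| DSIND | 0 |
| DSJANDC | 0 |
| DSPLEA | 0 |
| DSPSR | 0 |
| DSSOR | 0 |
| ECONDUM | 0 |
| FAILMIN | 0 |
| FINE | 0 |
| FINECDUM | 0 |
| FINEDUM | 0 |
| FINEWAIV | 0 |
| FIREMIN1 | 0 |
| FIREMIN2 | 0 |
| GUNMIN1 | 0 |
| GUNMIN2 | 0 |
| GUNMIN3 | 0 |
| HISPORIG | 0 |
| HOMDUM | 0 |
| HRCOMSRV | 0 |
| IDMIN | 0 |
| IMMIMIN | 0 |
| INTDUM | 0 |
| IS1028A | 0 |
| IS924C | 0 |
| MITCAP | 0 |
| MOCOMCON | 0 |
| MOHOMDET | 0 |
| MOINTCON | 0 |
| MONCIRC | 0 |
| MONRACE | 0 |
| NEWCNVTN | 0 |
| NOCOUNTS | 0 |
| NOUSTAT | 0 |
| OFFTYPE2 | 0 |
| ONLY1028A | 0 |
| ONLY924C | 0 |
| OTHRMIN | 0 |
| POOFFICE | 0 |
| PORNMIN | 0 |
| PRISDUM | 0 |
| PROBATN | 0 |
| PROBDUM | 0 |
| QUARTER | 0 |
| RELMIN | 0 |
| REPSXMIN | 0 |
| RESTDUM | 0 |
| SENSPLT0 | 0 |
| SENTIMP | 0 |
| SENTMON | 0 |
| SENTYR | 0 |
| SEXMIN | 0 |
| SORFORM | 0 |
| SOURCES | 0 |
| SUPRDUM | 0 |
| SUPREL | 0 |
| TIMSERVD | 0 |
| TIMSERVM | 0 |
| TOTDAYS | 0 |
| TOTPRISN | 0 |
| TOTREST | 0 |
| TOTUNIT | 0 |
| TYPEMONY | 0 |
| TYPEOTHS | 0 |
| USSCIDN | 0 |
| WEAPON | 0 |
| YEARS | 0 |
| ACCGDLN | 0.01 |
| CITIZEN | 0.01 |
| GLMAX | 0.01 |
| GLMIN | 0.01 |
| MONSEX | 0.01 |
| NEWCIT | 0.01 |
| SENTTOT0 | 0.01 |
| TIMESERV | 0.01 |
| XCRHISSR | 0.01 |
| XFOLSOR | 0.01 |
| XMAXSOR | 0.01 |
| XMINSOR | 0.01 |
| ZONE | 0.01 |
| BOOKER2 | 0.02 |
| BOOKER3 | 0.02 |
| BOOKERCD | 0.02 |
| CITWHERE | 0.02 |
| SMAX1 | 0.02 |
| SMIN1 | 0.02 |
| STATMAX | 0.02 |
| STATMIN | 0.02 |
| CRIMHIST | 0.03 |
| ENCRYPT1 | 0.03 |
| ENCRYPT2 | 0.03 |
| ABUS1 | 0.04 |
| ABUSHI | 0.04 |
| ABUSS1 | 0.04 |
| ABUSSHI | 0.04 |
| ACCAP | 0.04 |
| ACCTRESP | 0.04 |
| ADJOFL1 | 0.04 |
| ADJOFLHI | 0.04 |
| AGGROL1 | 0.04 |
| AGGROLHI | 0.04 |
| AMENDYR | 0.04 |
| BASADJ1 | 0.04 |
| BASADJHI | 0.04 |
| BASE1 | 0.04 |
| BASEHI | 0.04 |
| CAROFFAP | 0.04 |
| CHAP2 | 0.04 |
| COADJLEV | 0.04 |
| CRIMLIV | 0.04 |
| CRIMPTS | 0.04 |
| CRPTS | 0.04 |
| FLIGHT1 | 0.04 |
| FLIGHTHI | 0.04 |
| MITROL1 | 0.04 |
| MITROLHI | 0.04 |
| MONACCEP | 0.04 |
| NOCOMP | 0.04 |
| OBSTRC1 | 0.04 |
| OBSTRCHI | 0.04 |
| OFFVCT1 | 0.04 |
| OFFVCTHI | 0.04 |
| POINT1 | 0.04 |
| POINT2 | 0.04 |
| POINT3 | 0.04 |
| REL2PTS | 0.04 |
| RSTRVC1 | 0.04 |
| RSTRVCHI | 0.04 |
| SENTPTS | 0.04 |
| SEXCAP | 0.04 |
| SEXOFFNA | 0.04 |
| SEXOFFNB | 0.04 |
| TEROR1 | 0.04 |
| TERORHI | 0.04 |
| TOTCHPTS | 0.04 |
| USARM1 | 0.04 |
| USARMHI | 0.04 |
| USKID1 | 0.04 |
| USKIDHI | 0.04 |
| VIOL1PTS | 0.04 |
| VULVCT1 | 0.04 |
| VULVCTHI | 0.04 |
| WEAPSOC | 0.04 |
| FALDM1 | 0.05 |
| FALDMHI | 0.05 |
| NEWRACE | 0.05 |
| RLEAS1 | 0.05 |
| RLEASHI | 0.05 |
| PRESENT | 0.06 |
| ADJ_B1 | 0.07 |
| ADJ_BHI | 0.07 |
| NUMDEPEN | 0.08 |
| EDUCATN | 0.09 |
| NEWEDUC | 0.09 |
| SENSPLT | 0.09 |
| SENTTOT | 0.12 |
| MAND1 | 0.14 |
| SPECASSM | 0.16 |
| ARMCRIM | 0.19 |
| CAROFFEN | 0.19 |
| SEXACCA | 0.19 |
| SEXACCB | 0.19 |
| RESTDET1 | 0.2 |
| ADJ_C1 | 0.39 |
| ADJ_CHI | 0.39 |
| ADJ_D1 | 0.39 |
| ADJ_DHI | 0.39 |
| RANGEPT | 0.4 |
| ADJ_E1 | 0.45 |
| ADJ_EHI | 0.45 |
| ADJ_F1 | 0.46 |
| ADJ_FHI | 0.46 |
| ADJ_G1 | 0.47 |
| ADJ_GHI | 0.47 |
| ADJ_H1 | 0.47 |
| ADJ_HHI | 0.47 |
| ADJ_I1 | 0.62 |
| ADJ_IHI | 0.62 |
| ADJ_J1 | 0.62 |
| ADJ_JHI | 0.62 |
| REAS1 | 0.62 |
| ADJ_K1 | 0.67 |
| ADJ_KHI | 0.67 |
| ADJ_L1 | 0.67 |
| ADJ_LHI | 0.67 |
| INOUT | 0.7 |
| COMBDRG2 | 0.74 |
| DRUGTYP1 | 0.74 |
| SAFE | 0.74 |
| SAFETY | 0.74 |
| UNIT1 | 0.74 |
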
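
For anyone who wants to reproduce a breakdown like this one, here is a minimal sketch of one way to do it. It is not necessarily the code used for the table above. It assumes the "Percentage Null" values are fractions between 0 and 1, treats empty strings as null, and uses a hypothetical `max_null=0.75` cutoff. The sample rows and the `MADEUP` column are invented for illustration; only the other column names come from the table.

```python
import csv
import io


def null_fractions(rows, columns):
    """Return {column: fraction of rows where the value is missing or blank}."""
    counts = {col: 0 for col in columns}
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                counts[col] += 1
    if total == 0:
        return {col: 0.0 for col in columns}
    return {col: counts[col] / total for col in columns}


def columns_to_keep(fractions, max_null=0.75):
    """Keep columns at or below the cutoff, best populated first."""
    kept = [col for col, frac in fractions.items() if frac <= max_null]
    return sorted(kept, key=lambda col: (fractions[col], col))


# Tiny invented sample; MADEUP is a hypothetical, fully blank column.
sample = io.StringIO(
    "AGE,CRIMHIST,DRUGTYP1,MADEUP\n"
    "34,2,,\n"
    "51,,1,\n"
    "27,1,,\n"
    "45,3,,\n"
)
reader = csv.DictReader(sample)
fractions = null_fractions(reader, reader.fieldnames)
kept = columns_to_keep(fractions, max_null=0.75)
```

Starting from every column and dropping the sparse ones is the "backward" direction. The cutoff is a judgment call, and a mostly blank column can still matter if it only applies to a subset of cases (drug-related variables, for example).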
jschulberg commented 2 years ago

You can see my attempts at forward selection...it's pretty tedious, but honestly a good way to learn more about the data.

jschulberg commented 2 years ago

Note: Check Appendix C in the USSC Codebook for the grouping of different variables, like demographic variables.