Open jschulberg opened 2 years ago
Here's a breakdown of the 322 best-populated columns in the FY2010 dataset:
Variable | Percentage Null |
---|---|
AGE | 0 |
ALTDUM | 0 |
ALTMO | 0 |
AMTFINEC | 0 |
AMTREST | 0 |
AMTTOTAL | 0 |
CH5G13YN | 0 |
CIRCDIST | 0 |
COMDUM | 0 |
COSTSDUM | 0 |
COSTSUP | 0 |
DAYSDUM | 0 |
DISPOSIT | 0 |
DISTRICT | 0 |
DOBMON | 0 |
DOBYR | 0 |
DRUGMIN | 0 |
DSIND | 0 |
DSJANDC | 0 |
DSPLEA | 0 |
DSPSR | 0 |
DSSOR | 0 |
ECONDUM | 0 |
FAILMIN | 0 |
FINE | 0 |
FINECDUM | 0 |
FINEDUM | 0 |
FINEWAIV | 0 |
FIREMIN1 | 0 |
FIREMIN2 | 0 |
GUNMIN1 | 0 |
GUNMIN2 | 0 |
GUNMIN3 | 0 |
HISPORIG | 0 |
HOMDUM | 0 |
HRCOMSRV | 0 |
IDMIN | 0 |
IMMIMIN | 0 |
INTDUM | 0 |
IS1028A | 0 |
IS924C | 0 |
MITCAP | 0 |
MOCOMCON | 0 |
MOHOMDET | 0 |
MOINTCON | 0 |
MONCIRC | 0 |
MONRACE | 0 |
NEWCNVTN | 0 |
NOCOUNTS | 0 |
NOUSTAT | 0 |
OFFTYPE2 | 0 |
ONLY1028A | 0 |
ONLY924C | 0 |
OTHRMIN | 0 |
POOFFICE | 0 |
PORNMIN | 0 |
PRISDUM | 0 |
PROBATN | 0 |
PROBDUM | 0 |
QUARTER | 0 |
RELMIN | 0 |
REPSXMIN | 0 |
RESTDUM | 0 |
SENSPLT0 | 0 |
SENTIMP | 0 |
SENTMON | 0 |
SENTYR | 0 |
SEXMIN | 0 |
SORFORM | 0 |
SOURCES | 0 |
SUPRDUM | 0 |
SUPREL | 0 |
TIMSERVD | 0 |
TIMSERVM | 0 |
TOTDAYS | 0 |
TOTPRISN | 0 |
TOTREST | 0 |
TOTUNIT | 0 |
TYPEMONY | 0 |
TYPEOTHS | 0 |
USSCIDN | 0 |
WEAPON | 0 |
YEARS | 0 |
ACCGDLN | 0.01 |
CITIZEN | 0.01 |
GLMAX | 0.01 |
GLMIN | 0.01 |
MONSEX | 0.01 |
NEWCIT | 0.01 |
SENTTOT0 | 0.01 |
TIMESERV | 0.01 |
XCRHISSR | 0.01 |
XFOLSOR | 0.01 |
XMAXSOR | 0.01 |
XMINSOR | 0.01 |
ZONE | 0.01 |
BOOKER2 | 0.02 |
BOOKER3 | 0.02 |
BOOKERCD | 0.02 |
CITWHERE | 0.02 |
SMAX1 | 0.02 |
SMIN1 | 0.02 |
STATMAX | 0.02 |
STATMIN | 0.02 |
CRIMHIST | 0.03 |
ENCRYPT1 | 0.03 |
ENCRYPT2 | 0.03 |
ABUS1 | 0.04 |
ABUSHI | 0.04 |
ABUSS1 | 0.04 |
ABUSSHI | 0.04 |
ACCAP | 0.04 |
ACCTRESP | 0.04 |
ADJOFL1 | 0.04 |
ADJOFLHI | 0.04 |
AGGROL1 | 0.04 |
AGGROLHI | 0.04 |
AMENDYR | 0.04 |
BASADJ1 | 0.04 |
BASADJHI | 0.04 |
BASE1 | 0.04 |
BASEHI | 0.04 |
CAROFFAP | 0.04 |
CHAP2 | 0.04 |
COADJLEV | 0.04 |
CRIMLIV | 0.04 |
CRIMPTS | 0.04 |
CRPTS | 0.04 |
FLIGHT1 | 0.04 |
FLIGHTHI | 0.04 |
MITROL1 | 0.04 |
MITROLHI | 0.04 |
MONACCEP | 0.04 |
NOCOMP | 0.04 |
OBSTRC1 | 0.04 |
OBSTRCHI | 0.04 |
OFFVCT1 | 0.04 |
OFFVCTHI | 0.04 |
POINT1 | 0.04 |
POINT2 | 0.04 |
POINT3 | 0.04 |
REL2PTS | 0.04 |
RSTRVC1 | 0.04 |
RSTRVCHI | 0.04 |
SENTPTS | 0.04 |
SEXCAP | 0.04 |
SEXOFFNA | 0.04 |
SEXOFFNB | 0.04 |
TEROR1 | 0.04 |
TERORHI | 0.04 |
TOTCHPTS | 0.04 |
USARM1 | 0.04 |
USARMHI | 0.04 |
USKID1 | 0.04 |
USKIDHI | 0.04 |
VIOL1PTS | 0.04 |
VULVCT1 | 0.04 |
VULVCTHI | 0.04 |
WEAPSOC | 0.04 |
FALDM1 | 0.05 |
FALDMHI | 0.05 |
NEWRACE | 0.05 |
RLEAS1 | 0.05 |
RLEASHI | 0.05 |
PRESENT | 0.06 |
ADJ_B1 | 0.07 |
ADJ_BHI | 0.07 |
NUMDEPEN | 0.08 |
EDUCATN | 0.09 |
NEWEDUC | 0.09 |
SENSPLT | 0.09 |
SENTTOT | 0.12 |
MAND1 | 0.14 |
SPECASSM | 0.16 |
ARMCRIM | 0.19 |
CAROFFEN | 0.19 |
SEXACCA | 0.19 |
SEXACCB | 0.19 |
RESTDET1 | 0.2 |
ADJ_C1 | 0.39 |
ADJ_CHI | 0.39 |
ADJ_D1 | 0.39 |
ADJ_DHI | 0.39 |
RANGEPT | 0.4 |
ADJ_E1 | 0.45 |
ADJ_EHI | 0.45 |
ADJ_F1 | 0.46 |
ADJ_FHI | 0.46 |
ADJ_G1 | 0.47 |
ADJ_GHI | 0.47 |
ADJ_H1 | 0.47 |
ADJ_HHI | 0.47 |
ADJ_I1 | 0.62 |
ADJ_IHI | 0.62 |
ADJ_J1 | 0.62 |
ADJ_JHI | 0.62 |
REAS1 | 0.62 |
ADJ_K1 | 0.67 |
ADJ_KHI | 0.67 |
ADJ_L1 | 0.67 |
ADJ_LHI | 0.67 |
INOUT | 0.7 |
COMBDRG2 | 0.74 |
DRUGTYP1 | 0.74 |
SAFE | 0.74 |
SAFETY | 0.74 |
UNIT1 | 0.74 |
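
For anyone who wants to reproduce a breakdown like this for another year, here is a minimal sketch. It is not the code that produced the table above. The function names `null_fractions` and `best_populated`, the set of strings treated as null, the 0.75 cutoff, and the sample rows are all assumptions made up for illustration. It reports the null share as a fraction between 0 and 1, so multiply by 100 if you want percentages.

```python
import csv
import io

# Assumption: these strings count as "null"; the real USSC files may encode
# missing values differently.
NULL_TOKENS = {"", "NA", "NULL", "."}


def null_fractions(rows, null_tokens=NULL_TOKENS):
    """Return {column: fraction of rows in which that column is null}."""
    counts = {}
    total = 0
    for row in rows:
        total += 1
        for col, value in row.items():
            if col is None:
                continue  # extra unnamed fields from a malformed row
            is_null = value is None or value.strip() in null_tokens
            counts[col] = counts.get(col, 0) + (1 if is_null else 0)
    if total == 0:
        return {}
    return {col: n / total for col, n in counts.items()}


def best_populated(fractions, max_null=0.75):
    """Columns at or below the cutoff, sorted by null fraction, then name."""
    keep = [(frac, col) for col, frac in fractions.items() if frac <= max_null]
    return [(col, round(frac, 2)) for frac, col in sorted(keep)]


# Tiny made-up sample, only to show the shape of the output.
sample = io.StringIO("AGE,ZONE,UNIT1\n34,A,\n51,,\n27,B,kg\n40,C,\n")
table = best_populated(null_fractions(csv.DictReader(sample)))
for col, frac in table:
    print(f"{col} | {frac} |")
```

With a real yearly file you would pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample. For files with tens of thousands of columns, a dataframe library would likely be faster, but the logic is the same.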
You can see my attempts at forward selection...it's pretty tedious, but honestly a good way to learn more about the data.
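
For reference, the general shape of greedy forward selection is sketched below. This is not the procedure used in my attempts, just the textbook idea: start with no variables, repeatedly add the one that improves a score the most, and stop when nothing helps. `forward_select`, `toy_score`, and the usefulness numbers are hypothetical. In practice the score would be something like cross-validated model performance.

```python
def forward_select(candidates, score, max_features=None, min_gain=0.0):
    """Greedy forward selection.

    candidates: iterable of variable names.
    score: callable taking a list of names and returning a number
           (higher is better).
    Returns (selected names in the order they were added, final score).
    """
    remaining = list(candidates)
    selected = []
    best = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        # Try adding each remaining variable and keep the best single addition.
        trial_scores = [(score(selected + [name]), name) for name in remaining]
        top_score, top_name = max(trial_scores)
        if top_score - best <= min_gain:
            break  # no candidate improves the score enough
        selected.append(top_name)
        remaining.remove(top_name)
        best = top_score
    return selected, best


# Toy score, purely hypothetical: each variable has a fixed made-up
# "usefulness", and every added variable costs a small penalty.
USEFULNESS = {"AGE": 0.30, "CRIMHIST": 0.25, "ZONE": 0.01, "QUARTER": 0.0}


def toy_score(names):
    return sum(USEFULNESS[n] for n in names) - 0.02 * len(names)


chosen, final_score = forward_select(USEFULNESS, toy_score)
print(chosen, round(final_score, 2))
```

Each round evaluates every remaining candidate, so the cost grows quickly with the number of variables, which is part of why dropping sparsely populated columns first makes this less tedious.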
Note: Check Appendix C in the USSC Codebook for the grouping of different variables, like demographic variables.
We have a serious issue: the dataset for a given year can have as many as 30k variables, and unfortunately many of them are fully or mostly blank. So we either have to: