ummel / fusionData

Data backend for fusionACS platform
https://ummel.github.io/fusionData/
GNU General Public License v3.0

[Optional] Write code to process 2019 ASEC person data #65

Closed by ummel 2 years ago

ummel commented 2 years ago

Depending on the complexity of the raw data, you may want to skip ingesting the individual person records and use only the household data initially. Eventually, we DO want the person records as well, but we can get going with just the household data if you prefer, then come back to the person records later.

If person records ARE ingested, follow the same general advice as for the household data, but use the "P" identifier for the file instead of "H".

heroashman commented 2 years ago

To my understanding, the CPS-ASEC consists of two parts: the Core questions and the ASEC questions. You always want the Core questions along with the ASEC questions because the Core contains things like demographics, weights, and location.

The Core questions contain both household and person level data. The ASEC questions only contain person level data (these are more detailed variables on work, income, insurance, poverty etc.). Because the ASEC is only at the person level, I assume we want to do both the person data and the household data in the ingestion?

ummel commented 2 years ago

OK, this is good to know. In this case, the person data isn't optional -- we have to have it. So go ahead and ingest it in the first pass. We may eventually want to aggregate the person-level variables to household level for the purposes of fusion (we can fuse either household- or person-level variables, since ACS PUMS contains both, but we've been working solely with household-level variables so far). For ingestion and harmonization, we will want the Core household variables ("H") and the person-level ASEC ("P"). Does this make sense?
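A rough sketch of what aggregating a person-level variable to household level could look like. Python is used purely for illustration, and the column names `hh_id` and `wage_income` are hypothetical, not actual ASEC variable names. Summing is only one possible aggregation; other variables might call for a max, a count, or the value for the reference person.

```python
from collections import defaultdict

# Hypothetical person-level records. `hh_id` and `wage_income` are
# illustrative names only, not actual ASEC variables.
persons = [
    {"hh_id": 1, "wage_income": 30000},
    {"hh_id": 1, "wage_income": 12000},
    {"hh_id": 2, "wage_income": 45000},
]


def aggregate_to_household(records, key, value):
    """Sum a person-level variable within each household."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key]] += rec[value]
    return dict(totals)


household_income = aggregate_to_household(persons, "hh_id", "wage_income")
print(household_income)
```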

heroashman commented 2 years ago

@ummel That makes sense, thanks!

heroashman commented 2 years ago

Code for processing 2019 ASEC person data is complete, and both ASEC_2019_P_dictionary.RDS and ASEC_2019_P_processed.fst have been created locally. I will wait to commit them until more checks are done and the universal dictionary is updated.

heroashman commented 2 years ago

@ummel There is a lot of item non-response in the person-level ASEC, especially for the income variables. The Census Bureau imputes all of the non-response values using a hot-deck allocation method, which fills in a missing value with an observed value of that variable, matched on the respondent's other characteristics (see details here). For every variable that has some imputed values, there is a corresponding quality flag variable indicating which values were imputed. I've included these flags in the ASEC data we have.
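To make the general idea of hot-deck imputation concrete, here is a minimal sketch of a simple sequential hot-deck. It is not the Census Bureau's actual procedure, which is considerably more elaborate. Python is used for illustration only, and the field names (`age_group`, `sex`, `income`, `imputed`) are hypothetical.

```python
# Hypothetical records; None marks item non-response.
respondents = [
    {"age_group": "25-34", "sex": "F", "income": 40000},
    {"age_group": "25-34", "sex": "F", "income": None},
    {"age_group": "55-64", "sex": "M", "income": 52000},
    {"age_group": "55-64", "sex": "M", "income": None},
]


def hot_deck(records, value, cell_keys):
    """Sequential hot-deck: fill a missing value with the most recently
    observed value from an earlier record in the same cell, where a cell
    is defined by the matching characteristics in `cell_keys`."""
    last_donor = {}
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        rec["imputed"] = False
        cell = tuple(rec[k] for k in cell_keys)
        if rec[value] is None:
            if cell in last_donor:
                rec[value] = last_donor[cell]
                rec["imputed"] = True  # analogous to a quality flag
            # With no donor seen yet, the value simply stays missing.
        else:
            last_donor[cell] = rec[value]
        out.append(rec)
    return out


filled = hot_deck(respondents, "income", ["age_group", "sex"])
```

The `imputed` field plays the same role as the quality flags described above: it records which values were filled in rather than reported.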

My question is, should we ignore the Census's allocated values, and instead treat them as missing and impute them ourselves? Or should we keep the Census's allocated values and not re-impute these missing values?
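For reference, the first option (treating allocated values as missing) could be sketched as below. This is Python for illustration only. The names `wage_income` and `wage_income_flag` are hypothetical, and the flag coding is an assumption (0 means reported, anything else means allocated) that would need to be checked against the ASEC documentation for each variable.

```python
# Hypothetical records with a value and its companion quality flag.
# Assumption: flag == 0 means reported; any other value means allocated.
records = [
    {"wage_income": 30000, "wage_income_flag": 0},
    {"wage_income": 28000, "wage_income_flag": 1},
    {"wage_income": 45000, "wage_income_flag": 0},
]


def mask_allocated(rows, value, flag):
    """Return copies of rows with allocated values replaced by None."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the allocated values are preserved upstream
        if row[flag] != 0:
            row[value] = None
        out.append(row)
    return out


masked = mask_allocated(records, "wage_income", "wage_income_flag")

# Share of values we would have to re-impute ourselves under this option.
share_allocated = sum(r["wage_income"] is None for r in masked) / len(masked)
```

Keeping the flags in the processed data leaves both options open, since the allocated values can be masked later without re-ingesting the raw files.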