socialfoundations / folktables

Datasets derived from US census data
MIT License

ACSPublicCoverage Income Threshold #14

Closed britneyting closed 2 years ago

britneyting commented 2 years ago

Hi, thank you for creating these datasets! I am currently looking into ACSPublicCoverage, and I have a few questions about the criteria used in its creation:

  1. Why was $30k chosen as the cutoff for ACSPublicCoverage? 'Low-income' status varies depending on state, number of household members, etc. Applying a $30k threshold to every state when each state has a different cost of living doesn't make sense to me, and I was unable to find a justification for this cutoff in the paper.

  2. ACSPublicCoverage focuses on individuals, but low-income status also depends on household size, which isn't one of the 19 features. On the other hand, the dataset includes "ESP" (employment status of parents), even though individuals aren't always living with their parents, so I am concerned that including ESP while omitting household size could affect model training/predictions.

I would love to hear your thoughts on this matter - thank you again!

Tagging people who are also interested in this: @romanlutz, @imatiach-msft, @kspieks

francesding commented 2 years ago

Hi, thanks for the questions!

You're absolutely right that designation as "low income" usually depends on household size, and also that for certain programs, it may depend on the state or local cost-of-living estimates. Unfortunately, the individual-level records that we access for ACSPublicCoverage don't have a household-size feature, which is why we didn't include it. I believe it's possible to link individual-level records with household-level records (see page 13 here) so that each individual gets a household-size feature, but we kept the scope of the first iteration of folktables limited to the individual person records from the Census.
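As a rough sketch of the linkage idea above: in the ACS PUMS, the person and housing-unit files share a serial-number key (`SERIALNO`), and the housing file carries an `NP` column with the number of persons in the unit, so a join could attach household size to each person record. The data below is synthetic and the exact column semantics should be checked against the PUMS data dictionary:

```python
import pandas as pd

# Synthetic stand-ins for the ACS PUMS person and housing-unit files.
# SERIALNO links the two files; NP (housing file) is the household size.
person = pd.DataFrame({
    "SERIALNO": ["H1", "H1", "H2", "H3"],
    "PINCP":    [12000, 25000, 41000, 8000],   # personal income
})
housing = pd.DataFrame({
    "SERIALNO": ["H1", "H2", "H3"],
    "NP":       [3, 1, 5],                     # persons in the household
})

# Attach household size (NP) to every person record.
person = person.merge(housing[["SERIALNO", "NP"]], on="SERIALNO", how="left")
print(person["NP"].tolist())  # [3, 3, 1, 5]
```

With that extra column in place, household size could be added to the feature list of a custom prediction problem.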

We also set a single income threshold rather than state-specific thresholds to keep the example prediction problem relatively simple. Our main goal with the five prediction problems included in folktables was to illustrate the range of target variables made available in Census data and provide examples for how to define new problems of interest. There are many subtleties to the factors that contribute to whether individuals are more or less likely to have public health insurance, as you point out, and ACSPublicCoverage is definitely not designed to produce predictive models that allow for interpretation of those factors. That being said, if it would help to have state-specific filtering criteria added to the definition of ACSPublicCoverage, I'd be happy to add that in and discuss with you what criteria would be most appropriate!
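To make the state-specific idea concrete, here is one way a per-state filter could look. The threshold values and state codes below are hypothetical placeholders, not actual Medicaid eligibility numbers; `ST`, `AGEP`, and `PINCP` are the PUMS state, age, and income columns, and the age cutoff mirrors the non-elderly filter in the current ACSPublicCoverage definition:

```python
import pandas as pd

# Hypothetical per-state income cutoffs, keyed by ACS ST state code.
# Real values would have to come from each state's Medicaid rules.
STATE_THRESHOLDS = {6: 35000, 48: 28000}   # e.g. 6 = CA, 48 = TX (illustrative)
DEFAULT_THRESHOLD = 30000                  # fall back to the current cutoff

def public_coverage_filter(df: pd.DataFrame) -> pd.DataFrame:
    """Like the existing ACSPublicCoverage filter, but with a per-state cutoff."""
    df = df[df["AGEP"] < 65]  # restrict to non-elderly individuals
    cutoffs = df["ST"].map(STATE_THRESHOLDS).fillna(DEFAULT_THRESHOLD)
    return df[df["PINCP"] <= cutoffs]

# Tiny synthetic example: the CA row passes its higher cutoff,
# the TX row fails its lower one.
df = pd.DataFrame({"ST": [6, 48], "AGEP": [30, 40], "PINCP": [32000, 29000]})
print(public_coverage_filter(df)["ST"].tolist())  # [6]
```

A function of this shape could be dropped in as the `preprocess` step of a custom prediction problem alongside the existing ones.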

As for the $30,000 income threshold, we arrived at that number by starting with the average US household size (2.53 people), rounding that up to 3, finding that the Federal Poverty Line in 2021 for 3-person households was $21,960, and then multiplying that number by 1.33, since the Affordable Care Act allowed states to extend Medicaid eligibility to adults with income up to 133% of the Federal Poverty Line. That comes to $29,206.80, which we rounded up to $30,000. As I mentioned above, though, since this was a very rough approximation, we're very open to modifying the prediction problem, or creating a more detailed, state-specific one, if you have that need.
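Spelled out in code, the arithmetic described above (all values copied from that explanation) is:

```python
fpl_3_person_2021 = 21960          # Federal Poverty Line, 3-person household, 2021
medicaid_expansion_factor = 1.33   # ACA expansion: eligibility up to 133% of FPL
cutoff = fpl_3_person_2021 * medicaid_expansion_factor
print(round(cutoff, 2))  # 29206.8, then rounded up to the $30,000 threshold
```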

For your second question, we didn't have a particular reason for including "employment status of parents", other than that it could possibly be predictive for individuals who are still living with their parents. In general, we thought it was interesting from a machine learning perspective to include both features that we intuitively expected to be predictive and features that we didn't, since that is the case for many machine learning applications that indiscriminately use all the data available to them. It seems plausible that including the ESP variable could make some models less accurate or make the model internals less interpretable -- it would be interesting to study which models are more vulnerable in this way, or how the ESP variable could be flagged as part of data cleaning.

Hope that was helpful -- thanks again for the thoughtful questions!