theonaunheim / surgeo

Open Source Proxy Demographic module written in Python
MIT License
32 stars 16 forks

Update with 2020 Census data #20

Open joristaglio opened 2 years ago

joristaglio commented 2 years ago

Hi all! This is an awesome tool, thanks for building this.

Now that 2020 Census data is available, is it possible to update the data this pulls from? I'm happy to help in any way, including data cleaning and making it an optional keyword to prevent people from having their surgeo predictions change unexpectedly.

Any information you have about where you sourced the data/any special data cleaning you needed to format it would be helpful, and I can open a pull request with full test coverage as well.
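To make the "optional keyword" idea concrete, here is a minimal sketch of how a data vintage could be selected without changing existing predictions. Everything in it is an assumption for illustration: `census_year`, `resolve_table`, and the file name are hypothetical and are not part of surgeo's current API.

```python
# Hypothetical sketch: none of these names come from surgeo itself.
DEFAULT_CENSUS_YEAR = 2010

# Registry of data files by Census vintage. The file name is invented
# for illustration; a 2020 entry would be added once the data exists.
_TABLES = {
    2010: "prob_zcta_given_race_2010.csv",
}


def resolve_table(census_year=DEFAULT_CENSUS_YEAR):
    """Return the data file for a vintage, defaulting to 2010.

    Existing callers who never pass census_year keep getting the 2010
    data, so their predictions do not change unexpectedly.
    """
    try:
        return _TABLES[census_year]
    except KeyError:
        available = ", ".join(str(year) for year in sorted(_TABLES))
        raise ValueError(
            f"No data for census year {census_year}; available: {available}"
        ) from None
```

With this shape, new data becomes strictly opt-in: a caller would have to pass the newer year explicitly, and an unsupported year fails loudly rather than silently falling back.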

praveenjaikant commented 1 year ago

Hey @joristaglio! Do you know where I could find the 2020 Census data? Please let me know, thanks.

lydubs commented 1 year ago

Hey @TheCleric, @theonaunheim, @nicanor-b - I was also wondering if there were plans to update the "prob_" datasets with 2020 Census data. Happy to help out. Thanks so much!

TheCleric commented 1 year ago

I believe that, the last time I checked, the files we would need had not yet been published by the Census Bureau. If you have a link showing otherwise, I'd love to know.

lydubs commented 1 year ago

@TheCleric, are these the files that you're looking for?

Overview page: https://www.census.gov/programs-surveys/decennial-census/about/rdo/summary-files/2020.html#P1

Files: https://www2.census.gov/programs-surveys/decennial/2020/data/01-Redistricting_File--PL_94-171/

It also looks like the US Census has made data available in a new format, if that's helpful: https://data.census.gov/table?q=race+and+ethnicity&tid=DECENNIALPL2020.P2

Let me know how I can help!

theonaunheim commented 1 year ago

Hey, @TheCleric: I haven't touched this in a year, which is a pretty good indication that I should bow out. Do you want to own this repo?

lydubs commented 1 year ago

Hey @TheCleric, I wanted to follow up on this and see how I can help! Thanks!

TheCleric commented 1 year ago

Sorry for the delay. @lydubs, thanks for the links. Looking through them, I think this would help us with a zip code update, but I don't see any data relating to surnames (which we would need to update the BISG/SurGeo model) nor first names (which we would need, along with the surname data, to update the BIFSG model).

I suppose we could update only the Geocode model, but I wonder if it would be confusing to have 1 out of 5 models on 2020 data and the rest on 2010 data. I definitely don't think it would be wise to do something like updating the zip data for the BISG and BIFSG models without also updating surnames and first names. I think mixing years of data like that could produce inaccuracies.

As well @theonaunheim, I would consider taking over if you no longer wish to maintain this.

lydubs commented 1 year ago

@TheCleric, my turn to apologize for the delay! That all makes sense to me - I wasn't able to find the 2020 Census data for names either. I would think that having more recent probabilities for race given location would better reflect reality, even if the surname data came from an older Census, but I understand wanting to update everything at once.

TheCleric commented 1 year ago

I think my hesitance at this point is I'm not a data scientist, so I do not know which would be preferable:

1) To have the latest data we can have (even if it's across census iterations).

OR

2) To have consistent data across census iterations (even if slightly older).

I lean towards number 2 out of ignorance (as it's our current situation). My thinking, which could be wrong, is the whole point of the library is to make probabilistic inferences about data based on demographic information. If there is a shift in that demographic information (which is inevitable after 10 years), is it safe to use only a portion of that shift to come to an inference?

On one hand we may get closer to correct (since SOME of the data is recent), but we might also get further away from being correct (by only capturing part of the shift).

I honestly don't know what the correct call is here.
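For anyone weighing the two options, a small sketch of the standard BISG update may help show where the geography table enters the calculation. This is the textbook Bayes combination, P(race | surname, geo) ∝ P(race | surname) × P(geo | race), and not necessarily how surgeo implements it. The function name, the three-category breakdown, and all numbers below are made up for illustration and are not Census figures.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    Both arguments are dicts keyed by race category. The result is the
    normalized product of the two, one probability per category.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("No category has nonzero probability.")
    return {race: value / total for race, value in unnormalized.items()}


# Illustrative, invented inputs (NOT real Census values).
surname_2010 = {"white": 0.60, "black": 0.30, "other": 0.10}
geo_2010 = {"white": 0.0002, "black": 0.0001, "other": 0.0001}
geo_2020 = {"white": 0.0001, "black": 0.0002, "other": 0.0001}

# Option 2: both inputs from the same Census.
consistent = bisg_posterior(surname_2010, geo_2010)

# Option 1: newer geography table, older surname table.
mixed = bisg_posterior(surname_2010, geo_2020)

for label, result in (("consistent", consistent), ("mixed", mixed)):
    print(label, {race: round(p, 3) for race, p in result.items()})
```

Because the posterior is a product of the two tables, swapping in a newer geography table changes the result on its own. The sketch cannot say which result is closer to the truth; that would need validation against records with known self-reported race.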

collinstarkweather commented 5 months ago

Roughly a year has passed since the last comment, so I thought I would check in and see if there is any update on 2020 Census data availability.

I see from this Census announcement that the release of Detailed Demographic and Housing Characteristics File B (Detailed DHC-B) is anticipated in September 2024.

Will that provide the data necessary to update the library? If and when the data is available, is some help needed with the update?