kavigupta / urbanstats

A website for viewing statistics of various areas in the United States.
https://urbanstats.org/

Comparison broken for Dublin CA and Ireland #393

Open lukebrody opened 5 days ago

lukebrody commented 5 days ago

https://urbanstats.org/comparison.html?longnames=%5B%22Dublin+city%2C+California%2C+USA%22%2C%22Dublin+Urban+Center%2C+Ireland%22%5D

Dublin CA shows NaN in stats


kavigupta commented 5 days ago

Okay, so for some context on this issue, there are two sources of 2020 population data:

  1. The US Census 2020. This data set is high quality and extremely granular. The way we process it (as block centroids), it takes the form of a dense web of points, each with some number of people in it. This data set only exists for the 50 states + DC + PR.
  2. GHS gridded population data 2020. This data set is estimated and flawed in many ways, but is the best data set for regions outside the US.

We use the Census data for all American regions and the GHS data for all international regions, as well as for international-comparable American regions (states, urban centers). For these international-comparable American regions, two numbers are computed, one from each source.

A further complication is that census data is used to compute the population of certain regions that are partially inside the United States, which effectively computes the population of only the US portion. This is disclaimed along with other statistics using a parenthetical. See for example: https://urbanstats.org/article.html?longname=Tijuana+Urban+Center%2C+Mexico-USA

We would like to satisfy the following properties:

  1. Always compare like to like when possible. The GHS data set has certain systematic biases in it, and I think in practice it is better to preserve these biases within any comparison
  2. Show the most accurate kind of data in the default view of a page
  3. Disclaim clearly whenever data is non-Census or not for an entire region
  4. Do not throw up NaNs or have blank rows unless absolutely necessary

We are not currently doing #4 here, while accomplishing all the other goals. I think we could accomplish #4 with some kind of additional disclaimer for non-like-for-like comparisons.

Separately, we are not currently doing #2; see the Tijuana example above.
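
To make the "additional disclaimer" idea for #4 concrete, here is a rough sketch of how a comparison row could pick its values. It is not the actual urbanstats code: `Region`, `census_pop`, `ghs_pop`, and `comparison_row` are hypothetical names, and the population figures are placeholders rather than real data. It prefers a single source shared by every region (property 1), and only when no shared source exists does it fall back to each region's best available number with a disclaimer, instead of showing NaN.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    # Hypothetical record; None means that source has no figure for the region.
    name: str
    census_pop: Optional[float] = None
    ghs_pop: Optional[float] = None


def comparison_row(regions: List[Region]) -> Tuple[List[Optional[float]], Optional[str]]:
    """Return (values, disclaimer) for the population row of a comparison."""
    # Like-for-like first: use one source for every region, Census before GHS.
    for attr, note in (("census_pop", None), ("ghs_pop", "GHS estimates")):
        values = [getattr(r, attr) for r in regions]
        if all(v is not None for v in values):
            return values, note
    # No shared source: show each region's best number and say so,
    # rather than leaving a NaN in the row.
    mixed = [r.census_pop if r.census_pop is not None else r.ghs_pop for r in regions]
    return mixed, "not like-for-like: sources differ between regions"


# Placeholder numbers, for illustration only.
dublin_ca = Region("Dublin city, California, USA", census_pop=70_000.0)
dublin_ie = Region("Dublin Urban Center, Ireland", ghs_pop=1_200_000.0)
values, disclaimer = comparison_row([dublin_ca, dublin_ie])
print(values, disclaimer)
```

With this shape, a state or urban center that has both numbers would still be compared against Ireland using GHS on both sides, and the mixed-source disclaimer would only appear for cases like the one in this issue. A value stays `None` only when a region has no figure from either source.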