vega / vega-datasets

Common repository for example datasets used by Vega-related projects

Address data inconsistencies and absence of versioning or sourcing in gapminder data #577

Closed dsmedia closed 16 hours ago

dsmedia commented 1 week ago

Gapminder data, from a Swedish non-profit, is a popular part of this repository and truly fascinating to explore using visualization tools in the vega ecosystem. While working on an Altair example, I discovered what looked like a simple issue in the gapminder.json dataset, but as I looked into fixing it with a simple pull request, the right solution seemed a bit more complex, and I wanted to lay out my thoughts here for feedback.

The immediate issue I found is that the life expectancy data for North and South Korea appears to have been swapped. For 2005, this repository's dataset shows South Korea's life expectancy as 67.297 years and North Korea's as 78.623 years. This contradicts current Gapminder life expectancy data (v14), which reports approximately the reverse. It also raises the question of whether other errors are lurking in the dataset.
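A quick way to reproduce this check is to look up both countries for the same year. The following is only a minimal sketch: it assumes records shaped like those in gapminder.json with `country`, `year`, and `life_expect` fields (the field names are an assumption here), and the two sample records simply restate the 2005 values quoted above.

```python
# Sample records restating the 2005 values quoted in this issue.
# The field names (country, year, life_expect) are assumed to match
# the layout of gapminder.json.
records = [
    {"country": "South Korea", "year": 2005, "life_expect": 67.297},
    {"country": "North Korea", "year": 2005, "life_expect": 78.623},
]


def life_expect(rows, country, year):
    """Return life expectancy for one country-year, or None if absent."""
    for row in rows:
        if row["country"] == country and row["year"] == year:
            return row["life_expect"]
    return None


south = life_expect(records, "South Korea", 2005)
north = life_expect(records, "North Korea", 2005)

# In this repository's file, North Korea ranks above South Korea,
# the reverse of what the current Gapminder data reports.
looks_swapped = north > south
print(f"South Korea: {south}, North Korea: {north}, swapped? {looks_swapped}")
```

To run this against the real file, `records` could be replaced with the result of `json.load` on a local copy of gapminder.json.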

Resolving this issue is complicated by the absence of sourcing or versioning details for the gapminder data in SOURCES.md. The json file in this repository appears to be based on an older version of the dataset that I could not locate. For instance, Afghanistan's 1955 life expectancy is 30.332 years in the vega-datasets json, which is much closer to Gapminder's v11 data (32.48 years) than to the current v14 (43.88 years).

Given what the vega-datasets README states about versioning, there seem to be a few options for a solution:

  1. Patch release: If the Korea data swap is confirmed as a formatting error, it could potentially be addressed in a patch release. That said, I still haven't been able to locate an older version of a Gapminder file containing data that matches the vega-datasets json.

  2. Minor release: Updating the dataset with current Gapminder figures without changing field names or file names could be done in a minor release. This could address the outdated data issue. But the data could be significantly different (as in the Afghanistan life expectancy data) and some country names may have changed.

  3. Major release: If we need to change field names (e.g., updating regional classification field name "cluster" to align with current Gapminder terminology) or significantly alter file contents, a major release would be necessary.

Regardless of the chosen approach, I propose:

  1. Considering whether to add a disclaimer of some kind in the repository about the intended / appropriate use cases for the data (given the repository can have errors, may be out of date, isn't actively maintained, that it's more for demo purposes) and/or encouraging that non-demonstration use cases refer back to the original sources rather than rely on the vega-datasets repository.
  2. Considering how best to adhere to appropriate sourcing requirements for datasets, such as attribution. Gapminder's license page lists attribution requirements.
  3. Updating SOURCES.md with detailed sourcing information
  4. Considering how to handle the other gapminder file in vega-datasets, gapminder-health-income.csv, which I haven't looked at.

domoritz commented 1 week ago

Let's add a comment. Something to the effect of https://github.com/vega/vega-datasets/issues/111#issuecomment-558155585. We can still update the datasets but let's at least use a minor version bump so that we don't accidentally break test cases that rely on exact values.

dsmedia commented 1 week ago

Thanks. Given the possibility that the Korea data issues were added intentionally for instructional purposes (as noted for other datasets here, here, and here) perhaps we leave the data file as is, and just add a data usage note in the README.txt (and SOURCES.md?).

There's probably a case for placing this note prominently (higher up) in the docs given the acknowledgement here that some are using this repository in unintended ways.

Maybe something like the below? I'd be happy to open a PR for this if you think it would be helpful.

Data Usage Note

These datasets are intended only for instructional and demonstration purposes. Datasets may contain intentional inconsistencies or errors to provide opportunities for data cleaning exercises and to illustrate common data quality issues.

domoritz commented 1 week ago

Let's add a data usage note to the readme only. Yes, please send a pull request.

I think we can still update the Gapminder data to a known version number that we can link to. I think that would be worth a minor version bump, and I'd love it if you could send a pull request that updates the dataset and SOURCES.md accordingly, since right now it's empty.

dsmedia commented 1 week ago

The remaining two tasks will be addressed together in a separate pull request.

dsmedia commented 1 week ago

Before the minor version bump, I wanted to highlight the significance of some of the revisions made by Gapminder to its demographic data since the last update in this repo nine years ago. I've prepared visualizations for review (seemed appropriate given the audience!) prior to submitting the PR, since the data changes may flow downstream to many existing charts that rely on the dataset. Some of the changes (South Korea and North Korea) appear to fix errors in the series; others reflect new estimates from sources deemed credible by Gapminder, particularly around major world events that have had a sizable impact on life expectancy at birth. There do not appear to be annotated explanations for each of the major revisions in the Gapminder data series. I also figure this may be a helpful exercise for others considering updating any of the vega-datasets series in the future.

The scatter plots below show countries with notable revisions in life expectancy and fertility data, two of the three data series in this repo's gapminder.json. Countries are included if they have at least one year with a "significant" deviation (defined arbitrarily by me as 5+ years for life expectancy, or 0.75+ babies per woman for fertility) between old and revised data. Points represent 5-year intervals from 1955 to 2005.

Most of the 63 countries in the series have smaller deviations and are omitted from the plots. Population data has also been revised, but I've not shown those revisions here.
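The inclusion rule can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to produce the plots: the `(country, year)` keying, the `flag_countries` helper, and the `life_expect` / `fertility` field names are all hypothetical. The Afghanistan 1955 figures are the ones quoted earlier in this issue (repository file vs. Gapminder v14); "Exampleland" is made up to show a revision that falls below the threshold.

```python
# Thresholds as defined in the comment above: 5+ years of life expectancy
# or 0.75+ babies per woman of fertility.
THRESHOLDS = {"life_expect": 5.0, "fertility": 0.75}


def flag_countries(old, new, field):
    """Return countries with at least one year whose revision meets the threshold.

    old and new map (country, year) -> {field: value}; only keys present
    in both versions are compared.
    """
    limit = THRESHOLDS[field]
    flagged = set()
    for key in old.keys() & new.keys():
        if abs(new[key][field] - old[key][field]) >= limit:
            flagged.add(key[0])  # key[0] is the country name
    return sorted(flagged)


# Afghanistan 1955: values quoted earlier in this issue.
# "Exampleland": hypothetical country with a small revision.
old = {
    ("Afghanistan", 1955): {"life_expect": 30.332},
    ("Exampleland", 1955): {"life_expect": 60.0},
}
new = {
    ("Afghanistan", 1955): {"life_expect": 43.88},
    ("Exampleland", 1955): {"life_expect": 61.2},
}

print(flag_countries(old, new, "life_expect"))
```

Comparing only keys present in both versions means countries or years that exist in just one file are silently skipped, so name changes need a separate check.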

image

image

As a sanity check, I made a quick comparison with the World Bank data for Afghanistan's life expectancy at birth. This data is a closer fit with the revised Gapminder data than the nine-year-old version we had.

image

-- Aruba is no longer in the series. (In the PR I plan to substitute a new country for Aruba to keep the country count the same.)
-- Hong Kong will be renamed "Hong Kong, China" in line with the new Gapminder format.
-- I plan to add 2010 and 2015 data (two extra rows per country).
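Name changes like these can be surfaced by diffing the country lists of the two versions. The sketch below is a hypothetical helper, and the lists are small illustrative subsets rather than the full 63-country series. Note that a rename shows up as one dropped name plus one added name, so it still has to be matched up by hand.

```python
def diff_countries(old_names, new_names):
    """Return (dropped, added) country names between two dataset versions."""
    old_set, new_set = set(old_names), set(new_names)
    return sorted(old_set - new_set), sorted(new_set - old_set)


# Illustrative subsets only, not the full country lists.
old_names = ["Afghanistan", "Aruba", "Hong Kong"]
new_names = ["Afghanistan", "Hong Kong, China"]

dropped, added = diff_countries(old_names, new_names)
print("dropped:", dropped)
print("added:", added)
```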

domoritz commented 6 days ago

Thanks for the detailed analysis. I think we should stay true to the original data if possible rather than augmenting/modifying it ourselves. So I would say let's not substitute but update the data, names, and rows.