Open pabloarosado opened 1 week ago
Dataset Quality Checker Proposal
This is a rough idea for a basic dataset quality checker. The goal is to see how technically challenging it might be while helping us spot important anomalies we can actually do something about. We're looking for things like fat-finger errors, out-of-range values, or extreme differences between countries. We're not trying to catch more complex stuff like regime shifts or changes in definitions—those are a bit too advanced for now.
Calculate statistics:
For each indicator in the dataset, let's calculate some basic stats—min, max, mean, etc. We'll also look at differences between data points to catch any sudden jumps. Maybe we'll throw in some other random checks to see what sticks.
Add Some Context:
Attach relevant metadata for each indicator (like unit, description, etc.) so we can better understand the numbers.
Ask ChatGPT:
Feed all this info into ChatGPT and ask it to flag any weird indicators or countries. While we're at it, we can also ask for countries that might be useful for comparison.
Visualize the Results:
Once we have the flagged data, let's plot it as charts so we can easily see what's going on.
This could be relatively easy to pull out for a few datasets and see how well does it work. Ideally, we'd have a list of real anomalies to test it against, but if not, we can just artificially mess up data.
We create a new dedicated page in wizard for anomaly detection. To begin with, this will be a table with one row per indicator, among the list of indicators from datasets that have changed in a given PR.
We need a new dedicated page for anomaly detection.
We need the code to execute anomaly detection at the individual indicator level, and when comparing indicator versions.
We may also need to look into how to handle variable mappings. In the easiest version, we just store a json file locally, but in the long term it would be good to be able to "restore" mappings in indicator upgrader, and store mappings in the grapher database.
We thought we could roughly divide the work as follows:
To add on Pablo's summary:
How I see the data manager workflow:
Step 2 generates an output summary of the anomalies, which is then used in the streamlit app (step 3). We should agree on its format
anomalies:
- indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_1"
- description: "Sudden spike in ..."
- description: "Sudden drop in ..."
- description: "Missing values in period ..."
- indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_2"
- description: "Values way higher in ..."
indicator_old_slug: "grapher/bla/2023-01-01/bla/bla#indicator_2"
Note: some anomalies could be just for one indicator, and others could be for that indicator relative to its old version (see indicator_old_slug
).
Optional: We could try to pin-point country & years for single anomaly (i.e. 'when that anomaly happened'), though that could be either 'a single country-year', 'a single country for a year period', 'multiple countries and years', etc. So I am unsure about the format at the moment.
From our discussion:
We will probably use a ChatGPT-based approach. A major challenge is packing the dataset into the 128k context window. We could:
We can experiment with an approach based on percentage change.
If all of this went well, we could consider:
Today we had a long discussion where we discussed which are the tasks pending and how to organize ourselves to work on this project.
From our meeting, we concluded that there are three main parts:
Some other important points:
Other pending discussion points
Summary
We should perform better data quality checks, and ideally have tools to help us identify data anomalies, as part the normal flow of our data work.
Problem
We currently perform sanity checks via assertions or with ad-hoc code outside of ETL, or via visual inspection on Indicator Upgrader explorer tool, or chart diff. But many data issues still remain in our data, and often users point them out.
In most cases, the issues were "not our fault", since they were already in the original data. However, we should, at least, be aware of these issues, contact data providers early on, and fix them when possible.
Impact
These data issues can lead to the following undesired outcomes:
Scope
List of PRs