
Create anomaly detection workflow #3340

Open · pabloarosado opened 1 week ago

pabloarosado commented 1 week ago

Summary

We should perform better data quality checks and, ideally, have tools that help us identify data anomalies as part of the normal flow of our data work.

Problem

We currently perform sanity checks via assertions, via ad-hoc code outside of the ETL, or via visual inspection (in Indicator Upgrader, the explorer tool, or chart diff). But many data issues still remain in our data, and users often point them out.

In most cases, the issues were "not our fault", since they were already present in the original data. However, we should at least be aware of these issues, contact data providers early on, and fix them when possible.

Impact

These data issues can lead to the following undesired outcomes:

Scope

List of PRs

lucasrodes commented 1 week ago

Notes taken during our workshop at the offsite:

Marigold commented 1 week ago

Dataset Quality Checker Proposal

This is a rough idea for a basic dataset quality checker. The goal is to see how technically challenging it might be while helping us spot important anomalies we can actually do something about. We're looking for things like fat-finger errors, out-of-range values, or extreme differences between countries. We're not trying to catch more complex stuff like regime shifts or changes in definitions—those are a bit too advanced for now.

  1. Calculate Statistics:
    For each indicator in the dataset, let's calculate some basic stats (min, max, mean, etc.). We'll also look at differences between consecutive data points to catch any sudden jumps (a rough sketch of these first steps follows below). Maybe we'll throw in some other random checks to see what sticks.

  2. Add Some Context:
    Attach relevant metadata for each indicator (like unit, description, etc.) so we can better understand the numbers.

  3. Ask ChatGPT:
    Feed all this info into ChatGPT and ask it to flag any weird indicators or countries. While we're at it, we can also ask for countries that might be useful for comparison.

  4. Visualize the Results:
    Once we have the flagged data, let's plot it as charts so we can easily see what's going on.

This could be relatively easy to try out for a few datasets to see how well it works. Ideally, we'd have a list of real anomalies to test it against, but if not, we can just artificially mess up the data.
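As a minimal sketch of steps 1 and 2, assuming a long-format dataframe with hypothetical country, year, and indicator columns (pandas-based; nothing here is final):

import pandas as pd

# Toy data: country A has a fat-finger jump in 2002.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp": [1.0, 1.1, 50.0, 2.0, 2.1, 2.2],
})

def indicator_stats(df: pd.DataFrame, indicator: str) -> dict:
    """Basic statistics for one indicator, plus the largest year-on-year jump."""
    values = df[indicator]
    # Largest absolute change between consecutive values within each country.
    jumps = (
        df.sort_values(["country", "year"])
        .groupby("country")[indicator]
        .diff()
        .abs()
    )
    return {
        "indicator": indicator,
        "min": values.min(),
        "max": values.max(),
        "mean": values.mean(),
        "std": values.std(),
        "n_missing": int(values.isna().sum()),
        "max_jump": jumps.max(),
    }

# One summary per indicator column; metadata (unit, description, ...) would be
# attached to each summary before feeding everything to ChatGPT (step 3).
stats = [indicator_stats(df, c) for c in df.columns if c not in ("country", "year")]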

pabloarosado commented 1 week ago

Proposal A

Summary

We create a new dedicated page in wizard for anomaly detection. To begin with, this will be a table with one row per indicator, covering the indicators from datasets that have changed in a given PR.

Changes in wizard

We need a new dedicated page for anomaly detection.

New backend code for anomaly detection in wizard

We need the code to execute anomaly detection at the individual indicator level, and when comparing indicator versions.

Changes in indicator upgrader (optional)

We may also need to look into how to handle variable mappings. In the easiest version, we just store a JSON file locally, but in the long term it would be good to be able to "restore" mappings in indicator upgrader and store them in the grapher database.
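As an illustration of the easiest version, the mapping could be a local JSON file of old-to-new variable IDs (file name and IDs below are hypothetical):

import json
from pathlib import Path

MAPPING_FILE = Path("variable_mapping.json")  # hypothetical local file

def save_mapping(mapping: dict[int, int]) -> None:
    """Store a mapping from old variable IDs to new variable IDs."""
    MAPPING_FILE.write_text(json.dumps(mapping, indent=2))

def load_mapping() -> dict[int, int]:
    """Restore a previously saved mapping (e.g. to pre-fill indicator upgrader)."""
    # JSON object keys are always strings, so convert them back to ints.
    return {int(old): new for old, new in json.loads(MAPPING_FILE.read_text()).items()}

save_mapping({123456: 234567, 123457: 234568})

Storing the same structure in the grapher database later should be straightforward, since it is just a set of (old_id, new_id) pairs.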

Distribution of tasks

We thought we could roughly divide the work as follows:

lucasrodes commented 1 week ago

To add to Pablo's summary:

lucasrodes commented 1 week ago

Workflow

How I see the data manager workflow:

  1. User adds new indicators to DB.
  2. Anomaly detector (AD) finds anomalies.
  3. Streamlit app presents these anomalies to the user (a minimal sketch follows after this list).
  4. User decides what to do with them: go back to the ETL code and fix things, ignore the anomalies, etc.
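A minimal sketch of step 3, assuming the detector hands the app a list of anomaly dicts (the structure is hypothetical):

import pandas as pd
import streamlit as st

# Hypothetical output of the anomaly detector (step 2).
anomalies = [
    {"indicator": "indicator_1", "description": "Sudden spike in ..."},
    {"indicator": "indicator_2", "description": "Values way higher in ..."},
]

st.title("Anomaly detector")
st.dataframe(pd.DataFrame(anomalies))

# Step 4: the user decides what to do with the anomalies.
if st.button("Ignore all anomalies"):
    st.write("Anomalies ignored.")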

Challenges

Possible structure of AD output

Step 2 generates an output summary of the anomalies, which is then used in the streamlit app (step 3). We should agree on its format.

anomalies:
  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_1"
    descriptions:
      - "Sudden spike in ..."
      - "Sudden drop in ..."
      - "Missing values in period ..."

  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_2"
    descriptions:
      - "Values way higher in ..."
    indicator_old_slug: "grapher/bla/2023-01-01/bla/bla#indicator_2"
Note: some anomalies could be just for one indicator, and others could be for that indicator relative to its old version (see indicator_old_slug).

Optional: We could try to pinpoint the country and years for each anomaly (i.e. when the anomaly happened), though that could be 'a single country-year', 'a single country over a range of years', 'multiple countries and years', etc. So I am unsure about the format at the moment.
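To make the format concrete, here is a sketch of how the streamlit app (step 3) could consume such a file, assuming the schema above and PyYAML (the file name is hypothetical):

import yaml

with open("anomalies.yml") as f:
    report = yaml.safe_load(f)

for anomaly in report["anomalies"]:
    slug = anomaly["indicator_slug"]
    # Anomalies relative to an old version carry an extra key.
    old_slug = anomaly.get("indicator_old_slug")
    for description in anomaly["descriptions"]:
        label = f"{slug} (vs {old_slug})" if old_slug else slug
        print(f"{label}: {description}")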

larsyencken commented 1 week ago

From our discussion:

First version

The UI and workflow / Lucas

Anomalies within an indicator / Mojmir

We will probably use a ChatGPT-based approach. A major challenge is packing the dataset into the 128k context window. We could:

Anomalies relative to a baseline / Pablo

We can experiment with an approach based on percentage change.
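For instance, a rough pandas sketch, assuming both versions of an indicator come as long-format dataframes with country, year, and value columns (all names hypothetical):

import pandas as pd

def flag_large_changes(old: pd.DataFrame, new: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Flag country-years where the new version deviates strongly from the baseline."""
    merged = old.merge(new, on=["country", "year"], suffixes=("_old", "_new"))
    # Relative change with respect to the baseline; a zero baseline yields inf,
    # which is always flagged.
    merged["pct_change"] = (merged["value_new"] - merged["value_old"]) / merged["value_old"].abs()
    return merged[merged["pct_change"].abs() > threshold]

The threshold would need tuning per indicator (or per unit), since a 50% change is alarming for population but unremarkable for more volatile indicators.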

Later options

If all of this went well, we could consider:

lucasrodes commented 6 days ago

Today we had a long discussion about the pending tasks and how to organize ourselves to work on this project.

From our meeting, we concluded that there are three main parts:

  1. Backend: Working on a rule-based anomaly detection tool for single indicators and indicator upgrades.
  2. Backend/DB: Make sense of the detected anomalies, group similar ones, and store them in a database (a sketch follows after this list).
  3. UI: Present the relevant anomalies to the user in a streamlit app.
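For part 2, one possible shape for the anomalies table (a sketch using sqlite3 for brevity; the real thing would presumably live in the grapher MySQL database, and all names are hypothetical):

import sqlite3

conn = sqlite3.connect("anomalies.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS anomalies (
        id INTEGER PRIMARY KEY,
        indicator_slug TEXT NOT NULL,
        indicator_old_slug TEXT,  -- only set for version-comparison anomalies
        description TEXT NOT NULL,
        group_id TEXT             -- groups similar anomalies together
    )
    """
)
conn.commit()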

Some other important points:

Other pending discussion points

[image: Figma board]