Create anomaly detection workflow

pabloarosado commented 1 week ago

Summary

We should perform better data quality checks, and ideally have tools to help us identify data anomalies, as part the normal flow of our data work.

Problem

We currently perform sanity checks via assertions or with ad-hoc code outside of ETL, or via visual inspection on Indicator Upgrader explorer tool, or chart diff. But many data issues still remain in our data, and often users point them out.

In most cases, the issues were "not our fault", since they were already in the original data. However, we should, at least, be aware of these issues, contact data providers early on, and fix them when possible.

Impact

These data issues can lead to the following undesired outcomes:

It's time consuming for data managers to add ad-hoc sanity checks
Users become aware of data quality issues that we didn't spot first, which erodes trust
We miss spurious data points that we might otherwise have removed, meaning spurious trends may be picked up and amplified by our users

Scope

[ ] #3369
[ ] #3370
[ ] #3371

List of PRs

lucasrodes commented 1 week ago

Notes taken during our workshop in the offsite

Marigold commented 1 week ago

Dataset Quality Checker Proposal

This is a rough idea for a basic dataset quality checker. The goal is to see how technically challenging it might be while helping us spot important anomalies we can actually do something about. We're looking for things like fat-finger errors, out-of-range values, or extreme differences between countries. We're not trying to catch more complex stuff like regime shifts or changes in definitions—those are a bit too advanced for now.

Calculate statistics:
For each indicator in the dataset, let's calculate some basic stats—min, max, mean, etc. We'll also look at differences between data points to catch any sudden jumps. Maybe we'll throw in some other random checks to see what sticks.
Add Some Context:
Attach relevant metadata for each indicator (like unit, description, etc.) so we can better understand the numbers.
Ask ChatGPT:
Feed all this info into ChatGPT and ask it to flag any weird indicators or countries. While we're at it, we can also ask for countries that might be useful for comparison.
Visualize the Results:
Once we have the flagged data, let's plot it as charts so we can easily see what's going on.

This could be relatively easy to pull out for a few datasets and see how well does it work. Ideally, we'd have a list of real anomalies to test it against, but if not, we can just artificially mess up data.

pabloarosado commented 1 week ago

Proposal A

Summary

We create a new dedicated page in wizard for anomaly detection. To begin with, this will be a table with one row per indicator, among the list of indicators from datasets that have changed in a given PR.

Changes in wizard

We need a new dedicated page for anomaly detection.

New backend code for anomaly detection in wizard

We need the code to execute anomaly detection at the individual indicator level, and when comparing indicator versions.

Changes in indicator upgrader (optional)

We may also need to look into how to handle variable mappings. In the easiest version, we just store a json file locally, but in the long term it would be good to be able to "restore" mappings in indicator upgrader, and store mappings in the grapher database.

Distribution of tasks

We thought we could roughly divide the work as follows:

@lucasrodes can take care of creating the streamlit app.
- Lucas can also think a bit more about how to handle variable mappings, and whether the changes in indicator upgrader are necessary.
@Marigold can take care of indicator-related anomaly detection, possibly using LLMs.
@pabloarosado can take care of the indicator versions comparison, e.g. comparing each of the old and new indicators in an updated dataset, and detecting abrupt changes.

lucasrodes commented 1 week ago

To add on Pablo's summary:

An extra motivation for working on something that keeps variable mappings is that we could use this information for other projects. Relating old → new indicators could enable things like pointing users to older indicator versions on our site, etc.

lucasrodes commented 1 week ago

Workflow

How I see the data manager workflow:

User adds new indicators to DB.
Anomaly detector (AD) finds anomalies.
Streamlit app presents these anomalies to the user.
User decides what to do with it: Go back to ETL code and fix these, ignore the anomalies, etc.

Challenges

If the user sees an anomaly and tries to fix it (step 4), then they will want to (i) execute the code again and (ii) go and check in the app if anything has changed (e.g. anomaly is no longer present). They might expect the app to be refreshed, meaning we must run the AD again. Is this scalable? Is the latency ok?

Possible structure of AD output

Step 2 generates an output summary of the anomalies, which is then used in the streamlit app (step 3). We should agree on its format

anomalies:
  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_1"
    - description: "Sudden spike in ..."
    - description: "Sudden drop in ..."
    - description: "Missing values in period ..."

  - indicator_slug: "grapher/bla/2024-01-01/bla/bla#indicator_2"
    - description: "Values way higher in ..."
      indicator_old_slug: "grapher/bla/2023-01-01/bla/bla#indicator_2"

Note: some anomalies could be just for one indicator, and others could be for that indicator relative to its old version (see indicator_old_slug).

Optional: We could try to pin-point country & years for single anomaly (i.e. 'when that anomaly happened'), though that could be either 'a single country-year', 'a single country for a year period', 'multiple countries and years', etc. So I am unsure about the format at the moment.

larsyencken commented 1 week ago

From our discussion:

First version

The UI and workflow / Lucas

In the Wizard, there is a page for anomaly detection
The page should present you with a list of all changed grapher indicators
- It should read from local disk what indicators are upgrades of other indicators
- You may optionally ask about any arbitrary ETL path for a grapher indicator
You click "Detect all anomalies"
It begins fetches data from the S3 API and finds things whilst you wait
- It should be interactive speed, i.e. take less than one minute
- If an indicator is an upgrade, it also detects within the context of the previous indicator
- (optional) It would be dreamy if they appear one by one whilst you wait
The anomaly detector tries to return only the few "most important" anomalies
It stores them in some YAML file or SQLite DB; they're still there if you come back later
It provides a one-click link for any anomaly that takes you to the indicator (either in a chart, or in the admin)
There is a button to regenerate them all
There is a button to delete them all
They are ephemeral: when you merge your branch, and your staging server dies, they disappear
- But you can still manually add something important to "What you should know about this indicator" for its data page

Anomalies within an indicator / Mojmir

We will probably use a ChatGPT-based approach. A major challenge is packing the dataset into the 128k context window. We could:

Using summary statistics or otherwise compressing each country series some way
Split the data and use multiple ChatGPT calls
Split the data, generate thumbnails, and do detection based on thumbnail images
Consider Claude with a 200k context window

Anomalies relative to a baseline / Pablo

We can experiment with an approach based on percentage change.

Later options

If all of this went well, we could consider:

Automatically trigger this on change (e.g. on push), or otherwise do them as a batch
Some kind of workflow where we save the most important stuff into metadata
Some kind of visualisation that can show you the chart with the anomaly

lucasrodes commented 6 days ago

Today we had a long discussion where we discussed which are the tasks pending and how to organize ourselves to work on this project.

From our meeting, we concluded that there are three main parts:

Backend: Working on a rule-based anomaly detection tool for single indicators and indicator upgrades.
Backend/DB: Make sense of the detected anomalies, group similar ones, and store them in a database.
UI: Present the relevant anomalies to the user in a streamlit app.

Some other important points:

We might need pagination
Review/Pending anomaly status is just optional (we don't force users to review, like in chart diff)
We need filter options to show more/less anomalies, show certain anomalies, etc.

Other pending discussion points

Agree on each phase's input and output specs (1, 2, 3).
It needs to be clarified what gets saved in the database.

Figma board

owid / etl