owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

Write a script that generates readme.md files for datasets with mermaid dependency graphs #882

Closed: danyx23 closed this issue 10 months ago

danyx23 commented 1 year ago

As part of the data pages project, we want to point users to the code that built the (grapher-level) dataset behind the variable they are looking at on the data page. We initially thought that just linking to the Python source file of the grapher step would be fine, but as we looked into this in more detail we realized it could be quite confusing.

What we now want is a tool that can generate a readme.md for each dataset folder (maybe for all steps, maybe only the grapher-level steps) that gives a brief intro to how data processing in the etl works at OWID and shows what the dependencies of the dataset in this folder are, rendered as a mermaid diagram in the readme.
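A minimal sketch of the diagram-generating half, assuming the DAG is available as a YAML mapping of step URIs to the list of steps they depend on (the file name, the `steps:` key, and the function are hypothetical illustrations, not the actual etl API):

```python
from pathlib import Path

import yaml

# Hypothetical DAG file layout, e.g.:
#   steps:
#     data://garden/who/2022-09-30/ghe:
#       - data://meadow/who/2022-09-30/ghe
DAG_FILE = Path("dag.yml")


def mermaid_for_step(step: str, dag: dict) -> str:
    """Render the dependency subgraph of `step` as a mermaid flowchart."""
    lines = ["flowchart TD"]
    seen = set()

    def visit(uri: str) -> None:
        if uri in seen:
            return
        seen.add(uri)
        for dep in dag.get(uri) or []:
            # Dependencies point at the step that consumes them.
            lines.append(f"    {dep} --> {uri}")
            visit(dep)

    visit(step)
    return "\n".join(lines)


if __name__ == "__main__":
    dag = yaml.safe_load(DAG_FILE.read_text())["steps"]
    print(mermaid_for_step("data://grapher/who/2022-09-30/ghe", dag))
```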

There is more context and discussion in this Notion page.

We already have a script that can generate dot file syntax for Graphviz, but mermaid has the benefit of rendering inline on GitHub and allowing a limited form of interactivity (clicks take you to the GitHub URL of the step in question).

It is a bit unclear when we'd want to generate this readme. We want it to end up on GitHub and stay up to date, so we could either include the generation in the `make etl` command or have a GitHub Action that (re)generates the readme.md files.

One other question is whether we want to use readme.md at all: we will want to overwrite these files whenever we update the etl, and that could clobber manual changes. So maybe dependencies.md would be better?
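Whichever trigger we pick, the generator could write dependencies.md for every grapher step in one pass, so reruns never clobber hand-written notes. A rough sketch, reusing the hypothetical `mermaid_for_step` helper from the sketch above and a made-up output layout:

```python
from pathlib import Path


def regenerate_all(dag: dict, out_root: Path = Path("data")) -> None:
    """Rewrite dependencies.md for every grapher step in the DAG."""
    for step in dag:
        if not step.startswith("data://grapher/"):
            continue
        # Hypothetical layout: data/grapher/who/2022-09-30/ghe/dependencies.md
        folder = out_root / step.removeprefix("data://")
        folder.mkdir(parents=True, exist_ok=True)
        diagram = mermaid_for_step(step, dag)  # hypothetical helper, sketched above
        (folder / "dependencies.md").write_text(diagram + "\n")
```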

The text for the readme.md should come from a template stored somewhere in this repository.
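A minimal sketch of the rendering side, using the standard library's `string.Template` (the template path and placeholder names are made up for illustration):

```python
from pathlib import Path
from string import Template


def render_readme(step: str, diagram: str) -> str:
    """Fill the shared template with this dataset's details."""
    # Hypothetical template location inside this repository.
    template = Template(Path("etl/templates/dataset_readme.md").read_text())
    # $step and $diagram are illustrative placeholder names in the template.
    return template.substitute(step=step, diagram=diagram)
```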

The language in the readme should try to explain the steps in a way that is understandable even to users who have no clear idea of how data pipelines work.

Below is the current draft of what a readme could look like.

Example readme for WHO GHE dataset

The data processing pipeline at Our World in Data follows a common structure, but different datasets need different levels of curation, harmonization and combination with other datasets. The common structure of processing is as follows (a rough code sketch follows the list):

  1. Fetch data from an upstream provider (often large institutions like the World Bank, the WHO, …)
  2. Store a snapshot of this data to guarantee reproducibility
  3. Reformat the data into a pandas DataFrame and verify data types
  4. Harmonize countries and curate/enhance/combine the data (e.g. by adding per capita metrics)
  5. Prepare the data for visualization in our Grapher tool
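
To make steps 3 to 5 concrete, a garden-style step might look roughly like the minimal sketch below (the column names and the country mapping are made up for illustration; the real steps use the etl's own datasets and helpers):

```python
import pandas as pd

# Hypothetical mapping from provider country names to OWID's canonical
# names; in the real pipeline this comes from the reference dataset.
COUNTRY_MAPPING = {"Viet Nam": "Vietnam", "Republic of Korea": "South Korea"}


def run(df: pd.DataFrame) -> pd.DataFrame:
    # Step 4: harmonize country names.
    df["country"] = df["country"].replace(COUNTRY_MAPPING)
    # Step 4 (continued): enhance the data, e.g. add a per-capita metric
    # (assumes `deaths` and `population` columns exist in this example).
    df["deaths_per_capita"] = df["deaths"] / df["population"]
    # Step 5: keep the tidy columns the Grapher tool expects.
    return df[["country", "year", "deaths", "deaths_per_capita"]]
```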

The chart below shows the processing steps that result in the dataset in this folder. Every box links to the code on GitHub that executes the corresponding step; hold CTRL/CMD while clicking to open the link in your browser.

flowchart TD
    upstream[https://www.who.int/data/global-health-estimates]---|snapshotted on 2022-09-30|walden://who/2022-09-30/ghe
    data://garden/reference[OWID reference dataset for countries]-->data://garden/who/2022-09-30/ghe
    walden://who/2022-09-30/ghe[Snapshot of original data: walden://who/2022-09-30/ghe]-->data://meadow/who/2022-09-30/ghe
    data://meadow/who/2022-09-30/ghe[Reformatted into a dataframe: data://meadow/who/2022-09-30/ghe]-->data://garden/who/2022-09-30/ghe
    data://garden/who/2022-09-30/ghe[Countries harmonized and data transformed: data://garden/who/2022-09-30/ghe]-->data://grapher/who/2022-09-30/ghe[Prepared for visualization: data://grapher/who/2022-09-30/ghe]
    click upstream "https://www.who.int/data/global-health-estimates" "Upstream data"
    click data://garden/reference "https://github.com/owid/etl/tree/master/data/garden/reference" "Reference"
    click walden://who/2022-09-30/ghe "https://github.com/owid/walden/blob/master/owid/walden/index/who/2022-09-30/ghe.json" "Walden snapshot"
    click data://meadow/who/2022-09-30/ghe "https://github.com/owid/etl/blob/master/etl/steps/data/meadow/who/2022-09-30/ghe.py" "Meadow transformations (reformatting to dataframe)"
    click data://garden/who/2022-09-30/ghe "https://github.com/owid/etl/blob/master/etl/steps/data/garden/who/2022-09-30/ghe.py" "Garden transformations (country harmonization etc.)"
    click data://grapher/who/2022-09-30/ghe "https://github.com/owid/etl/blob/master/etl/steps/data/grapher/who/2022-09-30/ghe.py" "Grapher transformations (prepare for visualization)"
Marigold commented 1 year ago

Couple of things I love about this approach:

  1. We stay in ETL land (with GitHub and Markdown)
  2. The pipeline description could be stored in a structured YAML file that we generate the README from
  3. We could have another walkthrough with README previews, which should make creating them as efficient as possible

I'm happy to help with the prototype!

larsyencken commented 1 year ago

Marking as nice-to-have for this cycle, but we could revisit it in a future Data Pages revision.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

larsyencken commented 1 year ago

Hilarious timing for stalebot to close it given our discussions this morning.

Happy to help out next week on this if needed!

stale[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.