Closed danyx23 closed 10 months ago
Couple of things I love about this approach:
I'm happy to help with the prototype!
Marking as nice-to-have for this cycle, but we could revisit it in a future Data Pages revision.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hilarious timing for stalebot to close it given our discussions this morning.
Happy to help out next week on this if needed!
As part of the data pages project we want to point users to the code that built the (grapher level) dataset of the variable they are looking at in the data page. We initially thought that just linking to the python source file of the grapher step would be fine but as we looked into this in more detail we thought it could be quite confusing.
What we now want is a tool that can generate a readme.md for each dataset folder (maybe for all steps, maybe only the grapher level steps) that gives a brief intro to how data processing in the etl works at OWID and what the dependencies of the dataset in this folder are - rendered as a mermaid diagram in the readme.
There is more context and discussion in this Notion page.
We already have a script that can generate a dot file syntax for graphviz, but mermaid has the benefit of rendering inline on github and allowing a limited form of interactivity (clicks send you to github urls of the step in question).
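To make the mermaid option concrete, a minimal sketch of how such a generator could work - the step names, the URL scheme, and the `to_mermaid` helper are all illustrative assumptions, not the actual etl layout or the existing dot-file script:

```python
# Hypothetical sketch: render a step -> dependencies mapping as mermaid
# flowchart syntax, with click directives linking each node to its source
# file on GitHub. The base URL and the ".py per step" convention are
# assumptions for illustration.

GITHUB_BASE = "https://github.com/owid/etl/blob/master/etl/steps/data"


def to_mermaid(deps: dict[str, list[str]]) -> str:
    """Turn {step: [dependency, ...]} into mermaid flowchart source."""
    lines = ["flowchart TD"]
    # assign a short node id to every step, including pure dependencies
    ids = {step: f"n{i}" for i, step in enumerate(sorted(deps))}
    for dependencies in deps.values():
        for dep in dependencies:
            if dep not in ids:
                ids[dep] = f"n{len(ids)}"
    # one edge per dependency relation, pointing from dependency to step
    for step, dependencies in deps.items():
        for dep in dependencies:
            lines.append(f'    {ids[dep]}["{dep}"] --> {ids[step]}["{step}"]')
    # click directives give the limited interactivity mentioned above
    for step, node in ids.items():
        lines.append(f'    click {node} "{GITHUB_BASE}/{step}.py"')
    return "\n".join(lines)


print(to_mermaid({"grapher/who/ghe": ["garden/who/ghe"]}))
```

GitHub renders a fenced `mermaid` block containing this output inline, which is the main advantage over committing graphviz dot files.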
It is a bit unclear when we'd like to generate this readme. We want it to end up on GitHub and stay up to date, so we could either include it in the `make etl` command or have a GitHub action that (re)generates the readme.md files. One other question is whether we want to use `readme.md` at all - we will want to overwrite these files when we update the etl, and that could overwrite manual changes. So maybe `dependencies.md` would be better? The text for the readme.md should come from a template that should be stored somewhere in this repository.
The language in the readme should try to explain the steps in a way that is understandable even if users have no clear idea of how data pipelines work.
Below is the current draft of what a readme could look like.
Example readme for WHO GHE dataset
The data processing pipeline at Our World In Data follows a common structure, but different datasets need different levels of curation, harmonization and combination with other datasets. The common structure of processing is as follows:
The chart below shows the processing steps that result in the dataset in this folder. Every box links to the code on Github that is used to execute the corresponding step - hold CTRL/CMD while clicking to open the linked target in the browser.