owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl

Move code of archived datasets to a dedicated folder or start deleting them #1242

Closed · Marigold closed this 12 months ago

Marigold commented 1 year ago

Currently we have an archive DAG where we put all datasets that aren't meant to be run. We still keep the code of their steps in etl/steps, which is a bit inconvenient for a few reasons:

  1. When refactoring code (renames, etc.), we usually do it for all versions, even the archived ones, because it's not easy to distinguish them
  2. When searching for code, I sometimes end up in an archived version and only realise it after some time
  3. Our type-checking and linting time keeps growing
  4. By having it in our codebase, we commit to keeping all versions functional

For example, aviation_safety_network has three versions (2022-10-12, 2022-10-14, 2023-04-18) that are nearly identical. For me, the biggest pain was refactoring faostat and WDI.

Moving all archived steps to an archive/steps folder would make it easy to exclude that folder from searches, linting and refactoring.

Another, more aggressive, option would be to start deleting datasets. I know it's controversial, but it lowers the maintenance burden and makes refactoring much easier. I think it'd be fine because:

  1. Old dataset versions still exist in the catalog and can be downloaded from there in case someone wants to compare data
  2. If you want to compare code, you can just go back in git history
  3. We can keep the code of the latest two versions (one could be archived). Do you think it's likely we'd look more than one version back?

This is especially relevant due to upcoming metadata changes. What do you think @pabloarosado @lucasrodes (maybe @danyx23 )?

danyx23 commented 1 year ago

What do you think of keeping a tombstone with the git sha around in the directory? I.e. the directory would still exist but would contain only a tombstone text file with the last git sha at which the directory still had content (i.e. HEAD~1). Maybe it could additionally contain a readme that explains this and says where in the catalog to find the data.
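
Something along these lines, just as a sketch (the TOMBSTONE.md name and the helper are made up, not anything that exists in etl today):

```python
# Sketch only; TOMBSTONE.md and this helper are illustrative, not existing etl code.
import subprocess
from pathlib import Path


def leave_tombstone(step_dir: Path) -> None:
    """Replace an archived step's code with a tombstone pointing at its last commit."""
    # Record the current commit, i.e. the last one that still contains the step's code
    # (assuming this runs right before the removal is committed).
    last_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

    # Drop the step's files, keeping only the tombstone.
    for path in step_dir.iterdir():
        if path.is_file():
            path.unlink()

    (step_dir / "TOMBSTONE.md").write_text(
        f"The code for this step was archived; it last existed at commit {last_sha}.\n"
        "The generated dataset is still available in the data catalog.\n"
    )
```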

This would make it easier to find this code in the future if you still need it but otherwise leave us with a much cleaner git repo. Thoughts?

Marigold commented 1 year ago

That would work too. Anything to have fewer files to maintain :). We could also store the git sha in the catalog (dataframe), which would make it easy to find.

larsyencken commented 1 year ago

On tombstones

I really like the idea of a tombstone. Instead of tombstones everywhere, though, I think we should keep one central list of them and have an archive-steps command you can run (which dry-runs unless you pass --really) that archives a dataset and the subtree that depends on it, with appropriate pre-flight checks in the db.
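
Roughly something like this (pure sketch, assuming a click-style CLI; the command name and helpers are illustrative, not actual etl code):

```python
# Rough sketch of the proposed command, not actual etl code. The shape (dry-run
# unless --really) follows the suggestion above; all helper logic is stubbed out.
import click


def find_dependent_steps(step: str) -> list[str]:
    """Placeholder: walk the DAG and return every step that depends on `step`."""
    return []


def archive_step(step: str) -> None:
    """Placeholder: move the step's code and DAG entry under archive/ after pre-flight checks."""


@click.command()
@click.argument("step")
@click.option("--really", is_flag=True, help="Actually archive; default is a dry-run.")
def archive_steps(step: str, really: bool) -> None:
    """Archive STEP and the subtree of steps that depends on it."""
    to_archive = [step, *find_dependent_steps(step)]
    for s in to_archive:
        if really:
            archive_step(s)
            click.echo(f"archived {s}")
        else:
            click.echo(f"[dry-run] would archive {s}")


if __name__ == "__main__":
    archive_steps()
```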

On state

I'd prefer to avoid creating a stateful catalog where we have ghost data but no recipes in it. The whole system has been designed to be a straightforward function of inputs, and the catalog is meant to be throwaway, which is a really simplifying property. If we remove a dataset, we should remove it from the catalog too.

On archiving

We do in fact keep an archive of the catalog at every git sha on the master branch: https://github.com/owid/data-catalog. It links every catalog snapshot back to the sha that built it, and likewise there is a git note on every etl commit linking back to the snapshot that contains that data. That's a pretty strong starting place for archiving data.
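
For reference, pulling up the note for a given commit looks roughly like this (a sketch that assumes the notes are on git's default refs/notes/commits ref, which may not match how we actually store them):

```python
# Sketch only; assumes the notes live on git's default refs/notes/commits ref.
import subprocess


def snapshot_note(sha: str = "HEAD") -> str:
    """Return the git note attached to a commit, e.g. the pointer back to the catalog snapshot."""
    # Notes aren't fetched by default, so pull them down first.
    subprocess.run(["git", "fetch", "origin", "refs/notes/*:refs/notes/*"], check=True)
    return subprocess.check_output(["git", "notes", "show", sha], text=True).strip()


if __name__ == "__main__":
    print(snapshot_note("HEAD"))
```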

If we wanted to make this more user-friendly for the public or for automation, we could leave a trail of snapshots every month/quarter in a public S3 bucket.

pabloarosado commented 1 year ago

Having old versions of datasets is one of the things I value most about the ETL. By that I mean having access both to the data and to the code that generated it. I very often load old and new versions to check whether some weirdness in the data is new or was there already. For example, people often ask what happened to a particular country or a specific indicator that used to be there and now is not. Currently we are "archiving steps" because we are constantly changing things, but going forward we'll hopefully reach a more stable state, and then I'd really appreciate keeping all versions (maybe not strictly all; sometimes there are, as Mojmir mentioned, versions with very minor changes). I know that "everything is in git", but my intuition is that once you start going back and forth between git versions, it becomes messy to build recipes and to compare versions properly. If we figure out a feasible way to do so, great, but otherwise I'd find it a big loss if we had to archive old steps by construction.

Marigold commented 1 year ago

@larsyencken we've recently stopped pruning old datasets in the S3 catalog. We ran into an issue with explorers where we accidentally deleted the old version and the explorer stopped working, so we thought it'd be safer to keep old versions in the catalog. (I don't have an opinion on whether that's good or bad.)

Marigold commented 1 year ago

I'd be happy with moving the code of all archived steps to an /archive folder (with the help of an archive-steps command).

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.