Delete steps instead of archiving them

Marigold commented 1 month ago

Background

We release new versions of datasets all the time, and after updating one we often archive the old version. We do this because we don't want to waste time rebuilding the old version again and again, but we still want its code and old dependencies (ingredients) nearby for reference.

New information

A recent poll showed that no one is using archived steps. It's important to distinguish between archived code, archived datasets, and archived steps in the DAG. We have some mechanisms for archiving steps (typically in the form of if isArchived: ...), which add extra work whenever I run into them. We also don't have a way to archive snapshots, which would have been useful recently.

Most archived steps likely won't work because our core codebase has changed, and we don't maintain archived steps. It's usually easier to go back in the git history and rerun the process. What could be useful, however, are archived datasets in the catalog, so you wouldn't need to rerun the ETL just to rebuild them. These datasets could remain in the catalog indefinitely, hidden from the public.

Proposal

Delete all archived steps
Get rid of now redundant code that has to check if a step is archived
Keep all archived datasets
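
Before deleting, it may be worth checking which archived steps are still referenced by an active step. A minimal sketch, assuming each DAG is a plain dict mapping a step URI to the set of its dependencies; the function split_archived and the step names are hypothetical and not part of etl:

```python
from typing import Dict, Set, Tuple

# Assumed shape: step URI -> set of dependency URIs.
Dag = Dict[str, Set[str]]


def split_archived(active: Dag, archived: Dag) -> Tuple[Set[str], Set[str]]:
    """Split archived steps into (safe to delete, still used by an active step)."""
    # Collect every dependency mentioned anywhere in the active DAG.
    needed = set().union(*active.values())
    still_used = needed.intersection(archived)
    return set(archived) - still_used, still_used


# Hypothetical step names, for illustration only.
active = {"data://grapher/covid/latest/poverty": {"data://garden/wb/2022-01-01/pip"}}
archived = {
    "data://garden/wb/2022-01-01/pip": set(),
    "data://garden/wb/2021-01-01/pip": set(),
}
deletable, still_used = split_archived(active, archived)
```

Anything in still_used would need to be moved back to the active DAG, or its dependents updated, before the archived steps are deleted.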

paarriagadap commented 1 month ago

Related to this, I just had an issue with archiving a previous version of World Bank PIP. I couldn't find any use of the old version elsewhere, so I moved those steps to dag/archive, but the merge couldn't deploy because of an ambiguous error: etl.helpers.CurrentStepMustBeInDag: Current step must be listed in the dag.

It turned out that it was actually used in the covid dag, so better error messages might also be needed in this area.
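
One way the error could be made more helpful is to name the DAG files that still reference the step. This is only a sketch: it assumes the DAG files are already loaded into a dict of file name -> {step: dependencies}, and StepStillInUse, find_users, and check_can_archive are hypothetical names, not existing etl helpers.

```python
from typing import Dict, List, Set

# Assumed shape: step URI -> set of dependency URIs.
Dag = Dict[str, Set[str]]


class StepStillInUse(Exception):
    """Hypothetical error raised when a step being archived is still a dependency."""


def find_users(step: str, dag_files: Dict[str, Dag]) -> List[str]:
    """Return 'file: step' entries for every active step that depends on `step`."""
    users = []
    for dag_file in sorted(dag_files):
        for user in sorted(dag_files[dag_file]):
            if step in dag_files[dag_file][user]:
                users.append(f"{dag_file}: {user}")
    return users


def check_can_archive(step: str, dag_files: Dict[str, Dag]) -> None:
    """Raise an error that lists where `step` is still used, if anywhere."""
    users = find_users(step, dag_files)
    if users:
        raise StepStillInUse(
            f"Cannot archive {step}; it is still used by:\n  " + "\n  ".join(users)
        )
```

With a message like this, the covid dag would have shown up in the error itself instead of needing to be tracked down by hand.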

larsyencken commented 2 weeks ago

Queued for discussion at a data chat once Lucas is back.

larsyencken commented 1 week ago

This is very similar to another discussion we had in triage, so I'm closing it for now. It can be re-opened depending on how much pain we get from archived steps.
