owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License

wizard: problems and improvements for etl dashboard and step updater #2365

Open pabloarosado opened 4 months ago

pabloarosado commented 4 months ago

One-liner

This tracking issue will list all problems and improvements for the ETL dashboard and StepUpdater.

Issues

Improvements and ideas

paarriagadap commented 3 months ago

Hi! I am starting to update World Bank PIP. I selected all the grapher/explorer steps containing world_bank_pip and added them to the Operations list. Then I added all dependencies for the three steps I selected and updated the 7 resulting steps.

The files were generated, but the dag was not updated the way I expected:

The step updater also adds the steps right next to the previous dependencies. It would be nice to have a blank line between them:

image
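To make the blank-line suggestion concrete, here is a minimal sketch of rendering dag entries with an empty line between steps. It is not the StepUpdater's actual writer: the function name `render_steps`, the two-space indentation, and the step names in the example are assumptions made for illustration only.

```python
def render_steps(steps, indent="  "):
    """Render a mapping of step -> dependencies as YAML-like text.

    Hypothetical helper: each step becomes its own block, and blocks are
    separated by one empty line so that a newly written step does not sit
    directly under the dependencies of the previous one.
    """
    blocks = []
    for step, dependencies in steps.items():
        lines = [f"{indent}{step}:"]
        lines += [f"{indent * 2}- {dep}" for dep in dependencies]
        blocks.append("\n".join(lines))
    # One empty line between consecutive step blocks, one trailing newline.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Made-up step names, only to show the layout.
    example = {
        "data://garden/example/2024-01-01/dataset": [
            "data://meadow/example/2024-01-01/dataset",
        ],
        "data://grapher/example/2024-01-01/dataset": [
            "data://garden/example/2024-01-01/dataset",
        ],
    }
    print(render_steps(example))
```

Empty lines between entries are ignored by YAML parsers, so this kind of spacing only affects readability.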

I think the new folders and files work fine, so I only need to correct the dag. Thank you very much!

paarriagadap commented 3 months ago

The step updater also gets rid of the formatting written by the wizard, which might be clearer to read.

image image

pabloarosado commented 3 months ago

Hi @paarriagadap, thanks a lot for reporting these issues! The issue of removing comments is already known (listed above). I'll try to fix that soon.

If I understand correctly, the issues are related to the formatting of the written files (either the dag or the snapshot metadata files), but the step updater is behaving as expected. In other words, the code generated by the tool is correct, although not great in terms of style. Is that right?

Currently, the step updater either (1) writes new steps in the dag, or (2) overwrites dependencies of existing steps. Steps with "latest" version correspond to case (2) because the step already exists, and its dependencies need to be updated. So you would prefer those updated "latest" steps to be moved to the bottom of the dag, as if they were new steps. Is that right?
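As a rough illustration of these two cases (not the actual StepUpdater code), here is a minimal sketch that assumes the dag is held as an ordered mapping from each step to its list of dependencies. The function name `update_dag`, the `move_existing_to_bottom` flag, and the step names are all hypothetical.

```python
def update_dag(dag, step, dependencies, move_existing_to_bottom=False):
    """Return a copy of `dag` with `step` pointing at `dependencies`.

    Case (1): `step` is new, so it is appended at the bottom.
    Case (2): `step` already exists (e.g. a "latest" step), so its
    dependencies are overwritten. By default it keeps its position;
    with `move_existing_to_bottom=True` it is re-added at the bottom,
    as if it were a new step.
    """
    updated = dict(dag)  # shallow copy; dicts preserve insertion order
    if move_existing_to_bottom and step in updated:
        del updated[step]
    # Assigning to an existing key keeps its position; a new key goes last.
    updated[step] = list(dependencies)
    return updated


if __name__ == "__main__":
    # Made-up step names, only to show the ordering behaviour.
    dag = {
        "explorers/example/latest/dataset": ["garden/example/2023-01-01/dataset"],
        "grapher/other/2023-01-01/other": ["garden/other/2023-01-01/other"],
    }
    new_deps = ["garden/example/2024-01-01/dataset"]

    in_place = update_dag(dag, "explorers/example/latest/dataset", new_deps)
    at_bottom = update_dag(
        dag, "explorers/example/latest/dataset", new_deps,
        move_existing_to_bottom=True,
    )
    print(list(in_place))
    print(list(at_bottom))
```

The flag is only one way to express the preference discussed here; replicating the step next to both the old and the new steps would need a different representation, since a mapping cannot hold the same key twice.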

Please let me know if I misunderstood your issues. Thanks!

paarriagadap commented 3 months ago

Hi @pabloarosado, yes, it's mostly formatting, and it would be better if the latest steps went to the bottom (or were replicated in both old and new steps). Thank you!

A tiny one I found now is that I had an additional script to extract the data from PIP in the snapshot folder (pip_api.py here), and it wasn't copied to the new step, while there are shared.py scripts in garden that were carried over to the new steps.

pabloarosado commented 3 months ago

Thanks for the clarification, I added the suggestion about "latest" steps to the list of improvements.

Regarding the pip_api.py file, I think that's expected. The expected scenario is that there is only one code file per snapshot. In this case, that file is actually not generating a snapshot, so it's a bit of a special case. Maybe we can figure out a solution for it, but I guess it's very uncommon.

lucasrodes commented 3 weeks ago

Should we allocate time for this during this cycle, @pabloarosado?