Open pabloarosado opened 4 months ago
Hi! I am starting to update World Bank PIP. I selected all the grapher/explorer steps containing `world_bank_pip`, I added them to the Operations list, then I added all dependencies for the three steps I selected and then I updated the 7 resulting steps.
The files were generated, but the dag was not updated the way I expected:
The garden dependency from the explorer step was replaced, I assume because the platform always considers a date by default and not `latest`.
The updated steps were added at the bottom without the explorer step mentioned above:
The step updater deleted the comment to distinguish another set of steps from the World Inequality Database:
The step updater also adds the steps right next to the previous dependencies. It would be nice to have a blank line between them:
I think the new folders and files work fine, so I only need to correct the dag. Thank you very much!
The step updater gets rid of the wizard formatting, which might be clearer to read.
Hi @paarriagadap, thanks a lot for reporting these issues! The issue of removing comments is already known (listed above). I'll try to fix that soon.
If I understand correctly, the issues are related to the formatting of the written files (either the dag or the snapshot metadata files), but the step updater is behaving as expected. In other words, the code generated by the tool is correct, although not great in terms of style. Is that right?
Currently, the step updater either (1) writes new steps in the dag, or (2) overwrites dependencies of existing steps. Steps with "latest" version correspond to case (2) because the step already exists, and its dependencies need to be updated. So you would prefer those updated "latest" steps to be moved to the bottom of the dag, as if they were new steps. Is that right?
Please let me know if I misunderstood your issues. Thanks!
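To make the two cases concrete, here is a minimal sketch of the behaviour described above. It is not the actual `StepUpdater` code: the dag is modelled as a plain dict, and `with_version`, `update_steps`, the step names and the dates are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List

Dag = Dict[str, List[str]]


def with_version(step: str, version: str) -> str:
    """Replace the version segment (second-to-last path part) of a step URI."""
    parts = step.split("/")
    parts[-2] = version
    return "/".join(parts)


def update_steps(dag: Dag, steps: List[str], new_version: str) -> Dag:
    """Return a copy of the dag after updating the selected steps."""
    # Dated steps get a new name; "latest" steps keep theirs.
    renamed = {
        s: with_version(s, new_version)
        for s in steps
        if s.split("/")[-2] != "latest"
    }
    new_dag = {step: list(deps) for step, deps in dag.items()}
    for step in steps:
        # Point dependencies at the new versions of any updated steps.
        new_deps = [renamed.get(dep, dep) for dep in dag[step]]
        if step in renamed:
            # Case (1): a new step is written at the bottom of the dag.
            new_dag[renamed[step]] = new_deps
        else:
            # Case (2): the step already exists, so its dependencies
            # are overwritten in place and it keeps its position.
            new_dag[step] = new_deps
    return new_dag


# Illustrative dag: one dated garden step and one "latest" explorer step.
dag = {
    "data://garden/wb/2024-01-17/world_bank_pip": [
        "snapshot://wb/2024-01-17/world_bank_pip.csv",
    ],
    "data://explorers/wb/latest/world_bank_pip": [
        "data://garden/wb/2024-01-17/world_bank_pip",
    ],
}
new_dag = update_steps(dag, list(dag), "2024-03-27")
```

In this sketch the explorer step stays where it was and only its garden dependency changes, while the new garden step is appended at the end. That matches the behaviour reported in this thread, where the "latest" step was not moved to the bottom with the new steps.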
Hi @pabloarosado, yes, it's mostly formatting, and it would be better for the "latest" steps to go to the bottom (or to be replicated in both the old and new steps). Thank you!
A tiny one I just found: I had an additional script to extract the data from PIP in the snapshot folder (`pip_api.py` here), and it wasn't copied to the new step, whereas the `shared.py` scripts in garden were carried over to the new steps.
Thanks for the clarification, I added the suggestion about "latest" steps to the list of improvements.
Regarding the `pip_api.py` file, I think that's expected. The expected scenario is that there is only one code file per snapshot. In this case, that file is actually not generating a snapshot, so it's a bit of a special case. Maybe we can figure out a solution for it, but I guess it's very uncommon.
Should we allocate time for this during this cycle, @pabloarosado?
One-liner
This tracking issue will list all problems and improvements for the ETL dashboard and StepUpdater.
Issues
`steps_df` (from `StepUpdater`) contains only active steps (since they are the only ones needed). However, "direct_usages" (and probably other columns of dependencies) also includes archive steps. This means that, in the Operations list, when adding direct usages, archive steps suddenly appear (and the dashboard also raises an `IndexError` when trying to access archive steps in `steps_df`). Fixed by https://github.com/owid/etl/pull/2448
`long_term_crop_yields` to the Operations list. `long_term_crop_yields`. `long_term_wheat_yields`. `faostat_qcl`.
Improvements and ideas