There was a bug coming from some fast-track datasets that prevented ETL from building.
While investigating, I noticed a couple of problems in the Fast-Track workflow that I don't fully understand. For the time being, we have commented out these steps (so they are not built) so that ETL can build properly (see https://github.com/owid/etl/pull/2821).
Brief summary
The steps that were failing were:
- `draft_joe_gini_diff_1980_2018`
- `draft_joe_gini_diff_1980_2018_take2`
- `draft_joe_top1share_diff_1980_2018`

All of them fail due to a duplicate index. The error can be reproduced by uncommenting these steps in `dag/fasttrack.yml` and running `etl run <step_name>`.
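The duplicate-index failure can be reproduced in miniature with pandas (a hedged sketch; the actual index columns and data in these steps are assumptions, not taken from the real datasets):

```python
import pandas as pd

# Hypothetical rows resembling a fast-track sheet where trailing rows
# carry only a 'year' value, so (country, year) is no longer unique.
df = pd.DataFrame(
    {
        "country": ["France", "France", None, None],
        "year": [1980, 2018, 2020, 2020],
        "gini_diff": [0.02, 0.03, None, None],
    }
)

# This is the kind of uniqueness check that fails inside the ETL step:
try:
    df.set_index(["country", "year"], verify_integrity=True)
except ValueError as e:
    print(f"duplicate index detected: {e}")
```

The two trailing rows share the same `(NaN, 2020)` index key, which is enough to trip the integrity check.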
I've tried to trace these files back to their Google Sheets to see if anything is wrong there.
Example: `draft_joe_gini_diff_1980_2018`
I looked at the snapshot, which points to the sheet "DRAFT: Joe – Difference in Gini, 1980-2018, PIP vs WID data". I believe this is the file.
In the spreadsheet, I saw that the 'year' column extended beyond the actual data: there were many otherwise-empty rows that had a single value (2020) in the 'year' column.
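One way to guard against such trailing rows would be to drop rows that are empty apart from the 'year' column before setting the index (a sketch only; column names are illustrative and this is not how fast-track currently cleans the sheet):

```python
import pandas as pd

# Simulate a sheet where 'year' was auto-filled beyond the real data range.
raw = pd.DataFrame(
    {
        "country": ["France", "Spain", None, None],
        "year": [2018, 2018, 2020, 2020],
        "gini_diff": [0.02, -0.01, None, None],
    }
)

# Drop rows that have no data in any column other than 'year'.
data_cols = [c for c in raw.columns if c != "year"]
clean = raw.dropna(subset=data_cols, how="all")

# The remaining rows have a unique (country, year) combination again.
assert not clean.duplicated(subset=["country", "year"]).any()
```

With the trailing rows gone, `set_index(["country", "year"], verify_integrity=True)` would succeed.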
I then went to the Fast-Track app and re-imported the dataset using the "Existing Google Sheet" option.
After running it, I pulled the latest changes from master and ran `etl run ...` again. It kept failing.
I then looked at the actual data being loaded by the grapher step, and it looks as if a different dataset is being loaded: my changes in the Google Spreadsheet were not reflected after running `snap.read()`. Instead, the data still contained the error.
I then looked at the Fast-Track app in the admin and saw that, as you proceed through the different steps, it shows a link to the spreadsheet (see "Importing sheet" below).
If you click on the linked sheet, you'll see that it is not in sync with the one I previously edited.
I'm not sure what's going on here, but what is being read in ETL does not correspond to what I see on Google Drive.
This could be because: (i) I'm editing a different file, or (ii) there is an error in the snapshot's link to the Google Sheet.
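A quick way to rule out (i) would be to compare the spreadsheet ID embedded in the snapshot's source URL with the ID of the sheet being edited (a sketch; the URLs below are placeholders, not the real ones from the snapshot):

```python
import re
from typing import Optional


def sheet_id(url: str) -> Optional[str]:
    """Extract the spreadsheet ID from a Google Sheets URL."""
    m = re.search(r"/spreadsheets/d/([a-zA-Z0-9_-]+)", url)
    return m.group(1) if m else None


# Placeholder URLs -- substitute the snapshot's URL and the edited sheet's URL.
snapshot_url = "https://docs.google.com/spreadsheets/d/AAA111/edit"
edited_url = "https://docs.google.com/spreadsheets/d/BBB222/edit#gid=0"

if sheet_id(snapshot_url) != sheet_id(edited_url):
    print("The snapshot points to a different spreadsheet than the one edited.")
```

If the IDs match, the problem would instead be on the snapshot side (ii), e.g. a cached or stale copy being served.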
Further comments
Adding data via Fast-Track can be risky: no CI/CD results are shown to the user, so it is possible to add data that breaks our ETL deploy jobs.
This particular example seems to be experimental work. Can Fast-Track be used on staging servers? If so, we should use it there for experimental work. If not, we should probably consider supporting that.