owid / etl

A compute graph for loading and transforming OWID's data
https://docs.owid.io/projects/etl
MIT License
76 stars 20 forks source link

ETL build error due to size mismatch of snapshot file #2333

Closed larsyencken closed 6 months ago

larsyencken commented 7 months ago

Problem

We're getting ETL build errors due to problems with snapshot data.

Example

https://buildkite.com/our-world-in-data/etl-build-public-datasets-master/builds/4147#018dd11b-e70d-4f42-9afe-f45479aabe5d

ValueError: Size mismatch for /var/lib/buildkite-agent/builds/etl-build-2-9/our-world-in-data/etl-build-public-datasets-master/data/snapshots/un/2018/igme.csv: expected 62755821, got 62516710

Technical notes

Marigold commented 7 months ago

Checksums (and file sizes in DVC files) have been already fixed, but Cloudflare is still serving old files from https://snapshots.owid.io. I temporarily switched to downloading directly from R2 if you have valid config (so deployments and users should be ok), but if you don't and are downloading from https://snapshots.owid.io then you're gonna get mismatch. Only CI that builds public datasets is affected

I tried purging Cloudflare cache, but without luck. My hope is that it'll get fixed by tomorrow. If not, I'll see what I can do (maybe bypassing custom domain temporarily?)

Marigold commented 6 months ago

Cloudflare cache got resetted overnight, but it turned out there was still a bug. I was stripping Windows newlines from downloaded files and that caused the size mismatch.

larsyencken commented 6 months ago

Ah, how funky. So as we download we calculate a checksum with windows newlines removed, but the actual file on disk does not have newlines removed. 🤯

Another way of saying this is that we are using newline-invariant checksums for files in snapshots. Would our lives be easier if we changed to strict checksums and didn't try to dos2unix anything? It might mean re-uploading the windows files that were snapshotted.

Marigold commented 6 months ago

Would our lives be easier if we changed to strict checksums and didn't try to dos2unix anything?

Yes... the only real reason for dos2unix was compatibility with DVC, which is gone (I don't think any Windows users would be editing our files). Even though I'm inclined to say, "It works now, then why change it?" I think it might get us into trouble in the long term. I'll run the script to get those files and reupload them (and also increase ETL epoch which is something we've wanted to do for some time).