Open FedericoCeratto opened 8 months ago
The broken data starts from at least March 2022. I will run some scripts to figure out the exact range of affected data.
I plotted the data, and the broken buckets appear to span 2020-10-20 to 2022-08-04.
All this data needs to be recompressed and re-uploaded.
Example of JSONL vs uncompressed tarballs:
file *.gz
2021010301_IT_webconnectivity.n0.0.jsonl.gz: gzip compressed data
2021010301_IT_webconnectivity.n0.0.tar.gz: POSIX tar archive (GNU)
200K 2021010301_IT_webconnectivity.n0.0.jsonl.gz
1.4M 2021010301_IT_webconnectivity.n0.0.tar.gz
Before Aug 2022, postcan tarballs were uploaded without compression (see commit f120efbd65e849a1679e870812c859874a578a95 in the old api repository for the related change). The old postcans should be downloaded from S3, compressed with gzip, and re-uploaded.
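A rough sketch of the local "check and compress" step, assuming the postcans have already been downloaded to disk. It mirrors the `file *.gz` check above by looking for the gzip magic bytes (`1f 8b`) at the start of the file, and gzips only the files that lack them. The function names `is_gzip` and `compress_if_needed` are hypothetical, and the S3 download and upload steps are deliberately left out.

```python
import gzip
import shutil
from pathlib import Path

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzip(path: Path) -> bool:
    """Return True if the file starts with the gzip magic bytes."""
    with path.open("rb") as f:
        return f.read(2) == GZIP_MAGIC


def compress_if_needed(path: Path) -> bool:
    """Gzip `path` in place unless it is already gzip data.

    Returns True if the file was rewritten, False if it was left alone.
    (Hypothetical helper, not part of any existing OONI tooling.)
    """
    if is_gzip(path):
        return False
    # Write to a temporary sibling file first, then swap it in, so an
    # interrupted run does not leave a half-written postcan behind.
    tmp = path.with_name(path.name + ".tmp")
    with path.open("rb") as src, gzip.open(tmp, "wb") as dst:
        shutil.copyfileobj(src, dst)
    tmp.replace(path)
    return True
```

Running `compress_if_needed` over each downloaded `*.tar.gz` is idempotent: files that are already compressed are skipped, so the same pass can safely cover the whole 2020-10-20 to 2022-08-04 range.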