ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Postcans from Oct 2020 to Aug 2022 are not compressed #763

Open FedericoCeratto opened 8 months ago

FedericoCeratto commented 8 months ago

Before Aug 2022, postcan tarballs were uploaded without compression. (See commit f120efbd65e849a1679e870812c859874a578a95 in the old api repository for the related change.) The old postcans should be downloaded from S3, compressed with gzip and re-uploaded.
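A minimal sketch of the compression step, assuming the tarball has already been downloaded to local disk. The function name `gzip_tarball` and the paths are hypothetical, and the S3 download and upload are deliberately left out:

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gzip_tarball(src: Path, dst: Path) -> None:
    """Gzip-compress the plain tar at `src` into `dst` (hypothetical helper)."""
    # Stream the copy so a large postcan is never held fully in memory.
    with src.open("rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    # Sanity check: the result must be readable as a gzip-compressed tar.
    with tarfile.open(dst, "r:gz") as tar:
        tar.getmembers()
```

Writing to a separate `dst` keeps the original file untouched until the compressed copy has been verified.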

hellais commented 8 months ago

The broken data starts from at least March 2022. I will run some scripts to figure out the exact range of affected data.

hellais commented 8 months ago

So I did some plots and it seems like the broken buckets are from 2020-10-20 up to 2022-08-04.

All this data needs to be recompressed and re-uploaded.
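To drive the recompression, the affected days can be enumerated along these lines. This is only a sketch: it assumes the range above is inclusive on both ends and that a `YYYYMMDD` string is a useful key, neither of which is confirmed here. `affected_days` is a hypothetical name.

```python
from datetime import date, timedelta

# Range taken from the comment above; treated as inclusive (an assumption).
FIRST_BROKEN = date(2020, 10, 20)
LAST_BROKEN = date(2022, 8, 4)


def affected_days(first=FIRST_BROKEN, last=LAST_BROKEN):
    """Yield each day in the affected range as a YYYYMMDD string."""
    day = first
    while day <= last:
        yield day.strftime("%Y%m%d")
        day += timedelta(days=1)
```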

FedericoCeratto commented 8 months ago

Example of a JSONL file vs. an uncompressed tarball. Both carry a .gz suffix, but only the JSONL file is actually gzip-compressed:

file *.gz
2021010301_IT_webconnectivity.n0.0.jsonl.gz:  gzip compressed data
2021010301_IT_webconnectivity.n0.0.tar.gz:    POSIX tar archive (GNU)

200K 2021010301_IT_webconnectivity.n0.0.jsonl.gz 
1.4M 2021010301_IT_webconnectivity.n0.0.tar.gz
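The same check can be scripted by looking at magic bytes instead of the file extension. This is a rough sketch, not a replacement for `file`: it only knows the gzip signature (`1f 8b`) and the `ustar` marker at offset 257 of a tar header, and `sniff` is a hypothetical name.

```python
from pathlib import Path


def sniff(path: Path) -> str:
    """Return "gzip", "tar" or "unknown" based on the file's first bytes."""
    with path.open("rb") as f:
        head = f.read(512)  # one tar header block is enough
    if head[:2] == b"\x1f\x8b":
        return "gzip"
    # Both POSIX and GNU tar headers carry "ustar" at offset 257.
    if head[257:262] == b"ustar":
        return "tar"
    return "unknown"
```

A file ending in `.tar.gz` for which this returns "tar" is one of the tarballs that still needs compressing.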