ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Duplicate data present in some s3 buckets #613

Open hellais opened 1 year ago

hellais commented 1 year ago

I have noticed that certain old buckets contain duplicate measurement data, which can be confusing for somebody analysing the data from the JSONL outputs. If they are not careful, they might end up double-counting measurements.

A case in point is whatsapp data for Italy from the 20180101 bucket, where we have the following report_ids present:

['20171231T091557Z_AS12874_pnMQzVEE8ILhRBSSmiRJIxGHhpdwRLMwvPtU7y8WwOcoDcmiqG', '20171231T105131Z_AS12874_98QAaH11uDox6sJH1YuYF1n3aXNXF5VTk4YX49UQDaTWUYTnjr', '20171231T105316Z_AS12874_miArFfG7Br19RmVZWYBDZKzTtzraYIpOQPYzvMF4RTeyolKTwb', '20171231T134433Z_AS3269_06X59S6ReknS3PbSEuUMqWhbJPcTGwQr4x0ltwMnZVSmfTUR79', '20171231T091557Z_AS12874_pnMQzVEE8ILhRBSSmiRJIxGHhpdwRLMwvPtU7y8WwOcoDcmiqG', '20171231T105131Z_AS12874_98QAaH11uDox6sJH1YuYF1n3aXNXF5VTk4YX49UQDaTWUYTnjr', '20171231T105316Z_AS12874_miArFfG7Br19RmVZWYBDZKzTtzraYIpOQPYzvMF4RTeyolKTwb', '20171231T134433Z_AS3269_06X59S6ReknS3PbSEuUMqWhbJPcTGwQr4x0ltwMnZVSmfTUR79']
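A quick way to confirm the duplication is to count how often each report_id occurs. This is only a minimal sketch; `find_duplicate_report_ids` is a hypothetical helper name, and the list is rebuilt from the four distinct IDs shown above.

```python
from collections import Counter


def find_duplicate_report_ids(report_ids):
    """Return {report_id: count} for every report_id seen more than once."""
    counts = Counter(report_ids)
    return {rid: n for rid, n in counts.items() if n > 1}


# The four distinct report_ids from the list above.
distinct_ids = [
    "20171231T091557Z_AS12874_pnMQzVEE8ILhRBSSmiRJIxGHhpdwRLMwvPtU7y8WwOcoDcmiqG",
    "20171231T105131Z_AS12874_98QAaH11uDox6sJH1YuYF1n3aXNXF5VTk4YX49UQDaTWUYTnjr",
    "20171231T105316Z_AS12874_miArFfG7Br19RmVZWYBDZKzTtzraYIpOQPYzvMF4RTeyolKTwb",
    "20171231T134433Z_AS3269_06X59S6ReknS3PbSEuUMqWhbJPcTGwQr4x0ltwMnZVSmfTUR79",
]
# In the observed output the same four IDs appear a second time, in the same order.
observed = distinct_ids + distinct_ids

duplicates = find_duplicate_report_ids(observed)
for rid, n in duplicates.items():
    print(f"{rid}: seen {n} times")
```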

Upon inspecting the content of the bucket I can see two files with a different filename format and a different timestamp:

% aws s3 ls --no-sign-request s3://ooni-data-eu-fra/jsonl/whatsapp/IT/20200101/00/
2021-07-09 12:34:20      39562 20200101_IT_whatsapp.l.0.jsonl.gz
2022-05-19 21:57:02      39526 20200101_IT_whatsapp.x.411d9d2e15f1bd08.jsonl.gz
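To spot directories that mix both filename formats, something like the following could be used. It is a sketch under assumptions: the regular expression is inferred from the two filenames in this listing only, the names `FILENAME_RE` and `group_by_kind` are hypothetical, and I am assuming the single letter after the test name (`l` or `x`) is what distinguishes the two generations of files.

```python
import re
from collections import defaultdict

# Pattern inferred from the two filenames in the listing above (an assumption,
# not a documented naming scheme): <date>_<cc>_<test>.<kind>.<suffix>.jsonl.gz
FILENAME_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<cc>[A-Z]{2})_(?P<test>[a-z0-9_]+)"
    r"\.(?P<kind>[a-z])\.(?P<suffix>[0-9a-f]+)\.jsonl\.gz$"
)


def group_by_kind(filenames):
    """Group filenames by the one-letter kind marker; unparsable names go under 'unknown'."""
    groups = defaultdict(list)
    for name in filenames:
        match = FILENAME_RE.match(name)
        kind = match.group("kind") if match else "unknown"
        groups[kind].append(name)
    return dict(groups)


listing = [
    "20200101_IT_whatsapp.l.0.jsonl.gz",
    "20200101_IT_whatsapp.x.411d9d2e15f1bd08.jsonl.gz",
]
groups = group_by_kind(listing)
if len(groups) > 1:
    print("directory mixes filename formats:", sorted(groups))
```

A directory that contains more than one kind is only a candidate for duplication; the file contents still need to be compared before anything is deleted.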

I believe this is due to the reprocessing that was done at some point, but the old files were left around.

We should clean these duplicates up or at least document how somebody can easily distinguish between the two and avoid double counting.
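Until the buckets are cleaned up, a consumer could guard against double counting while reading the JSONL files. The sketch below is not an official recommendation: `iter_measurements` and `dedupe` are hypothetical helpers, and the choice of `(report_id, input, measurement_start_time)` as the identity of a measurement is my assumption, so it may need adjusting for tests where that triple is not unique.

```python
import gzip
import json


def iter_measurements(paths):
    """Yield one parsed measurement per non-empty line of each .jsonl.gz file."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def dedupe(measurements):
    """Yield only the first measurement seen for each assumed identity key."""
    seen = set()
    for msmt in measurements:
        key = (
            msmt.get("report_id"),
            # Serialise the input so the key stays hashable even if it is not a string.
            json.dumps(msmt.get("input"), sort_keys=True),
            msmt.get("measurement_start_time"),
        )
        if key in seen:
            continue
        seen.add(key)
        yield msmt
```

Usage would look like `unique = list(dedupe(iter_measurements(local_paths)))`, where `local_paths` are files already downloaded from the bucket directory.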