mozilla / docker-etl

Collection of dockerized ETL jobs managed by data engineering.
Mozilla Public License 2.0

[dev dap collector] support more time precision settings on tasks #234

Closed dmueller closed 2 months ago

dmueller commented 2 months ago

https://mozilla-hub.atlassian.net/browse/AE-457

problem

As we work towards the MVP of PPA, we still haven't settled on what time precision tasks will need to use. This setting depends on how many reports advertisers' websites will send, so it can reasonably vary between advertisers.

The dev tasks we have created so far use a 5-minute time precision, and we're mostly considering values in the range of 5 minutes to 7 days.

solution

Expand the functionality of the collector so that it can support more time precision settings for tasks.

The collector has been adjusted to support any time precision that either:

  1. evenly divides a day or
  2. is a whole multiple of a day
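
As a minimal sketch of the two conditions above (not the collector's actual code), the check could look like the following. The function name `is_supported_time_precision` and the choice to express the precision in seconds are assumptions made for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def is_supported_time_precision(precision_seconds: int) -> bool:
    """Return True if the precision evenly divides a day or is a whole multiple of a day.

    Hypothetical helper: the name and the seconds-based unit are assumptions,
    not taken from the collector's source.
    """
    if precision_seconds <= 0:
        return False
    divides_day = SECONDS_PER_DAY % precision_seconds == 0
    multiple_of_day = precision_seconds % SECONDS_PER_DAY == 0
    return divides_day or multiple_of_day


if __name__ == "__main__":
    # Precisions mentioned in this PR, plus one (7 hours) that satisfies neither rule.
    for label, seconds in [
        ("5 minutes", 5 * 60),
        ("1 hour", 60 * 60),
        ("1 day", SECONDS_PER_DAY),
        ("2 days", 2 * SECONDS_PER_DAY),
        ("1 week", 7 * SECONDS_PER_DAY),
        ("7 hours", 7 * 60 * 60),
    ]:
        print(label, is_supported_time_precision(seconds))
```

Under these rules the precisions discussed here (5 minutes, 1 hour, 1 day, 2 days, 1 week) are all accepted, while something like 7 hours is rejected because it neither divides a day nor is a multiple of one.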

When it runs:

The Airflow job that runs this will also be adjusted so that it runs at a daily cadence. I'll do that once this review is ready. Similarly, I'll make the same changes to the prod setup.

testing

Created dev tasks with time precision settings of 1 hour, 1 day, 2 days, and 1 week. Ran the job locally and printed the results to confirm that the collector was queried when appropriate (though every response was BATCH TOO SMALL, since these tasks haven't received any reports yet).

# run collector not at the start of a day, collector did not request any data for the tasks
python3 main.py --date=2024-06-24T19:15:00Z
Now processing task: 5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU
{'counts': [], 'reports': [{'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': datetime.datetime(2024, 6, 24, 19, 15, tzinfo=datetime.timezone.utc), 'metric_type': 'vector', 'collection_time': '1720624096.684044', 'collection_duration': 0, 'error': 'Collector triggered not at the start of a day 2024-06-24T19:15:00Z'}]}

# query for an hourly aggregated task, requested hourly data for the whole day prior
python3 main.py --date=2024-06-24T00:00:00Z

Now processing task: 5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU
1720624129.063142 Collecting 2024-06-23 00:00:00+00:00 - 2024-06-23 01:00:00+00:00
1720624129.063195 Collecting 2024-06-23 01:00:00+00:00 - 2024-06-23 02:00:00+00:00
1720624129.06322 Collecting 2024-06-23 02:00:00+00:00 - 2024-06-23 03:00:00+00:00
1720624129.063245 Collecting 2024-06-23 03:00:00+00:00 - 2024-06-23 04:00:00+00:00
1720624129.063265 Collecting 2024-06-23 04:00:00+00:00 - 2024-06-23 05:00:00+00:00
1720624129.063286 Collecting 2024-06-23 05:00:00+00:00 - 2024-06-23 06:00:00+00:00
1720624129.063304 Collecting 2024-06-23 06:00:00+00:00 - 2024-06-23 07:00:00+00:00
1720624129.063325 Collecting 2024-06-23 07:00:00+00:00 - 2024-06-23 08:00:00+00:00
1720624129.063348 Collecting 2024-06-23 08:00:00+00:00 - 2024-06-23 09:00:00+00:00
1720624129.063367 Collecting 2024-06-23 09:00:00+00:00 - 2024-06-23 10:00:00+00:00
1720624129.063195 Result: code 1
1720624129.538125 Collecting 2024-06-23 10:00:00+00:00 - 2024-06-23 11:00:00+00:00
1720624129.063142 Result: code 1
1720624129.546253 Collecting 2024-06-23 11:00:00+00:00 - 2024-06-23 12:00:00+00:00
1720624129.063265 Result: code 1
1720624129.555785 Collecting 2024-06-23 12:00:00+00:00 - 2024-06-23 13:00:00+00:00
1720624129.063367 Result: code 1
1720624129.55592 Collecting 2024-06-23 13:00:00+00:00 - 2024-06-23 14:00:00+00:00
1720624129.063348 Result: code 1
1720624129.555954 Collecting 2024-06-23 14:00:00+00:00 - 2024-06-23 15:00:00+00:00
1720624129.063286 Result: code 1
1720624129.564888 Collecting 2024-06-23 15:00:00+00:00 - 2024-06-23 16:00:00+00:00
1720624129.063304 Result: code 1
1720624129.565055 Collecting 2024-06-23 16:00:00+00:00 - 2024-06-23 17:00:00+00:00
1720624129.063325 Result: code 1
1720624129.565113 Collecting 2024-06-23 17:00:00+00:00 - 2024-06-23 18:00:00+00:00
1720624129.06322 Result: code 1
1720624129.565148 Collecting 2024-06-23 18:00:00+00:00 - 2024-06-23 19:00:00+00:00
1720624129.063245 Result: code 1
1720624129.565183 Collecting 2024-06-23 19:00:00+00:00 - 2024-06-23 20:00:00+00:00
1720624129.565055 Result: code 1
1720624129.708626 Collecting 2024-06-23 20:00:00+00:00 - 2024-06-23 21:00:00+00:00
1720624129.555785 Result: code 1
1720624129.708858 Collecting 2024-06-23 21:00:00+00:00 - 2024-06-23 22:00:00+00:00
1720624129.538125 Result: code 1
1720624129.708924 Collecting 2024-06-23 22:00:00+00:00 - 2024-06-23 23:00:00+00:00
1720624129.555954 Result: code 1
1720624129.708974 Collecting 2024-06-23 23:00:00+00:00 - 2024-06-24 00:00:00+00:00
1720624129.55592 Result: code 1
1720624129.546253 Result: code 1
1720624129.565113 Result: code 1
1720624129.564888 Result: code 1
1720624129.565148 Result: code 1
1720624129.565183 Result: code 1
1720624129.708924 Result: code 1
1720624129.708858 Result: code 1
1720624129.708626 Result: code 1
1720624129.708974 Result: code 1
{'reports': [{'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719104400, 'metric_type': 'vector', 'collection_time': '1720624129.063195', 'collection_duration': 0.47481283405795693, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719100800, 'metric_type': 'vector', 'collection_time': '1720624129.063142', 'collection_duration': 0.4829492080025375, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719115200, 'metric_type': 'vector', 'collection_time': '1720624129.063265', 'collection_duration': 0.4924679999239743, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719133200, 'metric_type': 'vector', 'collection_time': '1720624129.063367', 'collection_duration': 0.4925259998999536, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719129600, 'metric_type': 'vector', 'collection_time': '1720624129.063348', 'collection_duration': 0.4925886248238385, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719118800, 'metric_type': 'vector', 'collection_time': '1720624129.063286', 'collection_duration': 0.5015291669405997, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719122400, 'metric_type': 'vector', 'collection_time': '1720624129.063304', 'collection_duration': 0.5017256252467632, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719126000, 'metric_type': 'vector', 'collection_time': '1720624129.063325', 'collection_duration': 0.5017621251754463, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719108000, 'metric_type': 'vector', 'collection_time': '1720624129.06322', 'collection_duration': 0.5018983329646289, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719111600, 'metric_type': 'vector', 'collection_time': '1720624129.063245', 'collection_duration': 0.5019117500633001, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719158400, 'metric_type': 'vector', 'collection_time': '1720624129.565055', 'collection_duration': 0.14349116710945964, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719144000, 'metric_type': 'vector', 'collection_time': '1720624129.555785', 'collection_duration': 0.15295612486079335, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719136800, 'metric_type': 'vector', 'collection_time': '1720624129.538125', 'collection_duration': 0.17062800005078316, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719151200, 'metric_type': 'vector', 'collection_time': '1720624129.555954', 'collection_duration': 0.15299729071557522, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719147600, 'metric_type': 'vector', 'collection_time': '1720624129.55592', 'collection_duration': 0.1530827502720058, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719140400, 'metric_type': 'vector', 'collection_time': '1720624129.546253', 'collection_duration': 0.17473249975591898, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719162000, 'metric_type': 'vector', 'collection_time': '1720624129.565113', 'collection_duration': 0.15648162504658103, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719154800, 'metric_type': 'vector', 'collection_time': '1720624129.564888', 'collection_duration': 0.15683354204520583, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719165600, 'metric_type': 'vector', 'collection_time': '1720624129.565148', 'collection_duration': 0.15687908325344324, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719169200, 'metric_type': 'vector', 'collection_time': '1720624129.565183', 'collection_duration': 0.17167654167860746, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719180000, 'metric_type': 'vector', 'collection_time': '1720624129.708924', 'collection_duration': 0.15858970815315843, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719176400, 'metric_type': 'vector', 'collection_time': '1720624129.708858', 'collection_duration': 0.15869475016370416, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719172800, 'metric_type': 'vector', 'collection_time': '1720624129.708626', 'collection_duration': 0.15884491708129644, 'error': 'BATCH TOO SMALL'}, {'task_id': '5SEsTxqUHlFlicTOwf56ub1vOqnbU5wCuOf1q_TiyWU', 'slot_start': 1719183600, 'metric_type': 'vector', 'collection_time': '1720624129.708974', 'collection_duration': 0.15922337491065264, 'error': 'BATCH TOO SMALL'}], 'counts': []}

# query for a daily aggregated task, requested the data ending on the passed in date
python3 main.py --date=2024-06-24T00:00:00
Now processing task: mTHelZIxWX1DPv3xD0ZsHtxwqRaU_gJvCjyBxQoR9UU
1720624200.371989 Collecting 2024-06-23 00:00:00+00:00 - 2024-06-24 00:00:00+00:00
1720624200.371989 Result: code 1
{'reports': [{'task_id': 'mTHelZIxWX1DPv3xD0ZsHtxwqRaU_gJvCjyBxQoR9UU', 'slot_start': 1719100800, 'metric_type': 'vector', 'collection_time': '1720624200.371989', 'collection_duration': 0.1712839170359075, 'error': 'BATCH TOO SMALL'}], 'counts': []}

# query for a task aggregated every other day, an aggregation does not align with the passed in date
python3 main.py --date=2024-06-27T00:00:00Z
Now processing task: 5_ASrSFFGEa49tgqWtTT2srYkYVw7HR7UaOrQfyl0Tg
{'counts': [], 'reports': [{'task_id': '5_ASrSFFGEa49tgqWtTT2srYkYVw7HR7UaOrQfyl0Tg', 'slot_start': datetime.datetime(2024, 6, 27, 0, 0, tzinfo=datetime.timezone.utc), 'metric_type': 'vector', 'collection_time': '1720625795.962135', 'collection_duration': 0, 'error': '2024-06-27T00:00:00Z does not align with task time precision buckets'}]}

# query for a task aggregated every other day, requested data ending on the passed in date
python3 main.py --date=2024-06-26T00:00:00Z
Now processing task: 5_ASrSFFGEa49tgqWtTT2srYkYVw7HR7UaOrQfyl0Tg
1720625808.55165 Collecting 2024-06-24 00:00:00+00:00 - 2024-06-26 00:00:00+00:00
1720625808.55165 Result: code 1
{'reports': [{'task_id': '5_ASrSFFGEa49tgqWtTT2srYkYVw7HR7UaOrQfyl0Tg', 'slot_start': 1719187200, 'metric_type': 'vector', 'collection_time': '1720625808.55165', 'collection_duration': 0.1682301671244204, 'error': 'BATCH TOO SMALL'}], 'counts': []}

# query for a task aggregated weekly, an aggregation does not align with the passed in date
python3 main.py --date=2024-06-25T00:00:00Z
Now processing task: aW7snmf2svP5lcuaLkukUTbDBZ12RHE546mQwpZA8_0
{'counts': [], 'reports': [{'task_id': 'aW7snmf2svP5lcuaLkukUTbDBZ12RHE546mQwpZA8_0', 'slot_start': datetime.datetime(2024, 6, 25, 0, 0, tzinfo=datetime.timezone.utc), 'metric_type': 'vector', 'collection_time': '1720625671.344491', 'collection_duration': 0, 'error': '2024-06-25T00:00:00Z does not align with task time precision buckets'}]}

# query for a task aggregated weekly, requested data ending on the passed in date
python3 main.py --date=2024-06-27T00:00:00Z
Now processing task: aW7snmf2svP5lcuaLkukUTbDBZ12RHE546mQwpZA8_0
1720625682.632093 Collecting 2024-06-20 00:00:00+00:00 - 2024-06-27 00:00:00+00:00
1720625682.632093 Result: code 1
{'reports': [{'task_id': 'aW7snmf2svP5lcuaLkukUTbDBZ12RHE546mQwpZA8_0', 'slot_start': 1718841600, 'metric_type': 'vector', 'collection_time': '1720625682.632093', 'collection_duration': 0.2570650838315487, 'error': 'BATCH TOO SMALL'}], 'counts': []}
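
The behavior exercised in the runs above can be sketched roughly as follows. This is an illustration reconstructed from the logged output, not the collector's actual implementation: the function name `collection_intervals`, raising `ValueError` instead of returning an error report, and aligning multi-day buckets to the Unix epoch are all assumptions (the epoch alignment happens to match which dates aligned in the 2-day and weekly runs).

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

SECONDS_PER_DAY = 24 * 60 * 60


def collection_intervals(
    run_date: datetime, precision_seconds: int
) -> List[Tuple[datetime, datetime]]:
    """Return the (start, end) intervals to collect for a run at `run_date`.

    Hypothetical helper. Raises ValueError where the real job logs an error report.
    """
    # Assumption: a naive datetime is treated as UTC.
    if run_date.tzinfo is None:
        run_date = run_date.replace(tzinfo=timezone.utc)
    run_date = run_date.astimezone(timezone.utc)

    midnight = run_date.replace(hour=0, minute=0, second=0, microsecond=0)
    if run_date != midnight:
        raise ValueError("Collector triggered not at the start of a day")

    precision = timedelta(seconds=precision_seconds)

    if 0 < precision_seconds <= SECONDS_PER_DAY:
        # Precision of a day or less: collect every bucket of the previous day.
        if SECONDS_PER_DAY % precision_seconds != 0:
            raise ValueError("time precision does not evenly divide a day")
        day_start = run_date - timedelta(days=1)
        buckets = SECONDS_PER_DAY // precision_seconds
        return [
            (day_start + i * precision, day_start + (i + 1) * precision)
            for i in range(buckets)
        ]

    # Multi-day precision: collect only the bucket that ends on the run date.
    if precision_seconds <= 0 or precision_seconds % SECONDS_PER_DAY != 0:
        raise ValueError("time precision is not a whole multiple of a day")
    # Assumption: buckets are aligned to the Unix epoch.
    if int(run_date.timestamp()) % precision_seconds != 0:
        raise ValueError("run date does not align with task time precision buckets")
    return [(run_date - precision, run_date)]


if __name__ == "__main__":
    run = datetime(2024, 6, 27, tzinfo=timezone.utc)
    for start, end in collection_intervals(run, 7 * SECONDS_PER_DAY):
        print("Collecting", start, "-", end)
```

With a daily cadence, a task with a multi-day precision is skipped on most runs and collected only on the run whose date closes one of its buckets, which is the pattern visible in the 2-day and weekly runs above.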


dmueller commented 2 months ago

I applied the same updates to this job that were recommended on the Prod review https://github.com/mozilla/docker-etl/pull/236