m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Revise pcap parser file selection algorithm to eventually process 100% of the data #1022

Open mattmathis opened 2 years ago

mattmathis commented 2 years ago

Revise the archive file selection algorithm for the pcap parser to rotate through all of the data in 10% batches.

Consider a hash-based selection: if (HASH(filename)+epoch) % 10 == 0 { process file }, where epoch is incremented every time the pcap gardener reaches the end of the data.
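A minimal Go sketch of the selection rule above, for illustration only. The hash function (FNV-1a), the names `hashFilename`, `shouldProcess`, and `numBatches`, and the archive filenames are assumptions, not taken from the existing parser. The hash and the epoch are each reduced modulo 10 before adding, so unsigned overflow cannot break the rotation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBatches is the number of rotating batches; 10 gives the 10% batches
// proposed in this issue.
const numBatches = 10

// hashFilename returns a 32-bit FNV-1a hash of the archive filename.
// The choice of FNV-1a is an assumption; the issue does not name a hash.
func hashFilename(name string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(name))
	return h.Sum32()
}

// shouldProcess reports whether the file falls in the batch selected by epoch.
// Both terms are reduced modulo numBatches before the addition so that the
// sum cannot wrap around, which would otherwise skew the rotation.
func shouldProcess(name string, epoch uint32) bool {
	return (hashFilename(name)%numBatches+epoch%numBatches)%numBatches == 0
}

func main() {
	// Hypothetical archive names, used only to show the rotation.
	files := []string{"archive-a.tgz", "archive-b.tgz", "archive-c.tgz"}
	for epoch := uint32(0); epoch < numBatches; epoch++ {
		for _, f := range files {
			if shouldProcess(f, epoch) {
				fmt.Printf("epoch %d: process %s\n", epoch, f)
			}
		}
	}
}
```

With this rule each file is selected in exactly one of every 10 consecutive epochs, so a full rotation covers 100% of the data, provided the epoch counter persists across gardener passes.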

mlab-code-reviews commented 2 years ago

I don't think there is any particular reason we shouldn't just let this parse all the data. It should only take a few days. Then we should probably shut it off rather than reprocessing it regularly.

A more useful bug fix would be to change the processing location, so that we aren't moving data between regions. This is the biggest concern when processing 100% of the pcaps.

We could instead consider copying the table from staging.


-- Greg Russell / Measurement-Lab

mattmathis commented 2 years ago

We are now processing 10% of the pcaps every 16 days. Please update to process all current and historical files.

SELECT COUNT(DISTINCT date) AS days, MIN(parser.Time) AS OldestParse FROM `mlab-oti.ndt_raw.pcap`

Yields: 838 2022-03-06 02:31:10.345666 UTC on 2022-03-22