m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

parser: add filter.IsOAM logic to standard parsers #893

Open stephen-soltesz opened 4 years ago

stephen-soltesz commented 4 years ago

Today, we manually enumerate the OAM IPs in views. Ultimately, the parser should receive a list of OAM IPs from configuration at run time and set a standard "filter.IsOAM" column accordingly. The addresses currently enumerated are:

    "35.193.254.117", -- script-exporter VM in GCE, sandbox.
    "35.225.75.192", -- script-exporter VM in GCE, staging.
    "35.192.37.249", -- script-exporter VM in GCE, oti.
    "23.228.128.99", "2605:a601:f1ff:fffe::99", -- ks addresses.
    "45.56.98.222", "2600:3c03::f03c:91ff:fe33:819", -- eb addresses.
    "35.202.153.90", "35.188.150.110" -- Static IPs from GKE VMs for e2e tests.
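
As a rough illustration of the parser-side approach proposed above, here is a minimal Go sketch. It assumes the OAM addresses reach the parser as a list of strings from configuration. The `OAMSet` type and the `NewOAMSet` and `IsOAM` functions are hypothetical names for this example, not existing etl APIs. Addresses are parsed with the standard `net/netip` package, so differently formatted spellings of the same IPv6 address compare equal.

```go
package main

import (
	"fmt"
	"net/netip"
)

// OAMSet is a hypothetical set of OAM addresses loaded from configuration.
type OAMSet map[netip.Addr]struct{}

// NewOAMSet parses the configured addresses and rejects invalid entries.
func NewOAMSet(ips []string) (OAMSet, error) {
	set := make(OAMSet, len(ips))
	for _, s := range ips {
		addr, err := netip.ParseAddr(s)
		if err != nil {
			return nil, fmt.Errorf("invalid OAM address %q: %w", s, err)
		}
		// Unmap normalizes IPv4-mapped IPv6 addresses to plain IPv4.
		set[addr.Unmap()] = struct{}{}
	}
	return set, nil
}

// IsOAM reports whether clientIP is a configured OAM address.
// Unparseable client addresses are treated as non-OAM.
func (s OAMSet) IsOAM(clientIP string) bool {
	addr, err := netip.ParseAddr(clientIP)
	if err != nil {
		return false
	}
	_, ok := s[addr.Unmap()]
	return ok
}

func main() {
	// A subset of the addresses listed in this issue.
	set, err := NewOAMSet([]string{
		"35.193.254.117",
		"23.228.128.99", "2605:a601:f1ff:fffe::99",
	})
	if err != nil {
		panic(err)
	}
	// The value that would populate the "filter.IsOAM" column for a row.
	fmt.Println(set.IsOAM("35.193.254.117"))
	fmt.Println(set.IsOAM("8.8.8.8"))
}
```

How the list is delivered (flag, file, or environment) is left open here, since the issue does not specify it.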

mattmathis commented 2 years ago

I think this needs to be more agile than can easily be done in the parser.

stephen-soltesz commented 2 years ago

What requirement is not met by the parser configuration including these values? Please describe a scenario that cannot be met.

mattmathis commented 2 years ago

OAM addresses can change as an unexpected side effect of operational events. If IsOAM is bound in the parser, the list has to be updated before the data arrives from the fleet; otherwise OAM data leaks into BQ. If OAM labeling is done in the views, we can retroactively remove OAM data and don't have to be quite as prompt with the update.
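
The trade-off described above can be sketched in Go under simplified assumptions. `Row`, `LabelAtParse`, and `Relabel` are hypothetical names, and `192.0.2.10` is a documentation-range address standing in for an OAM client that appeared after an operational event. A label bound at parse time keeps whatever the list said when the row was written, while a late-stage relabel, analogous to a view or join, picks up the updated list.

```go
package main

import "fmt"

// Row is a hypothetical, simplified archived measurement row.
type Row struct {
	ClientIP string
	IsOAM    bool // label as of the last labeling pass
}

// LabelAtParse binds IsOAM once, using the OAM list known at parse time.
func LabelAtParse(clientIPs []string, oam map[string]bool) []Row {
	rows := make([]Row, 0, len(clientIPs))
	for _, ip := range clientIPs {
		rows = append(rows, Row{ClientIP: ip, IsOAM: oam[ip]})
	}
	return rows
}

// Relabel returns a copy of rows with IsOAM recomputed against the
// current OAM list, leaving the input rows unchanged.
func Relabel(rows []Row, oam map[string]bool) []Row {
	out := make([]Row, len(rows))
	for i, r := range rows {
		out[i] = Row{ClientIP: r.ClientIP, IsOAM: oam[r.ClientIP]}
	}
	return out
}

func main() {
	// The list known to the parser when the data arrived.
	stale := map[string]bool{"35.193.254.117": true}
	// The list after the new OAM address was discovered.
	current := map[string]bool{"35.193.254.117": true, "192.0.2.10": true}

	rows := LabelAtParse([]string{"35.193.254.117", "192.0.2.10"}, stale)
	fmt.Println("parse-time label:", rows[1].IsOAM)
	fmt.Println("late-stage label:", Relabel(rows, current)[1].IsOAM)
}
```

In this sketch the parse-time rows would need reprocessing to correct the second row, whereas the late-stage pass only needs the updated list.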

stephen-soltesz commented 2 years ago

My thought is that we do not want the views to be a receptacle for post-hoc configurations. The set of such things is unbounded. Preferably, static filter logic is managed by the parsers, and only optionally in the views as a "hot fix". View-based management is an expedient, not a preferable design (imo). Ideally, we would never archive the OAM measurements.

mattmathis commented 2 years ago

A query to detect OAM traffic found more than 60 likely candidate clients. See: OAM Client Scan 2022-08-19 - Sheets

Although a small number appear to be spurious (e.g. 192.168.0.192), the vast majority are pretty clearly legit.

I strongly advocate marking isOAM as part of a late stage materialized join.

Note that we do need to be able to do post-hoc configurations in order to properly label canary data, because we don't know whether we trust new deployments until they have collected significant data. In nearly all cases we treat canary data as valid: indeed, it matches future production data. However, in the rare cases where we roll back canaries (which we have done), we need the capability to retroactively mark the data as non-production.

stephen-soltesz commented 2 years ago

@mattmathis regarding the canaries, is there a way to automate the retroactive labeling without human intervention? Or via the same signals that we use operationally when rolling forward or backward?

stephen-soltesz commented 2 years ago

@mattmathis you performed some work for the fixit - is that enough to close this issue? Or, is there some remaining work to capture the result of your query above?