lwa-project / ovro_data_recorder

Data recorder software for the OVRO-LWA.
http://www.tauceti.caltech.edu/LWA/
BSD 3-Clause "New" or "Revised" License

use SNAP2 status to demo health monitor for bad ant-pols #14

Closed: caseyjlaw closed this issue 1 month ago

caseyjlaw commented 2 years ago

Feature request

Intro

We need to include a priori information from subsystems (especially the f-engine) that can identify bad data. Currently we rely on post-processing to identify bad data (e.g., a calibration solution failing for a bad ant-pol). This is unreliable and biases the analysis of the good data.

Feature

Set up a daemon that polls all SNAPs with the f-engine Python client. Use the equivalent of print_status_all(ignore_ok=True) to get a summary of bad inputs. Map those bad inputs to ant-pols and save the time and state of all inputs. The saved information should be parsable into a flag table in a CASA MS. The real-time flagger (in development) could read from etcd to get an instantaneous set of bad inputs to flag. A sketch of the polling loop follows.
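A minimal sketch of such a daemon, in Python. The get_bad_inputs, input_to_antpol, and publish helpers are hypothetical stand-ins (the real f-engine client and mapping calls are not pinned down here); the 64-input count matches the rms00..rms63 stats in the dump below.

```python
import json
import time

POLL_INTERVAL = 60  # seconds; arbitrary choice for this sketch

def poll_once(fengines, get_bad_inputs, input_to_antpol):
    """Build {ant-pol: is_good} across all SNAP boards.

    `fengines` are f-engine client objects like `lwa_feng` below;
    `get_bad_inputs(feng)` and `input_to_antpol(feng, inp)` are
    injected helpers standing in for the real client and mapping calls.
    """
    health = {}
    for feng in fengines:
        bad = set(get_bad_inputs(feng))
        for inp in range(64):  # 64 inputs per board (rms00..rms63 below)
            health[input_to_antpol(feng, inp)] = inp not in bad
    return health

def run(fengines, get_bad_inputs, input_to_antpol, publish):
    """Poll forever, timestamping and publishing each snapshot."""
    while True:
        snapshot = {'time': time.time(),
                    'health': poll_once(fengines, get_bad_inputs, input_to_antpol)}
        publish(json.dumps(snapshot))  # e.g., write to etcd; see below
        time.sleep(POLL_INTERVAL)
```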

One possibility is to create a new etcd key that gets ingested into influx. E.g., /mon/health may hold a dict with keys "LWA-nnna" and boolean values, where nnna is the ant-pol. The time history of each key would then be available via an influx query (i.e., from a Python client or in grafana). A sketch of the etcd side is below.
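A hedged sketch of the etcd side using the etcd3 Python client. Only the /mon/health key name comes from this issue; the host, port, record layout, and the "LWA-002A"-style ant-pol names are assumptions.

```python
import json
import time

import etcd3

# Host/port are assumptions; only the /mon/health key name is from this issue.
client = etcd3.client(host='localhost', port=2379)

# False marks a bad ant-pol; the "LWA-nnna" naming follows the proposal above.
health = {'LWA-002A': False, 'LWA-003B': False}
client.put('/mon/health', json.dumps({'time': time.time(), 'flags': health}))

# The real-time flagger could subscribe to updates instead of polling:
events, cancel = client.watch('/mon/health')
```

The influx ingester would then watch this key and record one point per ant-pol per poll, which is what makes the grafana time-history query possible.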

Other use cases should be considered. Please add ideas to this issue!

Example of on-SNAP logic

>>> lwa_feng.print_status_all(ignore_ok=True)
Block fpga stats:
serial: None
vccbram: 0.90234375
vccint: 0.90234375
Block adc stats:
Block sync stats:
Block noise stats:
Block input stats:
mean00: -2.4887237548828125
mean03: -4.364715576171875
mean09: 2.926666259765625
mean12: -2.0069427490234375
mean15: -2.5945587158203125
mean19: -2.9351959228515625
mean23: -4.0443267822265625
mean24: -2.985595703125
mean26: 4.0575714111328125
mean31: -2.9574432373046875
mean32: -3.5904083251953125
mean34: -2.0127105712890625
mean35: -2.95806884765625
mean36: -3.05712890625
mean39: -2.752716064453125
mean40: -2.2051239013671875
mean41: 3.25775146484375
mean42: -2.5319671630859375
mean48: -3.2537384033203125
mean49: -3.0987091064453125
mean51: -3.1657257080078125
mean52: -2.99749755859375
mean54: -3.306396484375
mean56: -2.016448974609375
rms00: 2.539978801008054
rms01: 1.892431987714486
rms02: 57.44100194306879
rms03: 60.96358265919525
rms04: 1.123948347432084
rms05: 1.5414747513922138
rms06: 0.3455430861905104
rms07: 1.0452485616656273
rms08: 1.3102776013418398
rms09: 2.9447322041072344
rms10: 0.555429741070484
rms11: 1.0267123178348208
rms12: 2.0172859859284795
rms13: 1.7469805144788393
rms14: 1.2152770027279274
rms15: 2.6425283220085243
rms16: 0.49776723154188746
rms17: 0.9447716040340451
rms18: 1.4075352199884068
rms19: 2.9480854610648906
rms20: 0.9477386866452495
rms21: 0.20915771140980555
rms22: 0.23079955073003824
rms23: 4.0505439298851815
rms24: 3.000513668589201
rms25: 0.790260538525112
rms26: 4.0754320161381745
rms27: 1.0047266832366961
rms28: 1.0274254363912498
rms29: 1.7525441885477524
rms30: 1.6022578520978996
rms31: 2.968197419790598
rms32: 3.624863194482095
rms33: 0.9589823860722756
rms34: 2.025069473393521
rms35: 2.9683285062630365
rms36: 3.0691592936065164
rms37: 1.9088057598171002
rms38: 0.9668185761453322
rms39: 2.7873769416502947
rms40: 2.2475199276525477
rms41: 3.2890253857763474
rms42: 2.5809909496599484
rms43: 1.0373870420436144
rms44: 1.0262812352509059
rms45: 1.4104105120495947
rms46: 0.3980735163434796
rms47: 1.0792142369060116
rms48: 3.284478043846208
rms49: 3.122936330321543
rms50: 0.8905478996232811
rms51: 3.188890340925086
rms52: 3.0107128873148548
rms53: 0.9945300225836887
rms54: 3.3397957781642504
rms55: 1.2252680271194747
rms56: 2.031392723351242
rms57: 1.0576996916399817
rms58: 1.06982054159141
rms59: 0.9938240071695604
rms60: 1.9161221380307578
rms61: 1.7629828474797515
rms62: 0.561943626753109
rms63: 1.0417757104383722
Block delay stats:
Block pfb stats:
overflow_count: 3006321255
Block eq stats:
Block eqtvg stats:
Block reorder stats:
Block packetizer stats:
Block eth stats:
tx_full: 17
Block autocorr stats:
Block corr stats:
Block powermon stats:
vcc_int_0v95_current: 41.7
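For illustration, the input-block RMS values above already separate healthy from suspect inputs: most sit between roughly 0.5 and 4, while rms02/rms03 are near 60 and rms06/rms21/rms22/rms46 are near zero. A minimal thresholding sketch (the cutoffs are illustrative, not calibrated):

```python
# Illustrative thresholds; real cutoffs would need tuning on known-good data.
RMS_LOW, RMS_HIGH = 0.4, 10.0

def flag_bad_inputs(input_stats):
    """Given the input-block stats as a dict (e.g., {'rms02': 57.44, ...}),
    return the input numbers whose RMS falls outside [RMS_LOW, RMS_HIGH]."""
    bad = set()
    for key, value in input_stats.items():
        if key.startswith('rms'):
            if not (RMS_LOW <= value <= RMS_HIGH):
                bad.add(int(key[3:]))
    return sorted(bad)

# Against the dump above, this flags inputs 2 and 3 (RMS ~57-61, likely
# saturated or picking up strong interference) and 6, 21, 22, and 46
# (RMS well below 0.4, likely dead or disconnected).
```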
jaycedowell commented 2 years ago

It's not clear to me how this relates to the data recorders.

caseyjlaw commented 2 years ago

After thinking about it, I think it doesn't. But do you think the data recorders could provide input to such a service? That is, are there monitor points or clients that we can use to assess whether bad data is being written? I think not, since it either writes or it doesn't. Is that fair?

jaycedowell commented 2 years ago

Sure, there are the statistics/* and diagnostics/* monitoring points, currently being written, that might be useful. Maybe the recorder-provided values make the most sense for the beamformer outputs, where you cannot go back and redo the beamforming if there are problems.

caseyjlaw commented 2 years ago

Further discussion on a different issue: https://github.com/ovro-lwa/lwa-issues/issues/106.