practable / relay

Secure websocket relay server and clients for sharing video, data, and ssh across firewall boundaries
GNU Affero General Public License v3.0

Add experiment checking service #16

Open · timdrysdale opened this issue 2 years ago

timdrysdale commented 2 years ago

In our installations, we have > 50 active experiments, which is far too many to check manually.

An automatic checking system would be very helpful, particularly if it could:

(a) check the booking manifest to find out what to check
(b) check that the video and data connections are correctly made to the relay
(c) make a test connection to the relay to check it is alive
(d) make health checks on the experiments in some way (a common health-check command does not exist across firmware, but would be helpful if we had one)
(e) check the quality of results from experiments with regard to educational quality
(f) have customisable schedules for checking experiments, and with what degree of effort (e.g. routine video and data checks using output from the relay analytics channel every minute, daily functional checks on equipment) - see the sketch after this list
(g) add optional "demo mode" runs to exercise otherwise unused equipment - not strictly part of the remit of this feature, but it uses the same mechanics so will require similar features to be developed -> this suggests using files of commands for various tasks on experiments
(h) record status to a central location (logging: TODO where/how)
(i) display status and status history on a web page (protected by a token obtained using a URL parameter)
(j) automatically remove experiments from their current pools and place them in a "quarantine" pool, noting where they came from before they were quarantined
(k) alert support to newly quarantined experiments
(l) perhaps, routinely (say, hourly), check the functioning of experiments in the quarantine pool and return them to gen-pop in their original pool - if, and only if, the issue was a lack of video or data, such as might be caused by a power or network outage; if there is a data quality issue, assume the kit is broken and do not run commands which could break it further
(m) maintain a master list of which tests apply to which activity types
(n) automatically establish the "normal" or "expected" parameters if an experiment has no previous entry in the norms file (TODO: where/how to store?)
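It is not yet decided where the per-activity check definitions for (a), (f), (m) and (n) would live, but a minimal Go sketch of the data structures might look like the following. All type and field names are hypothetical; nothing like this exists in relay yet.

```go
// Hypothetical data structures for per-activity check definitions - a sketch only.
package check

import "time"

// Check describes a single test that can be run against an experiment.
type Check struct {
	Name     string        // e.g. "video", "data", "mechanical"
	Validity time.Duration // how long a pass can be trusted (seconds for video/data, hours for mechanical)
	Interval time.Duration // how often to schedule the check (the "degree of effort")
	Commands []string      // optional list of commands to send to the experiment (reusable for demo mode)
}

// Activity maps an activity type from the booking manifest to the checks that apply to it,
// i.e. one entry in the master list of which tests apply to which activity types.
type Activity struct {
	Type   string  // activity type as named in the booking manifest
	Checks []Check // which tests apply to this activity type
	Norms  string  // where the "normal"/"expected" parameters are kept (TODO: where/how to store)
}
```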

Cassandra or disk-backed Redis are candidates for the data store here.
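If disk-backed Redis were chosen, recording status centrally (h) and keeping a short history for the status page (i) could be a keyed SET plus a capped list. A minimal sketch, assuming the github.com/redis/go-redis/v9 client; the key layout is illustrative only.

```go
// Sketch of recording a check result to Redis; key names are illustrative only.
package check

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Result is a single check outcome for one experiment.
type Result struct {
	Experiment string
	Check      string
	Pass       bool
	When       time.Time
}

// Record stores the latest status (with a TTL equal to the check's validity)
// and appends to a capped history list for display on the status page.
func Record(ctx context.Context, rdb *redis.Client, r Result, validity time.Duration) error {
	val := r.When.Format(time.RFC3339) + " pass=" + boolString(r.Pass)

	// latest status, expiring when the result is no longer trustworthy
	if err := rdb.Set(ctx, "status:"+r.Experiment+":"+r.Check, val, validity).Err(); err != nil {
		return err
	}

	// rolling history for the status/history web page
	hist := "history:" + r.Experiment + ":" + r.Check
	if err := rdb.LPush(ctx, hist, val).Err(); err != nil {
		return err
	}
	return rdb.LTrim(ctx, hist, 0, 999).Err() // keep the most recent 1000 entries
}

func boolString(b bool) string {
	if b {
		return "pass"
	}
	return "fail"
}
```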

Tests can run in parallel, using pools of goroutines. Test-runners obtain a lock on the right to do a test by booking the equipment, and must return a result before the booking expires, or else another test-runner will eventually be allocated to run the test. Tests are 'valid' for an activity-dependent period, e.g. video and data access are valid for a number of seconds, while mechanical functioning can be considered reliable for a few hours. One way to implement demo mode on N = 12 kits would be simply to set the mechanical-functioning test validity to, say, M = 45 seconds, limit the maximum number of test-runners to four, and set the test task to take around 15 seconds; then four kits will be continually running (see the sketch below).
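A sketch of such a pool of test-runners, reusing the hypothetical Check and Result types from the sketches above; the book and run functions are placeholders for the booking-system and experiment-command clients. The booking acts as the lock on the right to test, and the context deadline enforces "return a result before the booking expires".

```go
// Sketch of a bounded pool of test-runners; all names here are hypothetical.
package check

import (
	"context"
	"time"
)

// Task names a check that is due to run against one experiment.
type Task struct {
	Experiment string
	Check      Check
}

// RunPool starts n test-runners. Each runner books the equipment, runs the check within
// the booking window, and reports the result. If a runner cannot book, or does not finish
// in time, the task is dropped here and whatever schedules tasks will re-enqueue it later.
func RunPool(ctx context.Context, n int, tasks <-chan Task, results chan<- Result,
	book func(ctx context.Context, experiment string) (expires time.Time, err error),
	run func(ctx context.Context, t Task) (bool, error)) {
	for i := 0; i < n; i++ {
		go func() {
			for t := range tasks {
				expires, err := book(ctx, t.Experiment)
				if err != nil {
					continue // could not obtain the booking (lock); try again on a later schedule
				}
				// the result must be produced before the booking expires
				runCtx, cancel := context.WithDeadline(ctx, expires)
				pass, err := run(runCtx, t)
				cancel()
				if err != nil {
					pass = false
				}
				results <- Result{Experiment: t.Experiment, Check: t.Check.Name, Pass: pass, When: time.Now()}
			}
		}()
	}
}
```

Demo mode would then fall out of the same mechanics: set the mechanical check's Validity to around 45 seconds and cap n at four.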

timdrysdale commented 2 years ago

It would also be helpful to collect, from the back end, the commands sent to the experiments, so we can understand usage patterns/engagement (i.e. quality of experience) and use that to compare different user interfaces and to monitor updates. This would not require any data from the client side, and we would not know who any of the users are; we could only infer when sessions started and ended, unless we also cross-referenced the booking system. The utility for this could be standalone for now, separate from the other features, and it should respond to SIGHUP so that it can be used with logrotate (see the sketch below).
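A minimal sketch of the SIGHUP handling for such a standalone command-logging utility, assuming it writes to a single log file that logrotate moves aside; the path and logger wiring are illustrative only.

```go
// Sketch of reopening a log file on SIGHUP so logrotate can be used.
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	const path = "/var/log/relay-commands.log" // illustrative path
	f := mustOpen(path)
	log.SetOutput(f)

	hup := make(chan os.Signal, 1)
	signal.Notify(hup, syscall.SIGHUP)
	go func() {
		for range hup {
			// logrotate has moved the old file aside; reopen so new writes go to the new file
			nf := mustOpen(path)
			log.SetOutput(nf)
			f.Close()
			f = nf
		}
	}()

	select {} // stand-in for the command-collection service itself
}

func mustOpen(path string) *os.File {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	return f
}
```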