radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Rewrite 14/14: Idle updates #157

ketiltrout closed 1 year ago

ketiltrout commented 1 year ago

This is the last PR of the rewrite. It adds two tasks which are performed only when a node is idle (has no I/O tasks in the queue) during an update loop: an HSM state check and an automatic re-verification of files.

The QueryWalker

This PR introduces a QueryWalker in querywalker.py which facilitates both of these tasks: it takes a table name (model) and, optionally, a set of condition expressions, and walks through the matched rows each time get() is called. The query starts at a random place in the results. This is done so that short runs of alpenhorn won't always end up running over the same set of records.
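
To illustrate the idea (this is not the code in querywalker.py), here is a minimal sketch of a walker with a random starting point. It uses the standard-library sqlite3 module instead of alpenhorn's models; the class name RowWalker, its constructor arguments and the choice to snapshot the matched rows at construction time are all assumptions made for the example.

```python
import random
import sqlite3


class RowWalker:
    """Hypothetical stand-in for the walker described above.

    Walks cyclically through the rows of `table` matching `where`,
    starting at a random position.
    """

    def __init__(self, conn, table, where="1", params=()):
        # `table` and `where` are interpolated into the SQL text, so they
        # must come from trusted code; values go through `params`.
        # Assumption: the matched rows are snapshotted once, here.
        self._rows = conn.execute(
            f"SELECT * FROM {table} WHERE {where} ORDER BY id", params
        ).fetchall()
        if not self._rows:
            raise LookupError("no rows match the query")
        # Random start, so short runs don't always see the same records.
        self._pos = random.randrange(len(self._rows))

    def get(self, n=1):
        """Return the next `n` rows, wrapping around at the end."""
        out = []
        for _ in range(n):
            out.append(self._rows[self._pos])
            self._pos = (self._pos + 1) % len(self._rows)
        return out
```

Each call to get() continues from where the previous call stopped, so repeated small batches eventually cover every matched row before any row is seen a second time.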

Config

For the HSM state check, the number of items to do at a time is given in the nearline I/O config.

For the re-verify check, this PR adds a new field to the StorageNode model: StorageNode.auto_verify, which is the number of files to auto-verify per idle update loop. If this field is zero (the default), then auto-verification is not done.
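
The rule described above can be sketched roughly as follows. This is only an illustration under assumptions, not alpenhorn's update loop: the function idle_update and its parameters queue, next_files and verify are hypothetical names, and only the meaning of auto_verify (zero disables, otherwise a per-loop file count) comes from this PR.

```python
from collections import deque


def idle_update(queue, auto_verify, next_files, verify):
    """Auto-verify up to `auto_verify` files, but only on an idle node.

    queue       -- pending I/O tasks (hypothetical); non-empty means busy
    auto_verify -- files to verify per idle update loop; 0 disables
    next_files  -- callable(n) returning the next n files to check
    verify      -- callable(file) which performs the verification

    Returns the number of files submitted for verification.
    """
    if queue:
        # I/O tasks are pending, so the node is not idle.
        return 0
    if auto_verify <= 0:
        # Zero is the default and means no auto-verification.
        return 0
    files = next_files(auto_verify)
    for f in files:
        verify(f)
    return len(files)
```

In this sketch next_files could be the get() method of a walker over the node's files, so that successive idle loops work through the whole set a few files at a time.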

Note

Originally, this was just the HSM update thing, but then I realised the query-walker could also be used to implement this auto-re-verification, which I've been thinking about doing for a while, even though it's not really relevant to the rest of this rewrite.

Issues closed by the rewrite as a whole

Closes #75 Closes #93 Closes #136 Closes #140

ketiltrout commented 1 year ago

Updated to prevent over-verification