radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Rewrite 7/14: DefaultIO housekeeping + integrity check #150

Closed: ketiltrout closed this 1 year ago

ketiltrout commented 1 year ago

This is the first of four PRs to move I/O operations to the DefaultIO framework.

Overview

Three parts of the update loop are handled here: the node check, the free space update, and the integrity check. Each has its own section below.

Also, I've made a change to the way third-party I/O classes are loaded; see the following section.

I/O module loading

Following discussion with Richard, this PR changes how alpenhorn finds third-party I/O modules, replacing the behaviour originally introduced in Rewrite part 5/14 (#148).

The old behaviour was to give the full import path in, say, StorageNode.io_class. Instead, this PR introduces a new extension type called "io-module", which extensions should use to provide alpenhorn with third-party I/O modules. The benefit is that all I/O modules are imported at start-up, rather than the first time alpenhorn sees a particular storage object.

The value associated with the "io-modules" key returned by register_extensions is a dictionary whose keys are I/O module names and whose values are the modules themselves. The third-party modules may not have names that duplicate any of the internal modules in alpenhorn.io, nor may multiple extensions provide modules with the same name.
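As a rough illustration of the two uniqueness rules, here is a minimal sketch of how the "io-modules" dictionaries from several extensions might be merged. The function name `collect_io_modules`, the `INTERNAL_IO_MODULES` set, and the use of `ValueError` are all assumptions made for the example; they are not taken from alpenhorn.

```python
import types

# Hypothetical stand-in for the names of the internal modules in alpenhorn.io.
INTERNAL_IO_MODULES = {"default"}


def collect_io_modules(extensions):
    """Merge the "io-modules" dicts of several extensions into one registry.

    `extensions` is a list of dicts shaped like the value returned by an
    extension's registration function. Raises ValueError on a name clash.
    """
    registry = {}
    for ext in extensions:
        for name, module in ext.get("io-modules", {}).items():
            if name in INTERNAL_IO_MODULES:
                raise ValueError(f"I/O module {name!r} duplicates an internal module")
            if name in registry:
                raise ValueError(f"I/O module {name!r} provided by more than one extension")
            registry[name] = module
    return registry


# A stand-in third-party module, built in memory for the example.
mycustomio = types.ModuleType("mycustomio")
registry = collect_io_modules([{"io-modules": {"mycustomio": mycustomio}}])
```

Because the merge happens once at start-up, a clash like this can be reported before any storage object is touched.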

The module name must follow the convention that the internal modules use: if there is a node or group with io_class equal to, say, MyCustomIO, then there must be an I/O module whose name (i.e. its dict key) is mycustomio (the lowercased version of io_class), and that I/O module must define the class MyCustomIONodeIO or MyCustomIOGroupIO, as appropriate.
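The naming convention can be sketched as a small lookup. The helper `find_io_class` and its `kind` parameter are hypothetical names used only for illustration; the registry is a plain dict of module name to module, as described above.

```python
import types


def find_io_class(io_class, kind, registry):
    """Return the I/O class for `io_class`, or None if it cannot be found.

    `kind` is "Node" or "Group". The module is looked up under the
    lowercased io_class; the class name is io_class + kind + "IO".
    """
    module = registry.get(io_class.lower())
    if module is None:
        return None
    return getattr(module, f"{io_class}{kind}IO", None)


class MyCustomIONodeIO:
    """Placeholder node I/O class for the example."""


# Build a stand-in module that defines only the node class.
mycustomio = types.ModuleType("mycustomio")
mycustomio.MyCustomIONodeIO = MyCustomIONodeIO
registry = {"mycustomio": mycustomio}

node_cls = find_io_class("MyCustomIO", "Node", registry)
group_cls = find_io_class("MyCustomIO", "Group", registry)
```

Here `node_cls` is `MyCustomIONodeIO`, while `group_cls` is `None` because the stand-in module defines no `MyCustomIOGroupIO`.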

A full import path in the io_class column is no longer permitted.

A missing I/O class at runtime is not fatal: if that happens, alpenhornd will report an error in the log and act as if the node or group that requested the I/O class is not active (i.e. the affected node/group is ignored during the update loop). Because an attempt at I/O class instantiation happens every time through the loop, the error message about the missing class will be repeated each time through the loop. This is intentional, to make it clear to operators why the node/group is not being updated.
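A minimal sketch of that non-fatal handling follows, assuming a plain dict of node name to io_class and a dict of available I/O classes. The function `usable_nodes`, the logger name, and the message text are assumptions for the example and do not reflect alpenhornd's actual code.

```python
import logging

log = logging.getLogger("alpenhornd.sketch")


def usable_nodes(nodes, io_classes):
    """Return (name, io_instance) pairs for nodes whose I/O class exists.

    `nodes` maps node name -> io_class string; `io_classes` maps io_class
    string -> class. Both are hypothetical stand-ins for database rows and
    the loaded I/O modules.
    """
    usable = []
    for name, io_class in nodes.items():
        cls = io_classes.get(io_class)
        if cls is None:
            # Not fatal: report it and treat the node as inactive this pass.
            log.error("No I/O class %r for node %r; ignoring node", io_class, name)
            continue
        usable.append((name, cls()))
    return usable


class DefaultNodeIO:
    """Placeholder I/O class for the example."""


def run_passes(count):
    """Run `count` passes; instantiation is retried on every pass."""
    nodes = {"good": "Default", "broken": "MissingIO"}
    return [usable_nodes(nodes, {"Default": DefaultNodeIO}) for _ in range(count)]
```

Calling `run_passes(2)` logs the error for the `broken` node twice, once per pass, which mirrors the repeated message described above.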

Node check

This is a relatively minor change. The function util.alpenhorn_node_check becomes the I/O method node.io.check_avail (i.e. DefaultNodeIO.check_avail) but is essentially unchanged and is called in the same place in update.update_node_active.

Free space

The outer function, update.update_node_free_space, is moved to util.update_avail_gb because it will be called from places other than the main loop in the future. The function now calls node.io.bytes_avail, which performs the actual stat to fetch the amount of free space.

This PR also adds a boolean parameter fast to update_avail_gb which is passed on to node.io.bytes_avail. The invocation of update_avail_gb from the main loop has fast=False, but the other invocations will have fast=True. This boolean will allow I/O modules for which finding the free space is slow to skip the optional fast=True invocations.

The code also permits node.io.bytes_avail to return None, meaning the free space couldn't be determined (or doesn't make sense).
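The interplay between fast and a None result can be sketched as follows. Everything here other than the names update_avail_gb, bytes_avail, and fast is an assumption: the `FixedSpaceIO` and `Node` classes are stand-ins, the conversion uses 2**30 bytes per GB, and signalling a skipped fast call by returning None is a guess, since the PR does not say how a skip is reported.

```python
from dataclasses import dataclass
from typing import Optional


class FixedSpaceIO:
    """Stand-in I/O object reporting a fixed amount of free space."""

    def __init__(self, free_bytes, slow=False):
        self.free_bytes = free_bytes
        self.slow = slow  # True if finding the free space is expensive

    def bytes_avail(self, fast=False):
        # Assumption: a slow backend skips an optional fast call by
        # returning None rather than doing the expensive lookup.
        if fast and self.slow:
            return None
        return self.free_bytes


@dataclass
class Node:
    name: str
    io: FixedSpaceIO
    avail_gb: Optional[float] = None


def update_avail_gb(node, fast=False):
    """Record the node's free space in GB, or None if it is unknown."""
    nbytes = node.io.bytes_avail(fast=fast)
    node.avail_gb = None if nbytes is None else nbytes / 2**30
    return node.avail_gb


quick = Node("quick", FixedSpaceIO(3 * 2**30))
slow = Node("slow", FixedSpaceIO(5 * 2**30, slow=True))
```

With these stand-ins, `update_avail_gb(slow, fast=False)` does the full lookup from the main loop, while `update_avail_gb(slow, fast=True)` records None.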

Integrity check

Running the integrity check is the first asynchronous task added to this rewrite (the function check_async). The asynchronous part of the check is put in alpenhorn/io/_default_asyncs.py, mostly for tidiness and to keep the size of Default.py manageable.

The integrity check is as follows:

Other changes

ketiltrout commented 1 year ago

FYI: Per the updated description, I've added to this PR a change in the way I/O loading works. (There's now an "io-module" extension that provides third-party I/O modules.)