This is the first of four PRs to move I/O operations to the DefaultIO framework.
Overview
Three parts of the update loop are handled here:
- node check (checking for the ALPENHORN_NODE file)
- free space update
- integrity check of file copies with has_file=='M'
Also, I've made a change to the way third-party I/O classes are loaded; see the following section.
I/O module loading
Following discussion with Richard, this PR modifies how third-party I/O modules are found by alpenhorn from the behaviour originally introduced in Rewrite part 5/14 (#148).
The old behaviour was simply to give the full import path in, say, StorageNode.io_class. Instead, this PR introduces a new extension type called "io-module", which extensions should use to provide alpenhorn with third-party I/O modules. The benefit is that all I/O modules are imported at start-up, rather than the first time alpenhorn encounters a particular storage object.
The value associated with the "io-modules" key returned by register_extensions is a dictionary whose keys are I/O module names and whose values are the modules themselves. The third-party modules may not have names that duplicate any of the internal modules in alpenhorn.io, nor may multiple extensions provide modules with the same name.
The module name must follow the convention used by the internal modules: if there is a node or group with io_class equal to, say, MyCustomIO, then there must be an I/O module named (i.e. with dict key) mycustomio (the lowercased version), and that I/O module must define the class MyCustomIONodeIO or MyCustomIOGroupIO as appropriate.
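As a concrete illustration, a third-party extension might register its I/O module like this. The module and class names here are invented for the example; only the "io-modules" key and the lowercased-name convention come from the description above:

```python
import types

# Stand-in for a real I/O module; in practice this would be an imported
# submodule of the extension package defining the NodeIO/GroupIO classes.
mycustomio = types.ModuleType("mycustomio")


class MyCustomIONodeIO:
    """Node I/O class matching io_class == "MyCustomIO"."""


mycustomio.MyCustomIONodeIO = MyCustomIONodeIO


def register_extensions():
    # The dict key "mycustomio" is the lowercased io_class value
    # "MyCustomIO"; alpenhorn uses it to find the module at start-up.
    return {"io-modules": {"mycustomio": mycustomio}}
```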
The prior behaviour of specifying a full import path in the io_class column is no longer supported.
A missing I/O class at runtime is not fatal: if that happens, alpenhornd will report an error in the log and act as if the node or group that requested the I/O class is not active (i.e. the affected node/group is ignored during the update loop). Because an attempt at I/O class instantiation happens every time through the loop, the error message about the missing class will be repeated each time through the loop. This is intentional, to make it clear to operators why the node/group is not being updated.
Node check
This is a relatively minor change. The function util.alpenhorn_node_check becomes the I/O method node.io.check_avail (i.e. DefaultNodeIO.check_avail) but is essentially unchanged and is called in the same place in update.update_node_active.
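A minimal sketch of such an availability check, assuming the ALPENHORN_NODE file simply names the node it lives on. The free-standing signature is a simplification for illustration; the real method is on the NodeIO class:

```python
import pathlib


def check_avail(root: str, node_name: str) -> bool:
    """Return True if root contains an ALPENHORN_NODE file naming this node."""
    node_file = pathlib.Path(root, "ALPENHORN_NODE")
    try:
        # Missing file, unreadable file, or missing root all mean unavailable.
        return node_file.read_text().strip() == node_name
    except OSError:
        return False
```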
Free space
The outer function, update.update_node_free_space, is moved to util.update_avail_gb because it will be called from other places besides the main loop in the future. The function is changed to call node.io.bytes_avail, where the actual stat occurs, to fetch the amount of free space.
This PR also adds a boolean parameter fast to update_avail_gb which is passed on to node.io.bytes_avail. The invocation of update_avail_gb from the main loop has fast=False, but the other invocations will have fast=True. This boolean will allow I/O modules for which finding the free space is slow to skip the optional fast=True invocations.
The code also permits node.io.bytes_avail to return None, meaning the free space couldn't be determined (or doesn't make sense).
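A sketch of this free-space path, under the assumptions above: update_avail_gb delegates to bytes_avail, which may return None when the answer is unavailable or when a slow backend declines a fast=True request. Names follow the description; the real signatures may differ:

```python
import shutil


class DefaultNodeIO:
    """Illustrative default I/O: free space via a cheap filesystem stat."""

    def __init__(self, root):
        self.root = root

    def bytes_avail(self, fast=False):
        # statvfs-style lookup is cheap, so `fast` is ignored here; a
        # backend where this is slow could `return None` when fast=True.
        return shutil.disk_usage(self.root).free


def update_avail_gb(io, fast=False):
    """Return free space in GiB, or None if it can't be determined."""
    avail = io.bytes_avail(fast=fast)
    return None if avail is None else avail / 2**30
```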
Integrity check
Running the integrity check is the first asynchronous task added to this rewrite (the function check_async). The asynchronous part of the check is put in alpenhorn/io/_default_asyncs.py, mostly for tidiness and to keep the size of Default.py manageable.
The integrity check is as follows:
1. Does the file exist? If not, set has_file='N' and we're done.
2. Is the file size correct? If not, set has_file='X' and we're done. This step is new in this PR and avoids having to calculate the MD5 hash in some cases.
3. Calculate the MD5 and check it against the DB to determine has_file='Y' or has_file='X'.
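The steps above can be sketched as a single function. The helper name and the (path, size, md5) inputs are illustrative, not alpenhorn's actual API; note how the size check short-circuits before the costly hash:

```python
import hashlib
import pathlib


def check_copy(path, db_size, db_md5):
    """Return the new has_file value for a copy: 'Y', 'N', or 'X'."""
    p = pathlib.Path(path)
    if not p.is_file():
        return "N"  # step 1: file missing
    if p.stat().st_size != db_size:
        return "X"  # step 2: wrong size means corrupt; no need to hash
    # Step 3: full MD5 over the file contents.
    md5 = hashlib.md5(p.read_bytes()).hexdigest()
    return "Y" if md5 == db_md5 else "X"
```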
Other changes
The unused function util.is_md5_hash is deleted.
The parameter cmd_line is removed from util.md5sum_file; it was never set to anything other than the default. Removing it also drops an unnecessary dependency on an external program.
FYI: Per the updated description, I've added to this PR a change in the way I/O loading works. (There's now an "io-module" extension that provides third-party I/O modules.)