radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

Rewrite 6/14: The "import-detect" Extension and removal of Type and Info classes #149

Closed ketiltrout closed 1 year ago

ketiltrout commented 1 year ago

This PR is concerned with implementing the "file import detection" framework for the daemon. I think this is the last of the "structural" PRs. Subsequent PRs in this rewrite will deal primarily with changes to I/O code.

Motivation

The ultimate goal of this PR is to produce the infrastructure needed by the CHIME alpenhorn extensions over in alpenhorn-chime.

This PR does the following:

Removal of the Info framework

I've removed all references to info classes from alpenhorn. They were an integral part of alpenhorn-1, but in alpenhorn-2 they served two purposes:

The first of these purposes is now served by a new "import-detect" extension type, which provides a simple function that performs the detection step of the import. See the section 'The "import-detect" Extension' below.

The second of these purposes is now served by an optional post-import hook, which removes the awkwardness of requiring alpenhorn to add rows to tables it knows nothing about. See the section "The Post-Import Callback" below.

Removal of AcqType and FileType

While CHIME makes heavy use of AcqType and FileType to manage our data, in alpenhorn their use was solely to determine which Info tables were available to perform import detection. With the removal of info classes, they no longer have a use in alpenhorn. I've moved them to alpenhorn-chime where they've been re-implemented (like the ArchiveInst table was).

Removal of these two tables also means the acq_types and file_types extensions are no longer needed, and they have been removed from extensions.py, as well as the register_type_extensions call that was being made in service.py.

The "import-detect" Extension

In place of all the above is a new extension type called "import-detect". Each "import-detect" extension returns (via register_extension) a single callable object, which is the "detect" function used during file import.

(This is exactly what alpenhorn-chime is: an alpenhorn import-detect extension.)

The detect function is passed a pathlib.Path pointing to the file to import and the UpdateableNode containing the candidate data file. The function must determine if the path points to a file that alpenhorn should import. It must return a 2-tuple:

Multiple "import-detect" extensions may be loaded. In that case, the import code tries each in extension order until one of them reports a successful match.

alpenhorn will run without any "import-detect" extensions loaded, but will be unable to import files in that case. (Attempts to import files will result in an error message.)
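The try-each-extension-in-order behaviour described above can be sketched as follows. This is only an illustration, not alpenhorn's actual import code: `detect_hdf5`, `detect_text` and `run_detection` are hypothetical names, the node is a stand-in object, and the layout of the 2-tuple is an assumption made for the sketch (an acquisition name plus an optional callback, with `None` as the first element meaning "no match").

```python
import pathlib
from types import SimpleNamespace


def detect_hdf5(path, node):
    """Hypothetical detect function: claims only *.h5 files."""
    if path.suffix == ".h5":
        # ASSUMED tuple layout: (acquisition name, post-import callback)
        return path.parent.name, None
    return None, None


def detect_text(path, node):
    """Hypothetical detect function: claims only *.txt files."""
    if path.suffix == ".txt":
        return "text_files", None
    return None, None


def run_detection(detectors, path, node):
    """Try each detect function in order; the first match wins."""
    for detect in detectors:
        acq_name, callback = detect(path, node)
        if acq_name is not None:
            return acq_name, callback
    # No extension matched (or none were loaded): nothing to import
    return None, None


# Stand-in for the node object handed to detect functions
node = SimpleNamespace(name="example_node")
detectors = [detect_hdf5, detect_text]

result_h5 = run_detection(detectors, pathlib.Path("acq_2024/data.h5"), node)
result_txt = run_detection(detectors, pathlib.Path("notes/log.txt"), node)
result_none = run_detection(detectors, pathlib.Path("acq_2024/README.md"), node)
```

With an empty `detectors` list, `run_detection` always reports no match, which corresponds to the case where no "import-detect" extensions are loaded.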

The Post-Import Callback

When a callback is provided, alpenhorn will pass it the following parameters:

The ArchiveFile and ArchiveAcq of any imported file may be obtained via the ArchiveFileCopy; newly-created instances are passed to the callback so that it knows whether they are new. These two parameters could be replaced by booleans without loss of information, but I think it's more direct to do it like this.

The value returned from the callback is ignored.
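As a rough sketch of how such a callback might be written: the parameter names and order (`file_copy`, `new_file`, `new_acq`) are assumptions, as is the convention that `None` is passed when a record already existed. The dataclasses below are simplified stand-ins for the `ArchiveFileCopy`, `ArchiveFile` and `ArchiveAcq` models, not the real classes.

```python
from dataclasses import dataclass


@dataclass
class Acq:
    """Stand-in for ArchiveAcq."""
    name: str


@dataclass
class File:
    """Stand-in for ArchiveFile."""
    acq: Acq
    name: str


@dataclass
class FileCopy:
    """Stand-in for ArchiveFileCopy."""
    file: File


imported_log = []


def post_import(file_copy, new_file, new_acq):
    """Hypothetical post-import callback.

    ASSUMPTION: new_file / new_acq are None when the record already existed.
    """
    if new_acq is not None:
        imported_log.append(f"new acq: {new_acq.name}")
    if new_file is not None:
        imported_log.append(f"new file: {new_file.name}")
    # The file and acquisition are always reachable through the copy
    imported_log.append(f"imported: {file_copy.file.acq.name}/{file_copy.file.name}")
    # No return statement: the caller ignores the return value anyway


# First import: both the acquisition and the file are new
acq = Acq("acq1")
first = File(acq, "a.dat")
post_import(FileCopy(first), first, acq)

# Second import into the same acquisition: only the file is new
second = File(acq, "b.dat")
post_import(FileCopy(second), second, None)
```

An extension that keeps its own metadata tables could use the "new" branches to add rows to those tables itself, so alpenhorn never has to touch tables it knows nothing about.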

The "pattern-importer" example extension

The regex/glob-based example extension formerly found in alpenhorn/generic.py (which was an example of an "acq_types"/"file_types" extension) has been moved to examples/pattern-importer.py and updated to be an "import-detect" extension.

It's not used yet, but this example extension will also eventually be used in the end-to-end test in tests/test_service.py.
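To give a flavour of what a regex/glob-based detect function can look like, here is a minimal sketch. It is not the contents of examples/pattern-importer.py: the patterns, the assumption that the path is relative with the acquisition directory as its parent, and the `(acq_name, callback)` tuple layout are all hypothetical choices made for illustration.

```python
import fnmatch
import pathlib
import re

# Hypothetical patterns: a glob for the acquisition directory name and
# a regex for the file name inside it.
ACQ_GLOB = "*_raw"
FILE_REGEX = re.compile(r"\d{8}\.dat")


def detect(path, node):
    """Hypothetical pattern-based detect function.

    ASSUMPTIONS: `path` is relative, its parent directory is the
    acquisition, and the result is (acq_name, callback) or (None, None).
    """
    acq_name = path.parent.as_posix()
    # fnmatchcase keeps matching case-sensitive on every platform
    if not fnmatch.fnmatchcase(acq_name, ACQ_GLOB):
        return None, None
    # fullmatch requires the whole file name to fit the pattern
    if FILE_REGEX.fullmatch(path.name) is None:
        return None, None
    return acq_name, None


match = detect(pathlib.Path("20240101_raw/00000001.dat"), None)
wrong_file = detect(pathlib.Path("20240101_raw/notes.txt"), None)
wrong_acq = detect(pathlib.Path("other/00000001.dat"), None)
```

Only the first path satisfies both patterns; the other two are rejected, one on the file name and one on the acquisition directory.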

Changes to auto_import

The changes here are somewhat performative: they are the code changes needed to use the new info class system, but the auto_import code doesn't really work yet within the new framework. A subsequent PR in this series will transition the import code to use the new task queue. As part of that, this code will get fixed. Despite that, it's good to make this change here to show how the changes to info classes affect the calling code.

ketiltrout commented 1 year ago

This is a major rewrite of this PR, which I think makes things better by making alpenhorn simpler, through paring away weird CHIME-isms which don't contribute to alpenhorn's goals.

Instead of the complex Info-table framework and the model-extension stuff I had written, now I've just dropped Info tables completely from alpenhorn because, like ArchiveInst, alpenhorn doesn't care about them and shouldn't be the one dealing with them.

And then, by dropping info tables, we can also drop AcqType and FileType, because all they were doing for alpenhorn was telling it which info classes to interrogate for which imported files, so without info tables, they have nothing to do. It's better to let third parties decide how to organize their data rather than arbitrarily forcing something on them for which we have no need.

(For CHIME, all of these dropped things are moved/reimplemented in alpenhorn-chime).

Overall I think it's cleaner, more intuitive, and more flexible without losing anything CHIME needs.