radiocosmology / alpenhorn

Alpenhorn is a service for managing an archive of scientific data.
MIT License

feat: Edge table and transfer race condition fixes #170

Closed ketiltrout closed 11 months ago

ketiltrout commented 11 months ago

This PR handles a few potential race conditions in the way CHIME moves data around its Storage graph. There are two parts.

Change to deletion

The easier part is a change to the way alpenhorn collects files for deletion. It now skips any file copy that is needed to fulfill a copy request. The deletion itself isn't cancelled, so alpenhorn will consider deleting the copy again next time, but the deletion won't actually occur until the blocking copy request has been handled (completed or cancelled). This avoids a potential race condition in file management.
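A minimal sketch of the skip rule described above, not the code in this PR. The `CopyRequest` dataclass, its `completed`/`cancelled` flags, and the `deletable` helper are hypothetical names used only for illustration; the real logic works on alpenhorn's database tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyRequest:
    """Hypothetical stand-in for a copy request row."""
    file: str
    node_from: str
    completed: bool = False
    cancelled: bool = False


def deletable(candidates, node, requests):
    """Return the deletion candidates on `node` that can be deleted right now.

    A candidate is held back while any unhandled request (neither completed
    nor cancelled) still needs to read it from `node`. Held-back files are
    not dropped from the candidate list, so they are reconsidered next time.
    """
    blocked = {
        r.file
        for r in requests
        if r.node_from == node and not (r.completed or r.cancelled)
    }
    return [f for f in candidates if f not in blocked]
```

Here a file with a pending request out of `node` is skipped, while files whose requests are finished, or whose requests read from some other node, are still eligible.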

Edge table

The bigger change here is the addition of a table providing metadata for the edges in the Storage directed graph (edges in a directed graph are usually called "arrows"). This is the table StorageTransfer defined in storage.py.

We've talked about edge tables for a long time, and this PR implements almost nothing of what we've talked about, though there is potential to add more stuff later.

A StorageTransfer edge is defined by a node_from (source) and a group_to (destination). Any ArchiveFileCopyRequest with the same node_from and group_to is transferring data along that edge, and there's the potential to use the StorageTransfer to provide finer-grained configuration for specific routes. But, as I alluded to above, none of that is implemented here.
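To illustrate how a copy request maps onto an edge, here is a rough sketch using plain dataclasses rather than the actual models. Only the `node_from`/`group_to` pairing comes from the description above; the `Edge` and `Request` classes and the `edge_for` helper are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a StorageTransfer row."""
    node_from: str
    group_to: str


@dataclass(frozen=True)
class Request:
    """Hypothetical stand-in for an ArchiveFileCopyRequest row."""
    file: str
    node_from: str
    group_to: str


def edge_for(request, edges):
    """Return the edge this request travels along, or None if no edge is defined."""
    for edge in edges:
        if (edge.node_from, edge.group_to) == (request.node_from, request.group_to):
            return edge
    return None
```

A request with no matching edge simply has no edge metadata attached to it; nothing in the description suggests such a request is invalid.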

What is implemented are two post-file-adding actions, which I've previously mentioned in #158:

| node_from | group_to | auto-sync | auto-clean | explanation |
|---|---|---|---|---|
| gong | cedar_staging | Y | N | files appearing on gong are automatically transferred to cedar_staging |
| cedar_staging | cedar_offload | Y | N | files arriving on cedar_staging are automatically transferred to cedar_offload |
| cedar_staging | scinet_staging | Y | N | files arriving on cedar_staging are automatically transferred to scinet_staging |
| cedar_staging | scinet_hpss | N | Y | files are deleted from cedar_staging after being archived in HPSS on scinet |
| cedar_offload | cedar_nearline | Y | Y | files arriving on cedar_offload are automatically transferred to nearline, and then deleted once they're in nearline |
| scinet_staging | scinet_hpss | Y | Y | files arriving on scinet_staging are automatically transferred to HPSS, and then deleted once they're in HPSS |

The post-add actions (autosync and autoclean) are triggered whenever a file appears on a node (i.e. both via import and also pull requests).

Subtlety! The route a file takes does not matter when performing autosync and autoclean. So, e.g., in the example table above, autocleaning of cedar_staging will happen after files appear on HPSS, even though the cedar_staging->scinet_hpss edge is not used to transfer files into HPSS.
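The route-independence is easier to see in code. The following is only a sketch of the rule as described, not the implementation: the `Edge` dataclass and `post_add_actions` helper are hypothetical, and for simplicity each node is assumed to belong to a group of the same name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a StorageTransfer row."""
    node_from: str
    group_to: str
    autosync: bool
    autoclean: bool


# The example edges from the table above.
EDGES = [
    Edge("gong", "cedar_staging", True, False),
    Edge("cedar_staging", "cedar_offload", True, False),
    Edge("cedar_staging", "scinet_staging", True, False),
    Edge("cedar_staging", "scinet_hpss", False, True),
    Edge("cedar_offload", "cedar_nearline", True, True),
    Edge("scinet_staging", "scinet_hpss", True, True),
]


def post_add_actions(node, group, edges):
    """Actions to consider after a file appears on `node`, a member of `group`.

    Returns (sync_to, clean_from):
      sync_to:    groups to copy the file to (autosync edges leaving `node`)
      clean_from: nodes whose copy may now be deleted (autoclean edges
                  arriving at `group`)
    How the file got onto `node` is deliberately not an input.
    """
    sync_to = [e.group_to for e in edges if e.autosync and e.node_from == node]
    clean_from = [e.node_from for e in edges if e.autoclean and e.group_to == group]
    return sync_to, clean_from
```

With these edges, `post_add_actions("scinet_hpss", "scinet_hpss", EDGES)` names both `cedar_staging` and `scinet_staging` as autoclean sources, regardless of which of them actually fed the file into HPSS.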

alpenhorn ignores all StorageTransfer records where node_from.group == group_to (i.e. edges pointing back to their origin, what are known as "self-loops" or 1-cycles).
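As a small illustration of the self-loop rule, assuming simplified `Node` and `Edge` dataclasses in place of the real models (the `usable_edges` name is made up for this example):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a storage node and the group it belongs to."""
    name: str
    group: str


@dataclass(frozen=True)
class Edge:
    """Hypothetical stand-in for a StorageTransfer row."""
    node_from: Node
    group_to: str


def usable_edges(edges):
    """Drop self-loops: edges whose destination group contains the source node."""
    return [e for e in edges if e.node_from.group != e.group_to]
```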

Also: autoclean and autosync aren't complete replacements for cron-based alpenhorn clean/sync invocations. These actions only ever trigger once, when a file first appears on a node. They can't do a full sync/clean of a node.

Closes #158

ketiltrout commented 11 months ago

In addition to implementing Richard's suggestions, I've changed autosync to fire whenever the destination has has_file!='Y' (instead of just when has_file=='N'/is missing). I think this makes more sense.

In the case of has_file=='M', the autogenerated pull request will not be acted upon until the existing destination copy has been checked. After checking, if the resultant copy is set to has_file=='Y', alpenhorn will cancel the unnecessary pull request. On the other hand, if the file ends up being corrupt (has_file=='X'), then the autosync will go ahead and overwrite the destination file.
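The behaviour described in this comment can be summarised as a small decision function. This is a sketch of the rules as stated here, not the actual code: the function names and the `"wait"`/`"cancel"`/`"transfer"` labels are invented for illustration, and `None` is assumed to stand for a missing copy record.

```python
def should_autosync(dest_has_file):
    """Autosync fires unless the destination already has a good copy.

    `dest_has_file` is 'Y', 'N', 'M' or 'X', or None when the destination
    has no copy record at all.
    """
    return dest_has_file != "Y"


def pull_action(dest_has_file):
    """What happens to an autogenerated pull request, given the destination state."""
    if dest_has_file == "Y":
        return "cancel"    # good copy already present: the request is unnecessary
    if dest_has_file == "M":
        return "wait"      # existing copy must be checked before acting
    return "transfer"      # 'N', missing, or corrupt ('X'): copy or overwrite
```

A destination in state `'M'` therefore gets a request created immediately, but that request only resolves to `"cancel"` or `"transfer"` once the check has moved the copy to `'Y'` or `'X'`.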