irods / irods_rule_engine_plugin_hard_links


alternative hard link implementation? #51

Open tsmeele opened 1 year ago

tsmeele commented 1 year ago

(just an idea, ignore if not appropriate)

The iRODS architecture appears to natively support hard links, although the mechanism hasn't been exposed. Consider the following:
In r_data_main, take a data object "foo" with replicas X and Y. Update the table row for Y and change "data_name" to "bar", without changing the data_id. Now both "foo" and "bar" can be used to refer to the same data object. An experiment has shown that the alias is preserved across an "imv" operation that moves the data object to another collection. Similarly, removing the data object by one of its names also removes the aliased replica (the behavior one would expect with hard links).
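To make the idea concrete, here is a minimal sketch using an in-memory SQLite table as a stand-in for the catalog. It is not the real schema: only a handful of r_data_main columns are modeled, the ids and paths are made up, and the "move" step assumes that a move updates coll_id on every row sharing the data_id.

```python
import sqlite3

# Toy stand-in for the catalog; the real r_data_main has many more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE r_data_main (
           data_id INTEGER, coll_id INTEGER, data_name TEXT,
           data_repl_num INTEGER, data_path TEXT)"""
)
conn.executemany(
    "INSERT INTO r_data_main VALUES (?, ?, ?, ?, ?)",
    [
        (10001, 500, "foo", 0, "/vault1/foo"),  # replica X (made-up values)
        (10001, 500, "foo", 1, "/vault2/foo"),  # replica Y (made-up values)
    ],
)

# The proposed alias: rename only replica Y, keeping the same data_id.
conn.execute(
    "UPDATE r_data_main SET data_name = 'bar' "
    "WHERE data_id = 10001 AND data_repl_num = 1"
)

def resolve(coll_id, name):
    """Return the data_ids that a (collection, name) pair resolves to."""
    rows = conn.execute(
        "SELECT DISTINCT data_id FROM r_data_main "
        "WHERE coll_id = ? AND data_name = ?",
        (coll_id, name),
    ).fetchall()
    return [r[0] for r in rows]

foo_ids = resolve(500, "foo")
bar_ids = resolve(500, "bar")

# Assumption: a move rewrites coll_id for all rows of the data_id.
conn.execute("UPDATE r_data_main SET coll_id = 600 WHERE data_id = 10001")
moved_foo_ids = resolve(600, "foo")
moved_bar_ids = resolve(600, "bar")
```

In this toy model both names resolve to the same data_id before and after the move, which matches what the experiment showed.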

This implementation would not require AVUs. What say?

trel commented 1 year ago

I think getting this to work without metadata means that many other operations don't have enough information to be successful. The main reason this repository halted is that we realized a great many (possibly every) operation in the server would have to become hard-link-aware and have a separate implementation to behave accordingly. This was too much of a burden to maintain.

As an example... trim. Trimming a replica that is a hard-link of another data object's replica means that the other replica would no longer have a physical file 'underneath'. The bookkeeping required to handle this use case mandates somewhere to write down the counts and the code itself has to know about that bookkeeping.
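A rough sketch of the kind of bookkeeping meant here, in Python. The `LinkLedger` class and its methods are hypothetical and exist only for illustration; nothing like this is part of the iRODS API. The point is that a trim may only remove the physical file once no other replica still refers to it.

```python
from collections import Counter

class LinkLedger:
    """Hypothetical reference counts per physical file path."""

    def __init__(self):
        self._refs = Counter()

    def register(self, physical_path):
        # One more replica now points at this physical file.
        self._refs[physical_path] += 1

    def trim(self, physical_path):
        """Drop one reference; return True if the file may now be deleted."""
        if self._refs[physical_path] <= 0:
            raise KeyError(f"no registered replica for {physical_path}")
        self._refs[physical_path] -= 1
        if self._refs[physical_path] == 0:
            del self._refs[physical_path]
            return True
        return False

ledger = LinkLedger()
ledger.register("/vault1/shared.dat")  # replica of data object A
ledger.register("/vault1/shared.dat")  # hard-linked replica of data object B

first_trim = ledger.trim("/vault1/shared.dat")   # B still needs the file
second_trim = ledger.trim("/vault1/shared.dat")  # last reference is gone
```

Every operation that touches a replica would have to consult and update something like this ledger, which is the maintenance burden described above.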

Of course, happy to figure out a good way to do this - but for now, I think we've exhausted our own good ideas.

tsmeele commented 1 year ago

The suggested idea concerns a single data object that is accessible via two names. The iRODS catalog architecture already supports such a setup although the feature hasn't been exposed explicitly. Having replicas of multiple data objects point to the same data file would behave more like a soft link, and is indeed risky for the reasons that you outline.

trel commented 1 year ago

We drew some pictures in 2020 about this... https://irods.org/uploads/2020/Draughn-iRODS-Hard_Links_Rule_Engine_Plugin-slides.pdf

"a single data object that is accessible via two names"

A data object (data_id) only has one name. There are also many places in the code where a particular path resolves to a particular data_id and vice versa. Take, for instance, SELECT DATA_NAME WHERE DATA_ID = '23423'. That could now return more than one row. How would you expose this 'feature' of renaming "part" of a data object? In effect, you would be renaming a "replica", and I think that breaks the model.
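The multi-row concern can be shown with a small sketch, again against a toy SQLite table rather than the real catalog. The columns are a reduced subset of r_data_main and the rows are invented; it uses plain SQL, not the GenQuery syntax quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE r_data_main "
    "(data_id INTEGER, data_name TEXT, data_repl_num INTEGER)"
)
# Two replicas of one data object, where one replica has been renamed.
conn.executemany(
    "INSERT INTO r_data_main VALUES (?, ?, ?)",
    [(23423, "foo", 0), (23423, "bar", 1)],
)

# Code that assumes "one data_id, one name" now sees two distinct names.
names = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT data_name FROM r_data_main "
        "WHERE data_id = ? ORDER BY data_name",
        (23423,),
    )
]
```

Any caller that takes the first row, or expects exactly one, would silently pick one of the two names.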

Equivalencies we were working with when designing this repository:

We've been avoiding supporting symlinks in iRODS because there are so many corner cases / cycle-detection / reference-following scenarios that are hard to reason about.

Hard links we thought we could handle, but the plugin approach proved too burdensome to maintain. I think we'd have to pull the functionality into the server itself to support it safely.

tsmeele commented 1 year ago

The current data architecture appears to be the determining factor here. In table r_data_main, each row represents one replica, and the data object is modeled as a property shared across replicas, rather than living in a table on its own. This complicates linking between data objects.

trel commented 1 year ago

Yes, this is correct.

I believe a more 'correct' abstraction in the database would be to have both a data_objects table and a replicas table - but there would be an additional join on nearly ALL queries coming into the system. I suspect the original design was to provide a middle ground / optimization for the query-based usage. I think we would make a different decision today.
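As a rough illustration of that split, here is a sketch in SQLite via Python. The table names come from the sentence above; every column name and value is an assumption made for the example, not a proposed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per logical data object: the name lives here, exactly once.
    CREATE TABLE data_objects (
        data_id   INTEGER PRIMARY KEY,
        coll_id   INTEGER NOT NULL,
        data_name TEXT    NOT NULL,
        UNIQUE (coll_id, data_name)
    );
    -- One row per replica: only physical details, no name.
    CREATE TABLE replicas (
        data_id   INTEGER NOT NULL REFERENCES data_objects(data_id),
        repl_num  INTEGER NOT NULL,
        data_path TEXT    NOT NULL,
        PRIMARY KEY (data_id, repl_num)
    );
    """
)
conn.execute("INSERT INTO data_objects VALUES (1, 500, 'foo')")
conn.executemany(
    "INSERT INTO replicas VALUES (?, ?, ?)",
    [(1, 0, "/vault1/foo"), (1, 1, "/vault2/foo")],
)

# The cost: listing replicas by logical name now always needs a join.
rows = conn.execute(
    """SELECT o.data_name, r.repl_num, r.data_path
       FROM data_objects AS o
       JOIN replicas AS r ON r.data_id = o.data_id
       WHERE o.coll_id = ? AND o.data_name = ?
       ORDER BY r.repl_num""",
    (500, "foo"),
).fetchall()
```

With this layout a replica cannot carry its own name, so the per-replica rename discussed earlier in the thread would not be expressible at all.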