h-REA / hREA

A ValueFlows / REA economic network coordination system implemented on Holochain, with accompanying JavaScript GraphQL libraries
https://docs.hrea.io

Rearchitect indexing integrity zomes to fix hidden link conflicts & storage bloat #356

Closed by pospi 1 year ago

pospi commented 1 year ago

With the changes to the HDK actioned in #340, data dependencies between controller zomes are now much more explicit than before and are managed by regular Rust module system semantics. Where previously get_links returned only links created in the local zome, it can now also return links managed by other controller zomes if they are all acting upon the same integrity zome crates.
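For illustration, here is a minimal sketch of a shared integrity zome crate, using the hdi `#[hdk_link_types]` macro (the names are invented and this is not the actual hREA code). Any coordinator zome compiled against a crate like this acts on the same scoped link types, so get_links in one of them will also return links written by the others:

```rust
// Hypothetical shared integrity zome crate (illustrative only).
use hdi::prelude::*;

// Because these link types are scoped to this integrity zome, every
// coordinator zome that depends on this crate reads and writes the
// same underlying links.
#[hdk_link_types]
pub enum LinkTypes {
    EventToProcess,
    ProcessToEvent,
}
```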

What this means for our use-case is that we ended up with duplicate links between entries in our semantic index zomes and had to move to an architecture of multiple distinct integrity zomes in order to keep the separation, per 1654968b5249cdc52e05153c37e6fdc092d94243.

However, this creates (or perhaps just preserves) a sub-optimal situation where locally-adjacent index zomes (eg. hc_zome_rea_economic_event_index_observation & hc_zome_rea_process_index_observation) are actually storing duplicates of the same data.

This is ideal & necessary for indexing that occurs in different cells, since both sides of an index need links to be created bidirectionally in order to serve "read" & "query" actions: "reads" walk from a local record ID to the referenced ones, and "queries" walk from referenced record IDs back to local ones. But in locally-adjacent indexes these are duplicate sets of links.
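As a minimal sketch of the bidirectional write (assuming the hypothetical LinkTypes above and recent HDK create_link semantics, which vary a little between releases; these helper names are not the actual hREA API):

```rust
use hdk::prelude::*;
// Hypothetical integrity crate from the earlier sketch.
use hc_zome_example_integrity::LinkTypes;

// Hypothetical index writer: the "read" direction lets an EconomicEvent
// resolve its Process, and the "query" direction lets a Process walk back
// to the events that reference it.
pub fn index_event_process(
    event_hash: EntryHash,
    process_hash: EntryHash,
) -> ExternResult<()> {
    // "read" direction: local record -> referenced record
    create_link(
        event_hash.clone(),
        process_hash.clone(),
        LinkTypes::EventToProcess,
        LinkTag::new(""),
    )?;
    // "query" direction: referenced record -> local record
    create_link(
        process_hash,
        event_hash,
        LinkTypes::ProcessToEvent,
        LinkTag::new(""),
    )?;
    Ok(())
}
```

When both sides of the index live in the same cell, each side runs an equivalent routine (currently against its own integrity zome), which is where the duplicated storage described above comes from.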

It would be best for the sake of architectural simplicity if indexing code remained the same in both situations, as it makes composing indexes into various zome/cell combinations relatively straightforward.

So, the proposed path forward in order to optimise the storage efficiency of this situation whilst preserving the consistency of all semantic index type implementations is as follows:

This should resolve all of the above, but needs to be tested. The end result will be that, for locally-adjacent indexes, an extra unnecessary API call is made that has no effect (since the other side of the index will have already created the links), but this is of minimal concern as it only affects the local node (no extra data will be replicated by the Holochain subconscious).

It's also worth noting that this foregoes a potentially useful future use-case for the semantic indexing library, whereby implementors might create multiple links between the same records in order to weight them. But it's probably fine to let this possibility go, and it won't affect hREA.

pospi commented 1 year ago

Made the above changes as an experiment and it looks like a few ordering bugs come up as a result. These tests fail afterward:

    x container resource reference B OK
    x container resource reference A OK
    x should take on the last PASS | FAIL event action as its state
    x should take on the stage of the most recent event's related output ProcessSpecification
    x Intent.satisfiedBy reference 1 OK
    x Intent.satisfiedBy reference 2 OK

There might be deeper issues at play here... it's worth investigating why these ordering checks fail while others don't. Some tests may have been relying on multiple links being created.

pospi commented 1 year ago

Expanding the scope of this issue after determining the cause and a bit of discussion.

The remaining work mentioned in issue #264 for post_commit hooks and creation-time validation should be implemented separately as a future scope of work.

pospi commented 1 year ago

Doing either of the above is not a good pattern for Holochain app validation. Nondeterministic validation (based on Action creation time in a partitioned network) leads to potential integrity issues, however minor. It is better to allow multiple index links to be created and to de-duplicate them in some way upon reading. The implementation in this PR picks the earliest-created observable link as the source of truth for the record in the given time index.
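A rough sketch of that read-side de-duplication, assuming the HDK Link struct with its timestamp and target fields (this is not the literal code from the PR):

```rust
use hdk::prelude::*;
use std::collections::HashSet;

// Treat the earliest-authored link to each target as canonical, so that
// every node converges on the same result no matter how many duplicate
// links were written concurrently.
fn deduplicate_links(mut links: Vec<Link>) -> Vec<Link> {
    links.sort_by_key(|link| link.timestamp);
    let mut seen: HashSet<AnyLinkableHash> = HashSet::new();
    links.retain(|link| seen.insert(link.target.clone()));
    links
}
```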

The future core feature needed to guard against spamming of the DHT is something called "buckets".

Via @pdaoust:

Basically the way it'll work is that you'll be able to define all sorts of buckets for public data (not sure if it's per entry/link type or completely arbitrary), and each write fills up its respective bucket just a little more. The bucket has a hole in the bottom, though, so it drains out over time. If you write too many entries for a given bucket within a given time period, it fills up and rejects the one that overflows it.

Because the source chain is a linear list of dependencies (each action depends on all the ones that came before it), this can be validated deterministically by agent activity authorities -- which means that spamming can become a warrantable offense.
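For a concrete picture of the mechanism being described, here is a purely illustrative leaky-bucket check in plain Rust (this is not a Holochain API; all names and parameters are invented):

```rust
// Invented example of leaky-bucket rate limiting: each write adds to a
// bucket that drains at a constant rate, and a write that would overflow
// the bucket is rejected.
struct Bucket {
    level: f64,      // current fill level
    capacity: f64,   // fill level at which further writes are rejected
    drain_rate: f64, // units drained per second
    last_seen: f64,  // timestamp (seconds) of the previous write
}

impl Bucket {
    fn try_write(&mut self, cost: f64, now: f64) -> bool {
        // Drain whatever has leaked out since the last write...
        let drained = (now - self.last_seen) * self.drain_rate;
        self.level = (self.level - drained).max(0.0);
        self.last_seen = now;
        // ...then check whether this write still fits.
        if self.level + cost > self.capacity {
            return false; // overflow: reject the write
        }
        self.level += cost;
        true
    }
}
```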

pospi commented 1 year ago

Was thinking about this today and another benefit should be a more-than-modest performance increase in making initial calls to a zome (especially for calls involving many zomes, such as an EconomicEvent creation with an inventoried EconomicResource). One indexing integrity zome means one WASM shared between all indexing zomes, which means faster boot time & lower memory & storage overhead.

Can anyone confirm this anecdotally?

Also adds to the appeal of combining indexing zomes across a DNA into a single zome: this would mean faster boot time, smaller overall size on disk and less memory overhead for coordinator zomes as well. (And at this point the difference is really just a matter of shifting declarative config from YAML files into the Rust macro layer.)