holochain-devcamp / learning-pathways

LeaP - a holochain collaborative and learning system

Entry updates refactoring #4

Open e-nastasia opened 4 years ago

e-nastasia commented 4 years ago

(thanks to @guillemcordoba for these explanations)

How it should be

In holochain, all activities should be idempotent: the order of global events should not affect the final state of the entries. There are a couple of exceptions to that, but we can leave them for later.

This means that, for example, when we create entry A, remove entry A, and create entry A, the entry remains removed, since the remove_entry command has priority over the create ones.

What we have now

In LeaP we're updating the same entry multiple times with newer contents. But if you think about it, this can’t work, since two updates to the same entry are not idempotent (the global order of events matters A LOT, in this case which update came first). So this is not a behavior holochain should allow. This means that any entry can only be updated once. If we want to do another update, we need to apply it to the address of the latest version. For example, if entry A is updated to entry B, I can no longer update to C using A’s address, but I have to update to C using B’s address.
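The "update only once" rule above can be modelled in a few lines. This is a toy in-memory sketch in Python, not the Holochain HDK: `ToyStore`, `address_of` and `replaced_by` are hypothetical names, and the sketch only models the rule as stated in this issue, not necessarily what `update_entry` does internally.

```python
import hashlib


def address_of(content: str) -> str:
    """Toy content address: a short hash of the entry content."""
    return hashlib.sha256(content.encode()).hexdigest()[:8]


class ToyStore:
    """In-memory model (an assumption, not the HDK): each address may be replaced at most once."""

    def __init__(self):
        self.entries = {}      # address -> content
        self.replaced_by = {}  # old address -> new address

    def create(self, content: str) -> str:
        addr = address_of(content)
        self.entries[addr] = content
        return addr

    def update(self, old_addr: str, new_content: str) -> str:
        # A second update against the same address is rejected.
        if old_addr in self.replaced_by:
            raise ValueError(f"{old_addr} was already updated; update its successor instead")
        new_addr = self.create(new_content)
        self.replaced_by[old_addr] = new_addr
        return new_addr


store = ToyStore()
a = store.create("course v1")
b = store.update(a, "course v2")  # A -> B
c = store.update(b, "course v3")  # B -> C, using B's address

try:
    store.update(a, "course v3 again")  # a second update through A's address
    second_update_on_a_allowed = True
except ValueError:
    second_update_on_a_allowed = False
```

In this model the third version has to be written through B's address; going through A's address again is refused.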

Solution

We split the course entry into two:

  1. Anchor course entry: does not change, only contains information needed to make this course unique.
  2. Changing info entry: with the list of modules and the title, this is the entry that gets updated every time.

And every time we create or update a course, we don’t update the anchor, but only the changing info entry. And then we add a link from the unchanging anchor to the latest version of the changing one. This way, we can always query the anchor entry with get_links and the first one, which will be the last one added, will point to the latest version of the changing entry.
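A minimal sketch of the anchor / changing-info split, again as a toy in-memory model in Python rather than real HDK code. `ToyDht`, `create_course`, `save_course_content`, `latest_course` and the `teacher` / `created_at` fields are hypothetical. The model also assumes that `get_links` returns the most recently added link first, as described above.

```python
import hashlib
import json
from collections import defaultdict


class ToyDht:
    """In-memory stand-in for entries and links (an assumption, not the HDK)."""

    def __init__(self):
        self.entries = {}               # address -> entry content
        self.links = defaultdict(list)  # base address -> target addresses, newest first

    def commit(self, content: dict) -> str:
        raw = json.dumps(content, sort_keys=True).encode()
        addr = hashlib.sha256(raw).hexdigest()[:8]
        self.entries[addr] = content
        return addr

    def link(self, base: str, target: str) -> None:
        self.links[base].insert(0, target)  # assumed ordering: newest link first

    def get_links(self, base: str) -> list:
        return list(self.links[base])


def save_course_content(dht, anchor, title, modules):
    """Commit a new version of the changing info and link it from the anchor."""
    content = dht.commit({"type": "course_content", "anchor": anchor,
                          "title": title, "modules": modules})
    dht.link(anchor, content)
    return content


def create_course(dht, teacher, created_at, title, modules):
    """The anchor holds only what makes the course unique and is never updated."""
    anchor = dht.commit({"type": "course_anchor", "teacher": teacher,
                         "created_at": created_at})
    save_course_content(dht, anchor, title, modules)
    return anchor


def latest_course(dht, anchor):
    """One get_links on the anchor is enough to reach the latest content."""
    return dht.entries[dht.get_links(anchor)[0]]


dht = ToyDht()
anchor = create_course(dht, "alice", 1, "Intro", [])
save_course_content(dht, anchor, "Intro to Holochain", ["module-1"])
latest = latest_course(dht, anchor)
```

The anchor address stays stable across edits, so anything that needs to point at "the course" can point at the anchor.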

e-nastasia commented 4 years ago

To clarify, I would assign this issue to myself, but it seems like I don't have enough access for that.

sevenshadow commented 4 years ago

Thanks for this insight. This feels like a pattern that will always need to be repeated with any entry. There will need to be this "primary key" entry that doesn't change, and then the data entry that can change. The primary key entry is always the point of access.

guillemcordoba commented 4 years ago

Thanks for bringing this forward!

Yes, I will talk about this pattern in more depth in our next call, since it's going to be used a lot.

pospi commented 4 years ago

@guillemcordoba I used it, not sure I'd recommend it. See https://github.com/holo-rea/holo-rea/issues/60

guillemcordoba commented 4 years ago

@pospi interesting! What didn't work for you with this pattern? And what alternatives are you following then? I'm intrigued, but didn't get a lot of clarity from the linked issue.

pospi commented 4 years ago

It is mainly a reflection on these "principles of distributed data architectures": https://infocentral.org/drafts/PrinciplesDraft.html

Having a static identifier for a changing piece of data sort of goes against the grain of distributed systems; it's probably more idiomatic to use different IDs for every version of a record. Then it's always clear which version you are dealing with, whereas if you use a static ID it makes conflict resolution much more complex.

I think a revised pattern where you don't actually store such a "primary key" entry but instead return the latest version address when an old ID is used to request a record would be a more conflict-tolerant structure. And it also allows the UI to easily determine whether a record has been updated since it was last queried.

guillemcordoba commented 4 years ago

Hum I see... But then with the holochain primitives you're basically stuck referencing the first version of the entry and having to go through all the updated versions until you find the newest one? And also, where would you attach links from / to that entity?

Using this anchor / content pattern, we are trying to do both: the changing data always has a different hash, but it's very easily queried and all links to that entity are always attached to the anchor and thus don't have to be changed when changing the content.

pdaoust commented 4 years ago

@f00bar42 I like the depth of this conversation; it's great to see you and others digging deeply into the consequences of this and that aspect of consistency.

If I'm not mistaken, update_entry will automatically follow the update chain to the newest version of an entry that's being updated, so in the ideal case you should never have two history branches.

But in the non-ideal cases, the exact problem you describe is possible with the current design of Holochain. The edge cases I'm thinking of are:

Basically, any place where my view at commit-time is inconsistent with others' views. Which must always be considered as a possibility.

In all three cases, Holochain's 'update chain' design has the chance to create conflicting (branched) chains. The strategy for resolving this in an eventually consistent system is to let the conflicts stand as they are, and agree on a consistent way of interpreting the data in a way that resolves the conflict from your own perspective. Holochain considers something 'true' when all conflicting data is replicated to R DHT holders so they can resolve the conflict per their rules. (I think Holochain's default rule for CRUD is "delete always wins, otherwise the oldest update wins" -- or maybe it's the newest update?)

The proposed pattern ('primary key' anchor + most recent link) has similar risks, but they show up in different ways and can be resolved in DNA or UI code in the way that works best for the app. (That is, rather than the default rules. And FWIU Holochain will eventually offer a way to build app-specific resolution strategies right into the DNA -- something like a 'resolve CRUD conflict' callback function per entry type; I don't know exactly :) )

@pospi while I respect most of what the InfoCentral folks are saying, the part that makes me anxious is that at a certain point you probably want human-usable anchors as persistent identifiers for an entity. Most of the time you won't need this, because entities are found by virtue of their relationships to other entities that a participant already knows about.

And to directly address InfoCentral's article in light of this conversation, I agree that in many apps it's more important to talk about the entity 'as at x point in time' (e.g., agreements that change). But in other apps it's the entity itself that's more interesting than its revision history, so it makes sense to aggregate links on some concept of 'entity' rather than 'entity at x point'. Perhaps LeaP is one of those cases. Anyhow, what I'm saying is that it probably depends on the context?

guillemcordoba commented 4 years ago

Thanks for chiming in @pdaoust , I think that the defaults for resolving conflict have not even been thought of yet, since there are fundamental problems with any approach you take (there are extreme cases like network partitions).

With this pattern, we are trying to avoid this loop: getting the last version of an entry is expensive, in that its cost scales with the number of updates. We expect a course to have multiple possible updates, and this would mean a lot of performance loss. I also much prefer the pattern of attaching links to the anchor, because then we don't have to visit the original entry at all (it is equally expensive to go from the latest version of an entry to the first one).

BTW, I'm a little troubled by the current implementation of update_entry... Mainly for these reasons:

The easiest solution I can think of is basically not getting the latest version of the entry when updating it. This may sound counter-intuitive but I'd much rather leave this to the app-level, since for example using this pattern of anchor->content I'm able to retrieve the last version of the entry with only a get_links, rather than going through the full chain of updates.
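To make the cost difference concrete, here is a small Python sketch under simplifying assumptions: every hop along the update chain counts as one lookup, and the anchor stores a single link to the newest content. `latest_by_walking`, `replaced_by` and `links` are hypothetical names for illustration only.

```python
def latest_by_walking(replaced_by: dict, first: str):
    """Follow old -> new pointers from the first version; returns (latest, hops)."""
    current, hops = first, 0
    while current in replaced_by:
        current = replaced_by[current]
        hops += 1
    return current, hops


# A course edited 50 times: v0 -> v1 -> ... -> v50
replaced_by = {f"v{i}": f"v{i + 1}" for i in range(50)}
latest, hops = latest_by_walking(replaced_by, "v0")

# With the anchor -> content pattern, one lookup on the anchor is enough,
# however many edits there have been.
links = {"anchor": ["v50"]}
latest_via_anchor = links["anchor"][0]
```

Walking the chain takes one hop per update, while the anchor lookup stays constant as the number of updates grows.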

So in general, I think the biggest problem is the mental model that devs form when they are interacting with holochain and the hdk, and which functions execute which requests, what is the performance of those requests and where is the metadata stored. I think it's critical that devs understand the fundamental mechanisms that they are using, to try to come up with efficient app designs.

I think it'd be great to put some effort into clarifying this with some visual tool... I'll try to come up with something simple.

sevenshadow commented 4 years ago

Thanks for all of the thinking in this thread. I understand the thought that "all activities should be idempotent," but it seems like if that is true it severely limits holochain as a platform unless:

  1. We can come up with a clever way to handle updates and transactions that are not idempotent by their nature but need to be
  2. We always have another data store which allows us to handle this type of requirement. However, this seems to go against the whole notion of holochain; or
  3. (Similar to 2) we only use holochain in some very specific use-cases that demand it.

From an adoption point of view, I am concerned that we are going to have difficulties if it is really hard or overly complicated to implement scalable CRUD functionality. Maybe it just takes a layer of abstraction over the top of holochain to handle this type of plumbing.

It may be argued that we are looking at this the wrong way (i.e. bringing our RDB way of thinking to holochain). That may be true, but I probably need some help and examples to think about how to think in the holochain way and still solve real-world problems.

Thanks everyone. I really appreciate you digging in here. Let me know if there is something I can do to help.

sevenshadow commented 4 years ago

Here is just a thought: when we update (i.e. save a new entry), maybe we could save a reference to the entry that the new (updated) entry was based on (hash and timestamp). That way, if someone needed to look for a conflict, they could see whether there were any updates between the one they based theirs on and the moment someone else saved. (Maybe this is already in the plumbing now. I don't know.) Just throwing ideas out...

e-nastasia commented 4 years ago

@guillemcordoba @pdaoust @pospi @sevenshadow thank you for all the ideas here, I am learning a lot!

@guillemcordoba I agree about the update_entry method and the confusion it creates in the mental model when learning about developing on Holochain. Perhaps phrasing it more like create_entry_update (and focusing on that during the explanations) would've helped, because in this case it's easier to remember that we're actually creating a new piece of data.

@sevenshadow I remember an idea Art suggested about using Holochain as an infrastructure for some other storage that provides functionality that's desirable by an hApp but isn't possible in Holochain. Having this as an option might help with adoption even though it complicates things by adding new tools.

Maybe we could save a reference to the entry that the new (updated) entry was based on (hash and timestamp)

I like this idea! This can be one of the patterns for app-level conflict resolution.

guillemcordoba commented 4 years ago

From an adoption point of view, I am concerned that we are going to have difficulties if it is really hard or overly complicated to implement scalable CRUD functionality. Maybe it just takes a layer of abstraction over the top of holochain to handle this type of plumbing.

The pattern we are using in leap (anchor -> content) is already a good scalable and performant pattern. It's just that sometimes holochain requires you to split your "entities" that in a normal DB would be stored in only one table into two (in this case, stable ID and updatable content).

Maybe we could save a reference to the entry that the new (updated) entry was based on (hash and timestamp).

The thing is... we can't really rely on timestamping the updates. Imagine you bootstrap a network and create an entry, and all the agents see it. Then the network is partitioned, and in one partition one agent updates it and in the other another agent updates it as well, pointing to different versions. And then the partitions join again. We cannot trust those timestamps, because this is too easily hacked: I can say I've been on a partition by myself (or maybe with some companion bad agents) and put a timestamp previous to your update. Also holochain does not do global time, e.g. it does not assume that an event in a source chain with a lower timestamp than one in another source chain happened before... It's all about creating a local traceable sequence of events.

e-nastasia commented 4 years ago

@guillemcordoba

The thing is... we can't really rely on timestamping the updates.

What if we save the address of the previous entry without any timestamp? This would give us information about the source without relying on the time.

guillemcordoba commented 4 years ago

What if we save the address of the previous entry without any timestamp? This would give us information about the source without relying on the time.

But in the case of update_entry, holochain will always keep the last address of the entry (now that's not the case). The partition scenario looks like:

When the partitions join, you have entry A updated to B and entry A updated to C. So how do you decide?

sevenshadow commented 4 years ago

Thanks @guillemcordoba for your expert response. I appreciate it.

e-nastasia commented 4 years ago

@guillemcordoba

But in the case of update_entry, holochain will always keep the last address of the entry (now that's not the case).

Not sure I get this one. Do you mean the address previously used for the update would be available when doing the next update?

When the partitions join, you have entry A updated to B and entry A updated to C. So how do you decide?

Yes, it wouldn't be possible to automate without additional data. My thought was that having a reference to the "base" entry version (the one the current update is based on) would only help to develop some application-dependent conflict resolution mechanism.

guillemcordoba commented 4 years ago

@f00bar42 So, your second insight is exactly right: this is the DHT conflict resolution callback mechanism that will be implemented in holochain.

As far as your first question goes... Holochain's headers contain an optional "replaces_entry_address" field, which contains the hash of the previous version of the entry if you call update_entry (at least it will contain that haha). This is how we know what the previous version of the entry was at the time it was updated. Does this answer your question?

Let me try one thing though... Can you go to the holochain playground I made this last weekend and create an entry, and then update it? It's a little experiment that maybe can give you the technical details of what's happening under the hood.
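As a rough illustration of how the "replaces_entry_address" field mentioned above could feed app-level conflict detection, here is a toy Python sketch. Only the field name comes from this thread. The `Header` shape, its `entry_address` and `author` fields, and `find_branches` are assumptions made for the example, not the actual Holochain header format. The sketch only detects branches; deciding which branch wins is left to the application.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Header:
    """Hypothetical, simplified header: only the fields this sketch needs."""
    entry_address: str
    author: str
    replaces_entry_address: Optional[str] = None


def find_branches(headers):
    """Return {base address: sorted successors} for every entry updated more than once."""
    successors = defaultdict(set)
    for header in headers:
        if header.replaces_entry_address is not None:
            successors[header.replaces_entry_address].add(header.entry_address)
    return {base: sorted(nexts) for base, nexts in successors.items() if len(nexts) > 1}


# Partition scenario from this thread: A is updated to B on one side and to C on the other.
headers = [
    Header("A", "alice"),
    Header("B", "alice", replaces_entry_address="A"),
    Header("C", "bob", replaces_entry_address="A"),
    Header("D", "bob", replaces_entry_address="C"),
]
branches = find_branches(headers)
```

Once the partitions rejoin, `branches` shows that A has two competing successors, which is the point where an app-specific rule would have to choose.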