orbitdb-archive / ipfs-log

Append-only log CRDT on IPFS
https://orbitdb.github.io/ipfs-log/
MIT License
398 stars 55 forks source link

Feat/pass in tiebreaker func #209

Closed aphelionz closed 5 years ago

aphelionz commented 5 years ago

Description (Required)

This PR adds an optional argument to the Log constructor allowing one to add in their own sorting function to be used when the order of the log is calculated.

TODO

Resolves #204

haadcode commented 5 years ago

Would be good to add it to the docs and especially describe the expected function and its signature.

satazor commented 5 years ago

Looking at the PR it seems that lamport clocks are still being used and actually stored in entries, correct? This is something that concerns me because in a highly collaborative scenario, the clock will take a lot of space.

You need clocks to see if a replica has new data that you need to pull from it? In the original versidag scenario, the Merkle DAG causality and partial order properties are enough. We just need to order "concurrent" entries in the dag deterministically. For instance, using something like a timestamp would be sufficient in most scenarios.

Ideally, the lamport clocks shouldn't be part of the dag node entries or perhaps opt-in.

satazor commented 5 years ago

Expanding on this, you define a replication mechanism/protocol for various instances of ipfs-log to use and sync, that's why you need the clocks correct?

aphelionz commented 5 years ago

@satazor I'd like to merge that change separately from this one. For the sole purposes of passing in a custom tiebreaker function, will this suffice?

satazor commented 5 years ago

@aphelionz I would say that it looks good, one minor thing: what if the tiebreaker function returns 0? Should we allow that (non-deterministic) scenario? If not, then we should have a verification and raise an exception. If we allow, then we should wrap the tiebreaker so that when 0 is returned, we fallback to using something like comparing ids.

aphelionz commented 5 years ago

Very interesting thought, what are your thoughts @shamb0t and @haadcode?

aphelionz commented 5 years ago

Satazor, just to answer this question:

Expanding on this, you define a replication mechanism/protocol for various instances of ipfs-log to use and sync, that's why you need the clocks correct?

That's correct. We need a deterministic way to sort entries from separate logs and the Lamport Clock is a way of providing that for us

haadcode commented 5 years ago

Let's clarify on the concerns of Lamport Clocks:

This is something that concerns me because in a highly collaborative scenario, the clock will take a lot of space.

Lamport Clocks do not take a lot of space as the clock is always { id, timestamp } pair. The clock doesn't contain other peers' IDs or clocks. The growing clock size is an issue with vector clocks where each known ID will be part of the clock. We decided to use Lamport Clocks originally, after considered various others, because of that property and to enable "collaborations" where (new) peers come and go.

You need clocks to see if a replica has new data that you need to pull from it?

We do "replication synchronization" based on the entries themselves (=hashes) which gives us the notion of "is this new data or do I already have it". This works very nicely and in fact, this allows us to process updates partially (eg. when part of the "new log" contains part of the "old log" we stop processing), automatic deduplication due to content addressed storage :)

Note that the "replication synchronization" only happens when peers connect (they exchange their latest known heads) after which all communication happens through pubsub (so it's more "push" than "pull" in that sense).

Ideally, the lamport clocks shouldn't be part of the dag node entries or perhaps opt-in.

I believe with what @aphelionz has done here will do exactly that: by default the sorting (tiebreaker) uses the clocks, but if the users provides a custom sort function, then the clocks are not used.

haadcode commented 5 years ago

what if the tiebreaker function returns 0? Should we allow that (non-deterministic) scenario? If not, then we should have a verification and raise an exception. If we allow, then we should wrap the tiebreaker so that when 0 is returned, we fallback to using something like comparing ids.

Great point! The current default (LWW) strategy does exactly that (timestamp first, if 0 then IDs), but your point is very interesting in the case the user provides their own sorting function.

I suppose at the end of the day, the user provided tiebreaker is "the ultimate conflict resolution" meaning that it's the last authority to say how two entries compare. So from that perspective, I think your first proposal (verify and raise an exception) would make sense, plus I think it's good dev experience when the module/lib tells you when you're doing something wrong (and doesn't allow to continue).

At the same time, we do have the option to "wrap" it as you propose, and could fallback on providing the default "ultimate tiebreaker" when user-provided tiebreaker returns 0.

aphelionz commented 5 years ago

I propose the first option - raise an exception if the tiebreaker returns zero. That is explicit and will reduce confusion at every level. Wrapping the function could easily be perceived like "hidden magic."

aphelionz commented 5 years ago

@satazor @haadcode Ok, fixed the tests and added a thrown error if the comparator returns zero. Going to button up docs. Please look it over one more time, particularly the tests I have written, and let me know if you see anything else.

satazor commented 5 years ago

@aphelionz Looks good to me.

haadcode commented 5 years ago

LGTM 👍