mafintosh / hyperdb

Distributed scalable database
MIT License

To prevent mutation of history, use Merkle root hashes in addition to sequence numbers #41

Open xloem opened 6 years ago

xloem commented 6 years ago

EDITED to include new discussion and work

hyperdb identifies node relationships using a feed key and a sequence number in that feed.

Because node relationships carry no hash of the feed's contents, anybody who holds the private key of a feed in a hyperdb can rewrite that feed's history. (This would mean either a duplicitous publisher or a compromised private key.) Clients that have already synced the old history will refuse to sync the rewritten feed, but clients seeing it for the first time cannot tell the difference. In this scenario, a hyperdb version does not uniquely identify the contents or history of the database at all.

To move towards solving this, rootHashes has been added to the hypercore API. A single hash of all the root nodes in each feed should now be added to each newly generated node, augmenting the sequence numbers and the vector clock. This hash should be verified whenever new nodes are read, and it should be used by the HyperDB.prototype.version and HyperDB.prototype.checkout functions to identify and verify versioned snapshots of the database.
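For illustration, the per-feed piece of that hash could look something like this (a minimal sketch only; feedRootDigest is a hypothetical helper, feed.rootHashes(index, cb) is the new hypercore API mentioned above, and SHA-256 stands in for whatever hash hyperdb would actually pick):

```js
var crypto = require('crypto')

// Hash all of a feed's current merkle roots into one digest.
// feed.rootHashes(index, cb) yields the merkle roots covering the feed up
// to that index; hashing them together pins the feed's entire history.
function feedRootDigest (feed, cb) {
  if (feed.length === 0) return cb(null, crypto.createHash('sha256').digest())
  feed.rootHashes(feed.length - 1, function (err, roots) {
    if (err) return cb(err)
    var hash = crypto.createHash('sha256')
    for (var i = 0; i < roots.length; i++) hash.update(roots[i].hash)
    cb(null, hash.digest())
  })
}
```

Verifying a node would then mean recomputing this digest from the feed a peer actually served and comparing it against the digest stored in the node.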

xloem commented 6 years ago

@mafintosh when I originally looked at this, I wasn't aware of contentFeed, which isn't mentioned in the readme yet. The above solution wouldn't verify the integrity of the content feeds.

It seems the content feeds and the writer feeds must all have their root hashes hashed together, and this value included alongside the vector clock. In addition, so that the content feed root hashes can be checked, the lengths of the feeds at the time of hashing must also be included (see the sketch at the end of this comment).

Would you agree that's the way to go? I'm checking this solution with you because the extra length information would make the metadata bigger.
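For concreteness, a minimal sketch of that combined value, reusing the hypothetical feedRootDigest helper from the earlier sketch (the flat feeds array and the shape of the result are assumptions, not hyperdb's actual layout):

```js
// Fold the root digests of every writer feed and content feed into one hash,
// and record each feed's length so a verifier knows at which index to call
// rootHashes. Feeds are processed sequentially to keep the sketch simple.
function combinedRootDigest (feeds, cb) {
  var hash = require('crypto').createHash('sha256')
  var lengths = []

  next(0)

  function next (i) {
    if (i === feeds.length) return cb(null, { digest: hash.digest(), lengths: lengths })
    lengths.push(feeds[i].length)
    feedRootDigest(feeds[i], function (err, digest) {
      if (err) return cb(err)
      hash.update(digest)
      next(i + 1)
    })
  }
}
```

The lengths array is the extra metadata in question: without it, a verifier wouldn't know at which point in each content feed to recompute the root hashes.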

mafintosh commented 6 years ago

@xloem good point. let me think about this a bit. the use case of the content feed is to just store a pointer instead of the value in the metadata to make it smaller, so it'd be good to avoid adding too much extra data

xloem commented 6 years ago

@mafintosh I think a good solution might be to include only the length of the posting writer's content feed. An API function could be added to update this length if the content feed is written to without an accompanying metadata update, which would allow the user to make the tradeoff between synchronization and size when that happens.
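In code, the fields this variant would attach to each new node might look like this (a sketch only, building on the hypothetical helpers from my earlier comments; the function and field names are illustrative, not hyperdb's actual format):

```js
// Integrity fields for a new node: the hash covers all metadata (writer)
// feeds plus only the posting writer's content feed, and contentLength pins
// the content feed index the hash was computed at.
function nodeIntegrityFields (writerFeeds, contentFeed, cb) {
  combinedRootDigest(writerFeeds.concat(contentFeed), function (err, result) {
    if (err) return cb(err)
    cb(null, {
      rootHash: result.digest,
      contentLength: result.lengths[result.lengths.length - 1] // content feed length when hashed
    })
  })
}
```

The API function mentioned above would then amount to re-running this whenever the content feed grows without an accompanying metadata write.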

As a side note, I've implemented some root hashing in https://github.com/xloem/hyperstream as a client of this library. I hash all the hyperdb roots together with the posting writer's contentFeed roots, as I suggest above, and basically don't consider contentFeed data beyond the last hash. I will have more than one stream of content data produced by a source, which I plan to handle by creating multiple writers. Unfortunately, my current solution involves accessing private members of hyperdb that aren't exposed via API functions yet.

xloem commented 6 years ago

hey @mafintosh, did you ever end up forming an opinion on this? would you object if it were implemented the way I suggest, by including only the posting writer's content feed in the root hashes?

mafintosh commented 6 years ago

@xloem I think what you propose makes sense. We'd prob have to put this extra hash behind a flag for now if we impl it, so send a PR :+1: