
content addressable data structure benefits #305


chafey commented 4 years ago

@rvagg and I were discussing the need to create a list of content-addressable data benefits, so I started this thread to help capture them. We'll eventually move this into a spec page, I suppose. Here are some off the top of my head:

1) Identity - data is self-identifying; there is no need for externally generated identifiers (like database auto-increment fields). Think of git: workflows revolve around the hashes of the underlying commits.

2) De-duplication - a given block is stored only once even if it is part of many independent DAGs, which reduces resource consumption (storage, bandwidth, memory), so try to model data into chunks that are reusable. Not sure git really does this, but ZFS does. (See the block-store sketch after this list.)

3) Authentication/Validation - a block of data can always be authenticated or validated for correctness by checking its hash. Even a one-bit difference in the data results in a different hash, allowing applications to detect bad data. Of particular interest is a client's ability to validate the data, which eliminates the need to trust the server. Think of git: data is never corrupted because it is validated with hashes.

4) Immutability - there is no way to change data, as doing so results in a new hash and therefore new data. Similar to 3 above.

5) Change Tracking - since data is immutable, changes are essentially appends into a DAG, creating a history of changes like you find in git.

6) Cacheability - since data is immutable, it can be cached without worrying about keeping it in sync with anyone else. Think of git: fetch operations are fast because only changes are fetched.

7) Syncability - large data graphs can easily be synchronized between two peers by traversing the graph and sending only the blocks that are missing. (See the sync sketch below.) Think of git: fetch operations are fast and safe regardless of what state your repository is in.

8) Disconnected / offline-first designs - a disconnected or offline-first approach to data management is a natural fit for content-addressable data, by leveraging CRDTs. Think of git: developers work offline all the time and synchronize back to master when they are ready.

9) Scalability - by treating your data store as disconnected/offline-first, you can achieve unlimited scalability of data ingestion. Getting an entire view of the data can be done by merging trees or federated queries. Think of git: millions of developers modify repositories daily and there is no scalability issue.

10) Reliability - related to 6, 7, and 8: you can make systems more resilient by making them disconnected/offline-first. Think of git: you can always get work done; it's just that collaboration with others isn't possible until you can connect to them over the network.
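To make items 1-4 concrete, here is a minimal sketch in Go of a store whose keys are derived from content. It uses only the standard library; `blockStore`, `put`, and `get` are hypothetical names for illustration, not IPLD's actual API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// blockStore is a hypothetical content-addressed store: a block's key is
// the hex-encoded SHA-256 of its bytes, so identity (1), de-duplication (2),
// validation (3), and immutability (4) all fall out of the keying scheme.
type blockStore struct {
	blocks map[string][]byte
}

func newBlockStore() *blockStore {
	return &blockStore{blocks: make(map[string][]byte)}
}

// put derives the block's identity from its content; storing the same
// bytes twice is a no-op, which is de-duplication for free.
func (s *blockStore) put(data []byte) string {
	sum := sha256.Sum256(data)
	id := hex.EncodeToString(sum[:])
	if _, ok := s.blocks[id]; !ok {
		// Copy so a stored block can never be mutated out from under its hash.
		s.blocks[id] = append([]byte(nil), data...)
	}
	return id
}

// get re-hashes the block before returning it, so the caller never has to
// trust the store: corrupted bytes fail validation instead of going unnoticed.
func (s *blockStore) get(id string) ([]byte, error) {
	data, ok := s.blocks[id]
	if !ok {
		return nil, fmt.Errorf("block %s: not found", id)
	}
	sum := sha256.Sum256(data)
	if hex.EncodeToString(sum[:]) != id {
		return nil, fmt.Errorf("block %s: failed validation", id)
	}
	return data, nil
}

func main() {
	store := newBlockStore()
	a := store.put([]byte("hello"))
	b := store.put([]byte("hello")) // same content => same identity, stored once
	fmt.Println(a == b)             // true
	data, err := store.get(a)
	fmt.Println(string(data), err) // hello <nil>
}
```

Because the key is the hash of the bytes, there is nothing to coordinate: any two parties that hash the same bytes independently arrive at the same identity.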
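Continuing the sketch above, item 7 (syncability) is roughly a breadth-first walk from a root that transfers only the blocks the local store is missing. `pull` and `children` are likewise hypothetical names; `children` stands in for whatever codec-specific logic extracts a block's outgoing links.

```go
// has reports whether the store already holds the block with this identity.
func (s *blockStore) has(id string) bool {
	_, ok := s.blocks[id]
	return ok
}

// pull copies the DAG rooted at rootID from remote into local, transferring
// only the blocks local is missing. children is a hypothetical, format-specific
// decoder that returns the IDs a block links to.
func pull(local, remote *blockStore, rootID string, children func([]byte) []string) error {
	queue := []string{rootID}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		if local.has(id) {
			// Assuming stores only ever hold complete sub-DAGs (as git does),
			// having this block means having everything beneath it too.
			continue
		}
		// get validates the hash, so a bad peer can't hand us corrupt blocks.
		data, err := remote.get(id)
		if err != nil {
			return err
		}
		local.put(data)
		queue = append(queue, children(data)...)
	}
	return nil
}
```

This is essentially what makes git fetch cheap and safe regardless of repository state: validation is built into retrieval, and already-present blocks prune whole subtrees from the transfer.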

rvagg commented 4 years ago

Good list. I felt that my list of 3 was more underwhelming than it should have been, and this expands it well.

This should go into ipld/docs, but here's my critique first:

Anyone else got items for this list before we try to doc-u-fy it?

chafey commented 4 years ago

The list wasn't prioritized, and we may not want to do so, as some properties will appeal to some people/applications more than others. I know that identity is really important in healthcare informatics, probably more so than authentication. There is overlap between most of these, which has made it really hard for me to communicate them in a progressive manner (where do you start when everything is inter-related?). I do think that iterating on how we evangelize, communicate, and teach the world this stuff is important, as it requires a larger-than-normal mental investment from the audience right now.