Decentralized (vs. Distributed vs. Federated)

linas commented 5 years ago

Ongoing remarks about a decentralized atomspace. Pertains to issues #1855, #1502 and #1967. The key concepts are:

distributed - there is one database (atomspace), but it is distributed over many machines. local copies of the atomspace are just "views" into a single, central "whole" atomspace.
federated - multiple atomspaces, each atomspace is accessible from an atomspace server, with atomspace updates published using zeromq, or REST or protobuf or whatever. Each server is authoritative for its data.
decentralized - multiple atomspaces, but without an atomspace server. The concept of "authoritative" version of the data is not baked into the system. Authority is determined at a different layer (e.g. consensus, collaboration, voting, respect, kudos, brute-force, violence, bribery, whatever).

These three concepts are often confused with one another and taken to be synonyms; they are not. Comments below unpack these in greater detail, listing the pros and cons.

linas commented 5 years ago

A distributed atomspace attempts to maintain the illusion that there is one single set of data, of which any particular machine might just hold a few shards. The concepts of "ACID" and "BASE" apply. So, for example, atomic updates and eventual consistency are both strategies for updating data in such a way that one maintains a consistent data state (either immediately, by locking: "ACID", or eventually, by propagation of updates: "BASE").
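
To make the two update strategies concrete, here is a minimal sketch in Python. It is not atomspace code: `LockedStore` and `EventualStore` are hypothetical names, the replicas are plain dicts in one process, and conflict resolution (which any real eventually-consistent system needs) is left out.

```python
import threading


class LockedStore:
    """ACID-flavoured: a write holds a lock and updates every replica before returning."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:
            for replica in self.replicas:
                replica[key] = value


class EventualStore:
    """BASE-flavoured: a write lands on one replica and is queued for the others."""

    def __init__(self, n_replicas):
        self.replicas = [dict() for _ in range(n_replicas)]
        self.pending = []  # queued (replica index, key, value) updates

    def write(self, origin, key, value):
        self.replicas[origin][key] = value
        for i in range(len(self.replicas)):
            if i != origin:
                self.pending.append((i, key, value))

    def propagate(self):
        """Deliver queued updates in order; afterwards the replicas agree."""
        for i, key, value in self.pending:
            self.replicas[i][key] = value
        self.pending.clear()


acid = LockedStore(3)
acid.write('(Concept "cat")', 0.9)      # all replicas agree immediately

base = EventualStore(3)
base.write(0, '(Concept "cat")', 0.9)   # replicas 1 and 2 are stale for now
base.propagate()                        # ... and consistent after propagation
```

The trade-off is visible in the shape of the code: the locked version pays for consistency on every write, while the eventual version is cheap to write to but exposes a window in which readers of different replicas see different data.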

The pros and cons:

Plus: when it's distributed, you can have a bigger dataset: a dataset that is too big to fit on one machine.
Minus: issues arise with data ownership and contextual knowledge. The data-ownership problem is described in #1855: there is one very large "master" dataset, which should be treated as read-only and contains the "best" copy of the data (e.g. genomic data), and then multiple read-write deltas to it, created/explored by individual researchers who want to try new algos out without wrecking the master copy, and who don't want to make a full copy of the master.
The context link (issue #1967) only "partly" solves this. It still implicitly assumes that there's a single master. The very concept of "master copy" is still baked into the system.
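
The "read-only master plus per-researcher deltas" arrangement can be sketched with Python's standard `collections.ChainMap`. This is only an illustration of the idea, not how the context link or any atomspace backend works; the gene names, the `confidence` field and the `open_delta` helper are all made up for the example.

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical "master" dataset: atoms keyed by their s-expression string.
_master_data = {
    '(Gene "BRCA1")': {"confidence": 0.99},
    '(Gene "TP53")': {"confidence": 0.98},
}
# Read-only view: attempts to assign through MASTER raise TypeError.
MASTER = MappingProxyType(_master_data)


def open_delta(master):
    """Return a read-write view; all changes are kept in a private top layer."""
    return ChainMap({}, master)


alice = open_delta(MASTER)
bob = open_delta(MASTER)

alice['(Gene "BRCA1")'] = {"confidence": 0.5}  # shadows the master entry, for Alice only
bob['(Gene "NEW1")'] = {"confidence": 0.1}     # exists only in Bob's delta

alice_changes = alice.maps[0]                  # the delta itself, without the master
```

Note that the sketch has exactly the limitation described above: there is still one privileged bottom layer, so the notion of a master copy is baked in.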

Please note that the atomspace is already distributed. See the demo here. This can be made to work on a large scale, because postgres is already massively scalable. So in a certain sense, that part is done. What is unsolved is the multi-user and authority-of-update issues surrounding this.

linas commented 5 years ago

A decentralized atomspace acknowledges that there is no single master copy, and that instead there are peers. Now some peers might be more authoritative, more correct, more knowledgeable than other peers, but the process for determining who is authoritative can be made to lie outside of the atomspace implementation. Determination of Authority is done at some other layer, and not hard-wired into the atomspace design.
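
One way to picture "authority at a different layer" is a storage layer that only collects what each peer says, with the decision rule passed in from outside. The sketch below is an assumption-laden toy, not a proposed design: `Peer`, `gather`, `majority_vote` and `trust_weighted` are hypothetical names, and values are plain floats.

```python
class Peer:
    """A peer holds its own data; nothing marks any peer as the master."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # atom (as a string) -> value


def gather(peers, atom):
    """Collect every peer's claim about an atom as (peer name, value) pairs."""
    return [(p.name, p.data[atom]) for p in peers if atom in p.data]


def majority_vote(claims):
    """Authority policy 1: the most commonly held value wins (ties are arbitrary)."""
    values = [value for _, value in claims]
    return max(set(values), key=values.count)


def trust_weighted(trust):
    """Authority policy 2: believe whichever claiming peer has the highest trust score."""
    def resolve(claims):
        return max(claims, key=lambda claim: trust.get(claim[0], 0.0))[1]
    return resolve


a, b, c = Peer("a"), Peer("b"), Peer("c")
atom = '(Concept "cat")'
a.data[atom] = 0.9
b.data[atom] = 0.9
c.data[atom] = 0.2

claims = gather([a, b, c], atom)
by_vote = majority_vote(claims)                 # 0.9: two peers out of three
by_trust = trust_weighted({"c": 1.0})(claims)   # 0.2: only peer c is trusted
```

The same claims give different answers under different policies, and neither `Peer` nor `gather` had to change; that separation is the point.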

Pros and cons:

Plus: performance is still good, because you can still have a copy of the data that you need, locally, in RAM.
Cons: the stuff that does not fit in RAM still has to go somewhere. Today, that means in your postgres backend.
Cons: the mechanics of decentralization are unclear, and need to be worked out.

linas commented 5 years ago

The concept of federation is that everybody runs their own server, and they exchange data with one-another. Classic examples of federation are email-servers, IRC servers, diaspora pods, etc. That is, there are owners/admins who run the server, and lots of users who use the server.

For the atomspace, users communicate with the servers using REST, or protobuf, or zeromq or ROS messages or whatever. (I don't care, as long as the performance is good and the API is maintained.)

Pros and cons:

Federation gives the appearance of decentralization, without actually providing it. For example, it's a lot easier to use gmail or yahoo than it is to install and operate your own mail server.
Federation often leads to a lowest-common-denominator feature set. If server A has a whizz-bang feature that server B does not support, then all users of server B lose. Worse, the whizz-bang feature never catches on in popularity; it's blocked by adoption speedbumps. That's one reason why "web 1.0" standards like email and irc remain stuck in the backwaters.
Lowest-common-denominators are killed by walled gardens. Facebook helped kill email. Slack helped kill IRC. But now you are locked into a walled garden that you cannot escape.

linas commented 5 years ago

The goal of this issue is to define some way of having decentralized atomspaces without the down-sides of federation, and without the authority-control issues of a distributed atomspace.

linas commented 4 years ago

Notes:

Read the "Big Graph Anti-Pattern" blog post: https://blog.blazegraph.com/?p=628 which implies that blazegraph is a good solution (of course it does, they are selling their own product) http://sourceforge.net/projects/bigdata/ http://www.blazegraph.com/

linas commented 4 years ago

The https://github.com/opencog/atomspace-cog/ client-server implementation provides a reasonably-fast quasi-peer-to-peer quasi-distributed atomspace. A collection of these could provide a true decentralized implementation if two things are provided:

A way of locating other peers, with whom cooperative messaging/distribution could be carried out. A cornerstone of peer-to-peer communication is cached query results, provided by https://github.com/opencog/atomspace-cog/ (see examples, remote-query.scm).
A simple, fast file-storage system. This provides two things: access to "more data than fits in RAM", and persistence in the face of power outages, as well as the ability to manage datasets via the file system and/or distribute large datasets via network filesystems (ipfs but also web...)

linas commented 4 years ago

Candidate key-value stores:

LevelDB - https://github.com/google/leveldb - Single-user, fast, optimized for disk. C++. Works with std::string (yay! no UUIDs!) Has Debian and Ubuntu packages.
RocksDB - https://rocksdb.org C++ Has Debian and Ubuntu packages. Fork of leveldb, with optimizations for Flash.
HyperLevelDB - fork of LevelDB, optimized for write
PebblesDB - fork of HyperLevelDB, optimized for write. Claims to fix read performance problems of HyperLevelDB. Drop-in compatible with LevelDB and HyperLevelDB. NO Debian/Ubuntu packages!

Rocks seems to have better and more balanced performance. LevelDB seems to have 15% smaller files than Rocks. HyperLevelDB trades 2x faster write for 4.5x slower query.

Done. See https://github.com/opencog/atomspace-rocks/
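
For readers unfamiliar with why string keys matter here: with a plain key-value store, an atom's own s-expression can be the key, so no separate ID table is needed. The sketch below uses Python's standard-library `dbm.dumb` purely as a stand-in for LevelDB/RocksDB; it is not the atomspace-rocks on-disk format, and storing the value as a `(stv ...)` string is an assumption made for the example.

```python
import dbm.dumb
import os
import tempfile


def store_atom(db, atom, value):
    """Use the atom's s-expression string directly as the key."""
    db[atom.encode("utf-8")] = value.encode("utf-8")


def fetch_atom(db, atom):
    """Return the stored value as a string, or None if the atom is unknown."""
    raw = db.get(atom.encode("utf-8"))
    return None if raw is None else raw.decode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "atoms")

    # Write, then close: the data is now on disk.
    with dbm.dumb.open(path, "c") as db:
        store_atom(db, '(Concept "cat")', "(stv 0.9 0.8)")

    # Reopen as a fresh handle, as a restarted process would.
    with dbm.dumb.open(path, "r") as db:
        restored = fetch_atom(db, '(Concept "cat")')
        missing = fetch_atom(db, '(Concept "dog")')
```

Because the store is just files in a directory, the dataset can be copied, archived or shared with ordinary file-system tools, which is the second property asked for above.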

linas commented 4 years ago

Assembling pieces-parts: https://github.com/opencog/atomspace-agents/

Why: copy of long email:

I want to talk about "service meshes". The problem with shopping for cassandra, or any of the other suggested databases, is that they are all "monolithic black boxes". You pick one, and you get what you get: whatever is provided, that's what it is. Sure, some configuration files somewhere allow you to tune this and that, but that's all.

The service mesh idea (and the npm/js idea before that) is to assemble your system out of small, self-contained pieces. Sure, the object-oriented folks have been talking about this for 3 or 4 decades, and it's cited as the raison-d'etre for things like C++. But C++ never lived up to this ideal. There are no generic C++ frameworks. None. At All. (OK, so SGI had one or two in the early 1990's ...) Something is ... missing... in C++. Compare this to node.js and npm which are wildly successful over-achievers in this category. People regularly build large applications by assembling a cacophony of tiny little javascript parts. Clearly, javascript has something that C++ does not. Something that makes the OO dream achievable not just in theory, but regularly validated in practice.

Now, there are some down-sides to npm apps: they contain hundreds or thousands of parts, and not all of them are well-maintained, and many have published security vulnerabilities that remain unpatched. Worse, patching some of them requires incompatible API changes that would break users. So it has its own prickly and thorny issues that are unique and different from those that other languages (python, scheme, c++) suffer from.

In the cloud world, there has long been, and continues to be, a movement to meshes of containerized applications. Here, docker is the prototypical container -- lxc/lxd/lxe more generally. Managing these containers requires kubernetes, and more: the "service meshes" (istio, microsoft open service mesh) provide a layer (a "control plane") that further manages deployments, error fallbacks, a/b testing, circuit-breakers, load-balancing, etc. The mental model is that containerized apps are just like npm nodes, except they are a million times bigger and beefier (literally) and they all have network interfaces instead of javascript methods/objects. And since they are so much bigger, they need more active management.

Now compare the service-mesh idea to the olde-fashioned ideas of "web shopping carts" or "content management systems" or "customer relationship management systems". Those things were single, monolithic black boxes that you bought from a vendor (or installed via open-source) that automagically did everything for you, once you configured a few templates. They worked great, as long as what you wanted was (a) a web shopping cart, and (b) was customizable via some template or config file. If not .. you were SOL.

These monolithic architectures were their downfall, and were the driver toward containers, kubernetes and service meshes. The founders of cloud startup XYZ can't spray-paint some config files onto a monolith and then raise $20M in venture funding. But, give them a bunch of pieces-parts containers, that they can hook up in some new, novel and exciting way, plus a little secret sauce, and buzzword-bingo, a unicorn is born.

And this is why Cassandra makes me yawn with disinterest, if not a bit of hostility. It's a big monolithic block. Sure, I can take the AtomSpace, and plaster it onto Cassandra, like wrapping some wet paper around a rock. The ultimate shape is still that of the rock, no matter how brightly-colored or thoughtful that paper wrapped around it is.

So, I'm trying to grab hold of this idea of pieces-parts. OpenCog needs pieces-parts that can be arranged and re-assembled into that mesh that provides the distributed-atomspace attributes and requirements du-jour.

Yes, of course, singularity.net is also pursuing a vision of pieces-parts that can be assembled. Which is why I am a bit dumb-founded that we are entertaining ideas like Cassandra -- it is the very antithesis of modular architecture. It's the opposite of a dapp -- It's a big giant lump, the one ring to rule them all. It's kind of exactly the poster-child for what not to do ...

For a distributed atomspace, what we really need to focus on is inter-operability, so that, like javascript (and unlike c++) it is easy to assemble modules out of other modules. Like containers, there should be some fairly regularized API for communications (I nominate atomese-as-ascii-strings i.e. s-expressions and maybe plan-B atomese-as-json). With this under control, we can move on to creating unique, custom services aka agents aka dapps or whatever these other things might be.
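
As a rough illustration of the two wire formats nominated above, here is a sketch that renders the same small atom tree as an s-expression string and as JSON. The in-memory tuple representation and the JSON layout (`type`, `name`, `outgoing`) are assumptions made for this example; they are not taken from any existing AtomSpace API.

```python
import json


def to_sexpr(atom):
    """Render a nested (type, *args) tuple as an Atomese-style s-expression."""
    kind, *args = atom
    parts = [json.dumps(a) if isinstance(a, str) else to_sexpr(a) for a in args]
    return "(" + " ".join([kind] + parts) + ")"


def to_json(atom):
    """Render the same tuple as a JSON-ready dict (hypothetical layout)."""
    kind, *args = atom
    if len(args) == 1 and isinstance(args[0], str):
        return {"type": kind, "name": args[0]}          # a node: type plus name
    return {"type": kind, "outgoing": [to_json(a) for a in args]}  # a link


atom = ("Evaluation",
        ("Predicate", "likes"),
        ("List", ("Concept", "Alice"), ("Concept", "Bob")))

as_sexpr = to_sexpr(atom)
as_json = json.dumps(to_json(atom))
```

Either form is plain text, so any module that can read and write strings over a socket can take part, regardless of the language it is written in.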

linas commented 1 year ago

Closing. Everything here is now possible with a combination of ProxyNodes and StorageNodes. See https://wiki.opencog.org/w/Networked_AtomSpaces for details.

opencog / atomspace

Decentralized (vs. Distributed vs. Federated) #2138