microsoft / CCF

Confidential Consortium Framework
https://microsoft.github.io/CCF/
Apache License 2.0

The CocoNet proposal: Resource balancing and scalability via fractalization #821

Closed: douglasism closed this issue 4 years ago

douglasism commented 4 years ago

Preface

As a way to learn about CCF and attempt to reach an understanding of its design and implementation, I've been imagining how it might be deployed at scale. As soon as I began to consider this, I started wondering how all of these nodes would be hooked up and how the consortium would pay for the network traffic, the compute cycles, and the storage. (Costs are on my mind because I let my tinkering in Azure get out of hand last fall --- it ain't free.) This exercise has several benefits for me, and I hope that you don't stop reading my proposal when I admit them.

First, I don't have much experience running large-scale operations. This exercise is a driver to help me at least become conversant in large-scale operations. Second, the interconnected weave of security guarantees in CCF is quite complex, and as I'm still learning about it, I can't yet say that I'm aware of and understand it all. There are gaps. Further, this is a 0.1 proposal, without any previous review. I don't claim that it is sound. My purpose is to generate discussion around these ideas.

If you are still reading this, great! Let me get right to it then. Here is a diagram that is intended to capture the current design. It's based on a diagram in the CCF documentation. I will use it as a basis as I develop these ideas. I'll call this the open design.

The "Open" Design

image

As I've said, this is as it is today. I won't dwell on it.

Next up is a configuration that @eddyashton mentioned in the CCF chat room. (I hope I represent his design correctly.) In this network configuration, only users that are associated with a member have access to that member's network interfaces. This is accomplished through the member’s network infrastructure and would not require any design or implementation change for CCF.

The "Network Segment" Design

image

What does it achieve? In my miserly design exercise, I don't like the prospect of my friendly consortium partner having their 1.2 million customers channel transactions through my resources. Let that traffic go through their consortium network nodes! Of course, at some point, my systems will carry some of the load of their customers' transactions, given that we have a common ledger, but at least the initial phase should be handled by their systems. (I have an idea about how to get paid for that, too. I'll cover it later.)

So far, nothing has been proposed that would require a design or implementation change. But the next variation will. I call this the "CCF Segment" design. It accomplishes what the "Network Segment" design does, but CCF does it natively. In addition to what "Network Segment" did via network configuration, this design lays part of the foundation for the central idea of this proposal: the CocoNet, which we'll come to soon.

The "CCF Segment" Design

image

The core shift of the "CCF Segment" design is that users become fully associated with and managed by the member. The users table is private to the member. There is no proposal, nor is there voting on admission by other consortium members. (I'm still perplexed by the concept that members would have the practical means of vetting other members' proposed "users". With a decentralized, heterogeneous consortium, how would that process work? What is the common trusted authority for identity and authorization across the consortium? Beyond validating the presented cert, how would voting members know if AAAAB3NzaC1yc2EAAAADAQABAAABAQD should be allowed to call the application interface?)

A new "user" entity becomes known throughout the consortium through "account" creation in the common ledger. The user's status is maintained in the ledger globally, and privately within the member's identity and authorization scheme, whatever that may be. The member is vouching for the identity of the user. If there are bad transactions from that pubkey, the responsibility is partly the member's: they designed and run their identity and authorization systems. There is more to be said here, but I will move on.
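To make that concrete, here is a minimal sketch of what such a member-managed user record might look like. This is only an illustration of the idea; the table, field names, and helper are hypothetical and not part of CCF.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class UserAccount:
    """Hypothetical ledger record for a member-managed user."""
    user_pubkey: str        # the user's public key / cert fingerprint
    sponsoring_member: str  # ID of the member vouching for this user
    status: str = "active"  # maintained globally in the common ledger

# The common ledger records only sponsorship and status; identity and
# authorization details stay private to the sponsoring member's own systems.
ledger_users: Dict[str, UserAccount] = {}

def register_user(member_id: str, user_pubkey: str) -> None:
    """Member-local 'account creation': no consortium-wide vote required."""
    ledger_users[user_pubkey] = UserAccount(user_pubkey, member_id)
```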

And now the CocoNet proposal! Thanks for staying with me so far. Please take a look at this diagram:

Edit: This idea is flawed. It has been retracted. Please see my comments below.

The CocoNet phase 1: Consortium spin-up

image

This is quite like today's design. But notice there are no users. This CCF network was loaded and spun up without any users (in the current design's sense). And perhaps your interest is piqued by the nodes labeled "gateway". Let's view the next phase of the CocoNet. Users are about to arrive!

The CocoNet phase 2: Member Lithium arrives

Edit: This idea is flawed. It has been retracted. Please see my comments below.

image

This is a core element of the proposal: Lithium operates a fully functional CCF sub-network, with the significant caveat that it has to connect to its counterpart gateway in the core consortium network for its sub-CCF network to be joined --- for it to receive the CocoNet keys. At this point, with just one member's users, the consortium is fully operational. This proposal leaves most of the current design and implementation intact. The operation of the subnets is very much like the operation of the core consortium network. It's fractalish. Let's bring in Krypton's user base. There are many!

The CocoNet phase 3: Member Krypton arrives with its large user base

Edit: This idea is flawed. It has been retracted. Please see my comments below.

image

Replication occurs across the entire consortium and member subnets. The same algorithms are applied across all nodes. The subnets collect transaction blocks and forward them to the gateway in the core consortium. In this role, the subnet 0 node (the subnet's primary node) also behaves like a common node relative to the core consortium, where the core ledger update occurs. The results are fed back through the gateways to the subnets. The whole is made of fractal-like subnets, but operates as if it were one CCF network.

eddyashton commented 4 years ago

Thanks for the detailed descriptions! There are definitely some details here we need to explore further.

Questions

I have a few questions that would help us better understand your coconet proposal:

I think your proposal is similar to sidechains in the Ethereum ecosystem, where the sub-networks are fundamentally independent but, in normal operation, work with authoritative data from the central network and store some results (or a derivation of them) on the central network. Does this sound correct?

As an initial response I'd like to explore how much of this functionality may already be possible in CCF.

Fair billing

The short answer here is that CCF should be unopinionated on how consortium members allocate and pay for the service's resources. They may decide that costs are apportioned per-node (each member must run N nodes), per-member (each member pays a flat fee, and a separate operator provides minimum performance guarantees), per-transaction (the app tracks a transaction quota), or perhaps other models we haven't thought of. We believe we can support all of these under the current design.
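As a rough illustration of how such apportionment models could look at the app level (the model names and pricing function are invented here, not anything CCF prescribes):

```python
from enum import Enum, auto

class BillingModel(Enum):
    PER_NODE = auto()         # each member must run N nodes
    PER_MEMBER = auto()       # flat fee; a separate operator guarantees performance
    PER_TRANSACTION = auto()  # the app tracks a per-member transaction count

def member_bill(model: BillingModel, nodes_run: int, tx_count: int,
                node_cost: float, flat_fee: float, per_tx_price: float) -> float:
    """Toy illustration of how each apportionment model might price a member."""
    if model is BillingModel.PER_NODE:
        return nodes_run * node_cost
    if model is BillingModel.PER_MEMBER:
        return flat_fee
    return tx_count * per_tx_price
```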

Member-to-user association

We can track these within app logic rather than purely by restricting node access. As I see it, your proposed requirements are:

This is possible in the current code. The governance script would allow members to add new users without a vote, but require that these proposals also associate the new user with the proposing member (by listing the member's ID in the user's user_data). Then the app logic can make a choice based on the member ID that is 'responsible' for the incoming transaction - perhaps decrementing a governance-controlled per-member transaction quota, perhaps only revealing this organisation's private data, perhaps counting the per-org transactions to be examined later.
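A minimal Python sketch of that flow, assuming hypothetical user_data and quota tables rather than real CCF KV maps or APIs:

```python
# Hypothetical tables; in a real CCF app these would live in the KV store.
user_data = {}        # user_id -> {"sponsoring_member": member_id}
member_tx_quota = {}  # member_id -> remaining governance-controlled quota

def handle_transaction(user_id: str) -> bool:
    """Charge an incoming transaction to the member responsible for the user."""
    member_id = user_data.get(user_id, {}).get("sponsoring_member")
    if member_id is None:
        return False  # unknown or unsponsored user: reject
    remaining = member_tx_quota.get(member_id, 0)
    if remaining <= 0:
        return False  # the sponsoring member's quota is exhausted
    member_tx_quota[member_id] = remaining - 1
    return True       # proceed with the app's actual business logic
```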

Sub-networks

I think the safe way to implement org-specific networks or side chains is to reason about them as ordinary users. It may be that the hosted app is a central, authoritative database, and each organisation mirrors that on a local system for their own workloads (and perhaps that local mirror-system is also a CCF service). But if we treat these mirrors as users (ie - some authorised proxy is fetching state from the central service, and submitting write requests produced by the local service) then we understand the consensus model (the user should generally only work with globally committed values, anything else may be rolled back). If we start talking about sub-networks, or gateway nodes that are part of 2 networks (?), then we need to think carefully about what state they're sharing and how consensus is reached over it. In particular, this risks the stumbling block of many other blockchain systems where private data is stored only in specific nodes or sub-networks, making cross-org consensus and transactions more awkward.
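For illustration, a sketch of the "mirror as user" pattern described above: an authorised proxy that only consumes globally committed state and submits writes like any other user. The central_service and local_mirror objects are placeholders, not real CCF client APIs.

```python
import time

def sync_mirror(central_service, local_mirror, poll_seconds: float = 1.0) -> None:
    """Authorised proxy loop: mirror only globally committed state locally,
    and push locally produced write requests back up as an ordinary user."""
    while True:
        # Read only values the central service reports as globally committed;
        # anything less than that may still be rolled back.
        committed = central_service.read_committed()
        local_mirror.apply(committed)
        for write_request in local_mirror.pending_writes():
            central_service.submit(write_request)  # ordinary user write path
        time.sleep(poll_seconds)
```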

User identity management

The member responsibility for matching a user cert with a real-world identity will vary significantly between CCF instances. In the organisation consortium case you discuss, it makes sense to build a member-to-user mapping as outlined above, tracking a member -> set-of-users relationship, with each member solely responsible for verifying their proposed users.

But CCF can also be used for public scenarios, where the members have a shared interest in providing a service but the users may be unaffiliated. Since we're currently using TLS certs for auth, we fall back to existing solutions for pairing certs with identities. Perhaps the members fetch the user's cert from some external trusted PKI, perhaps they only accept certs provided in person. We'd like to make this easier, by exposing efficient cert-inspection helpers to the scripting environments, and perhaps moving away from cert-identities (instead relying on something like Azure Active Directory to authenticate users). These are still early discussions.

Sorry this response is a little dense - I had several things I wanted to mention and haven't organised them very well!

douglasism commented 4 years ago

@eddyashton Eddy, again thank you for taking the time to respond with such detail.

This is my short response. More to follow soon.

An original goal of my proposal was to address my feature requests by reapplying the current design such that the analysis of the proposal would be straightforward. Sort of like CCF(CCF) = CCF. That's why I was so proud of the perceived fractalization. But as I was going about answering your excellent questions, I realized my idea that a node could be a replicant gateway, that is, participate in multiple replication circuits, was not as simple as I had thought. After reading several more research papers referenced in the TR, I've decided that proving equivalence of PBFT with multiple subnets spinning on the same state machine might garner me the A.M. Turing award. Certainly not as clear as CCF(CCF) = CCF. Not something I want to pursue.

I am not giving up, though. Tomorrow I will answer your questions directly and then proceed to see if I can salvage any parts of my beautiful diagrams.

douglasism commented 4 years ago

I retract the CCF(CCF) idea.

After more review, I've learned that the idea that a subnet of service nodes could reach sub-consensus and then present, via the primary, the view to the supernet for ultimate consensus deviates substantially from the protocols (and now makes my head spin). And while the question of whether a (sub) consortium can be a node in another consortium might be fun to discuss at a campfire, it does not belong in a constructive proposal to CCF. My intent was the reuse of the already vetted architecture, not a significant departure from it.

I will make explicit retraction notes in the original post.

I want to continue a discussion of the proposal's motivations, though. My next post will be answers to @eddyashton’s questions.

douglasism commented 4 years ago

Answers to @eddyashton’s questions:

Is it a sidechain?

After my previous elaboration, it should now be clear that my debunked "subnet consensus" idea was not an Ethereum-like sidechain. If that's not the case, please let me know.

Fair billing

Framework adopters will judge CCF by how easy it is to design and build services. Given that rewards and incentives are core elements of similar systems, and that consortium members will bear potentially disproportionate infrastructure costs, my intuition is that some native CCF accounting support would make building and deploying services easier for consortia -- and for cloud service providers. But I don't speak with authority here, so I won't belabor the point.

Member-to-user association

Great! The ability for CCF to support these service-oriented notions is a testament to its architectural power.

Subnets

I agree with you, Eddy. (You were kind to refer indirectly to my idea as “awkward.” I might have used stronger language.)

The other aspect of subnets is the general consortium network topology question. I keep bringing this up, in part, due to my direct experience with infrastructure projects that wait too long to validate topology assumptions with actual customers, especially when deployment involves multiple customers’ networks.

It's interesting that in Project Bletchley's network topology, users come in through a load balancer and interface with nodes that are (emphatically) not mining. Separate, peered consortium subnets complete the transactions.

image
Source: https://www.microsoft.com/developerblog/2017/04/18/using-a-layer-7-proxy-for-ethereum-blockchain/

User identity management

My comments, tomorrow.

achamayou commented 4 years ago

@douglasism the question of how billing models would work is not something we necessarily want to define completely, but my intuition at least is that in many if not most cases, the members would stay out of it completely.

A key feature of TEEs is the ability to have an untrusted operator spinning up and maintaining the network (provisioning replacement nodes, storing the ledger etc). If the members build and deploy an application that contains a counter per user (cost incurred per user, based on Tx count, KV storage size or any other relevant metrics), the operator can charge users directly for their usage.
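A toy sketch of such per-user accounting, with invented counters and prices purely to illustrate the idea:

```python
from collections import defaultdict

# Hypothetical per-user counters the app could maintain in its KV store.
tx_count = defaultdict(int)  # user_id -> number of transactions
kv_bytes = defaultdict(int)  # user_id -> bytes of KV storage attributed

def record_usage(user_id: str, write_set_bytes: int) -> None:
    """Called by the app for each committed transaction."""
    tx_count[user_id] += 1
    kv_bytes[user_id] += write_set_bytes

def operator_invoice(user_id: str, price_per_tx: float, price_per_byte: float) -> float:
    """The untrusted operator bills the user directly from the app's counters."""
    return tx_count[user_id] * price_per_tx + kv_bytes[user_id] * price_per_byte
```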

Members never need to provision infrastructure or pay for it, users pay the operator according to usage.

douglasism commented 4 years ago

@achamayou Thanks for your response. I think enough has been said on this topic for now.

douglasism commented 4 years ago

@achamayou @eddyashton Hello Amaury and Eddy. I reopened this because I thought of a way to reframe the discussion in an actionable way. With #851, I realized that I didn't know what the actual design targets were for #members, #nodes, #users, and #transactions per second (successful). I may be imagining much higher numbers, which would cause my views on scalability to diverge from the project's goals. Are these design targets documented?

On a related note: the TR's Section VIII (Evaluation) says that the CCF benchmarks used one client. Are there details on how this client interacted with the service? I searched but couldn't find the config for the 9-server test. Would you give me a hint on where this benchmarking test code is in the repository? I'm assuming that there is a configuration switch -- the CCF test framework is excellent.

douglasism commented 4 years ago

Another significant metric is KV store size, in memory and non-volatile storage, as @achamayou mentioned.

achamayou commented 4 years ago

@douglasism the benchmark in the TR is running the smallbank application. To run it yourself, the easiest way is to do a local build (see the instructions in the docs under Getting Started), run ./tests.sh -N (which will list the tests), and run the one that says smallbank.

To distribute the test across machines, copy-paste the Python command line generated by the test (first line) and add -n hostname for each host you want to start a node on and -cn hostname for each host you want to start a client on. You need passwordless (i.e. key-based) SSH access from the launch box to the machines where you want to spin stuff up.

The "Perf" jobs of the CI run this across a 4-VM setup on every commit of every PR. Feel free to have a look at .azure-pipelines and what it includes if you are interested in the details of how that works.

achamayou commented 4 years ago

The smallbank client and app code is under https://github.com/microsoft/CCF/tree/master/samples/apps/smallbank

douglasism commented 4 years ago

@achamayou Thanks for the pointers and your continued patience. I was familiar with smallbank and the tests, but what I had missed was the Azure Pipelines setup. I should have found that myself, though. (The test and CI infrastructure are awesome, btw.) Also, reminding myself again to READ THE CCF DOCS, I reviewed the excellent performance and metrics support that CCF provides. It will be highly beneficial when developing apps.

Broadly, the more I investigate the performance/scalability question, the more I understand that finding the answer is not simple. As you know, there are many factors to consider: network & host config, CCF config, consensus algorithm & config, CCF design/impl, service design/impl, transaction characteristics, etc. When developing a service, one approach would be to use the smallbank app as a baseline and build metrics while varying service-environment factors. With this data in hand, smallbank is switched out, the service is plugged in, and the same perf tests are run to determine the service's perf delta across all of its variants.
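For what it's worth, the comparison I have in mind is nothing more than this (variant names and numbers are made up):

```python
def perf_delta(baseline_tps: dict, service_tps: dict) -> dict:
    """Relative throughput change of the service vs the smallbank baseline,
    per environment variant (node count, consensus config, etc.)."""
    return {
        variant: (service_tps[variant] - baseline_tps[variant]) / baseline_tps[variant]
        for variant in baseline_tps if variant in service_tps
    }

# Made-up numbers, just to show the comparison.
baseline = {"3_nodes": 20000, "7_nodes": 18000}
service = {"3_nodes": 15000, "7_nodes": 12500}
print(perf_delta(baseline, service))  # fractional delta per variant
```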

But even if perf is ultimately a service metric, it's in the interest of CCF (and its adopters) for CCF to perform larger tests than are currently performed (in my limited project view). For example, and based on my reading, the core Raft algorithmic performance is not adversely affected by a significant increase in the number of nodes (presuming the network does not become saturated). Does this hold true for CCF at 50 nodes? At 100? At what point does CCF overwhelm the network? How does the leader (primary) hold up? Are there practical CCF limits?

SuperBank

My plan is to build and run a SuperBank test. It shouldn't be too difficult given the existing CCF test infrastructure. I just have to figure out how to pay for the Azure resources. Hmmm...

achamayou commented 4 years ago

@douglasism the service perf delta is not fixed; some (many, in fact) of the costs are broadly proportional to the write-set size of the transactions generated by the application, since they have to be serialised, hashed, integrity-protected, etc. There are some sources of non-linearity too, since the incremental cost of the KV size changes dramatically once you reach the EPC size, for example. It's not impossible to build a cost model, but it's not a trivial undertaking.
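To illustrate the shape of such a cost model (all constants are invented; only the proportionality and the non-linearity at the EPC boundary matter):

```python
def tx_cost(write_set_bytes: int, kv_size_bytes: int, epc_bytes: int,
            fixed_cost: float = 1.0, per_byte: float = 0.01,
            paging_penalty: float = 4.0) -> float:
    """Toy model: serialisation, hashing, and integrity protection scale with
    the write set; costs jump once the KV store no longer fits in the EPC."""
    cost = fixed_cost + per_byte * write_set_bytes
    if kv_size_bytes > epc_bytes:
        cost *= paging_penalty  # invented multiplier, only to show the non-linearity
    return cost
```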

In theory, Raft algorithmic replication cost does not increase with the number of nodes, but in practice it does, because replication over TCP is essentially linear in the number of nodes, albeit with a fairly small constant factor. This is purely a transport cost; serialisation and integrity protection of replicated entries of course only happen once. With a very controlled network, one could imagine a replication mechanism based on UDP multicast which wouldn't suffer from this, but this isn't the kind of network CCF is targeting.
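A back-of-the-envelope sketch of that transport cost, with illustrative numbers:

```python
def primary_replication_bandwidth(nodes: int, entry_bytes: int,
                                  entries_per_sec: float) -> float:
    """Bytes/sec the primary sends over per-follower TCP streams. Serialisation
    and integrity protection happen once; only the transport is per follower."""
    followers = nodes - 1
    return followers * entry_bytes * entries_per_sec

# Illustrative: 1 KiB ledger entries at 10,000 entries/sec.
for n in (3, 10, 50, 100):
    print(n, primary_replication_bandwidth(n, 1024, 10_000) / 1e6, "MB/s")
```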

The point at which CCF overwhelms the network depends on... the network? The TR tests were run across two DCs, one in East US (Virginia, I believe) and one in West Europe (Netherlands, I think, or thereabouts), using Azure vnet peering (i.e. the traffic traveled through the Azure backbone, not over the internet). That ran with little impact on throughput, but with the latency impact you would expect from a transatlantic connection.

I'd rather not make a prediction about 100 nodes without data to back it up. I think it's worth keeping two things in mind though: the only reasons you'd ramp up the number of nodes are:

  1. resiliency, i.e. increasing f, the number of nodes that can fail, where Raft requires 2f + 1 nodes in total. Depending on your opinion on the likelihood of single-box failure vs network/power/DC failure, having dozens of nodes in the same location may or may not be an efficient choice.
  2. read scaling. All writes happen on the primary in Raft. Reads can happen anywhere. If you need to read a lot, then having a lot of nodes will help (a lot). If you mostly write, more nodes is just going to slow you down somewhat. This is slightly simplified; there are some other bits and pieces that scale with followers, like client signature verification if you use it. It's still roughly accurate though. (A rough sketch of both points follows this list.)
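```python
def nodes_required(f: int) -> int:
    """Raft needs 2f + 1 nodes to tolerate f simultaneous node failures."""
    return 2 * f + 1

def approx_read_capacity(nodes: int, reads_per_node_per_sec: float) -> float:
    """Reads can be served by any node, so read capacity grows with node count."""
    return nodes * reads_per_node_per_sec

def approx_write_capacity(primary_writes_per_sec: float, nodes: int,
                          per_follower_overhead: float = 0.01) -> float:
    """All writes go through the primary; extra followers only add a little
    overhead there (the overhead constant is invented, not a measurement)."""
    return primary_writes_per_sec * max(0.0, 1 - per_follower_overhead * (nodes - 1))
```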

About SuperBank, you should be able to get some free Azure credits to begin with at least: https://azure.microsoft.com/en-us/free/ To get SGX VMs, I suggest UK South where new capacity has been added recently. It's possible that there's space left in East US/West Europe, but I'm not sure about that. I'm also happy to put you in touch with our PM if you want to have a business discussion with him.

douglasism commented 4 years ago

@achamayou As usual, a very informative response. (And it was funny too -- I almost choked on my apple when I read your network saturation comment...true, true.) Last night I was thinking about the read-scaling you pointed out, along with its inverse correlation to consensus perf. I can imagine a service provider dynamically scaling this way if the service were instrumented to report read/write statistics. Continuing along this theme, a service could partition its user interfaces such that read-only app transactions were targeted towards followers (more importantly: not toward the leader). I guess writes (especially signed ones) could be optimized this way too, by off-loading transaction pre-processing (e.g. sig checks) to the followers. Of course, you know this already. I’m just thinking in public.

achamayou commented 4 years ago

@douglasism emitting read/write (and also conflict) stats is definitely something we want to do. We already have a hinting mechanism: https://microsoft.github.io/CCF/developers/logging_cpp.html?highlight=maywrite#rpc-handler, which allows handlers to specify what they do, or lets the client suggest what they think they want to do. We also already implement the signature verification offloading on the secondaries (there's not much other pre-processing that can be offloaded, unfortunately).

A missing piece is probably a small client library in at least the most popular client languages, which keeps track of the status of the nodes and pre-emptively sends to either the primary or the secondary, to avoid paying re-routing costs in the network.
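For illustration, a minimal sketch of such a routing client, with placeholder node handles and status tracking (not an existing CCF client library):

```python
import random

class RoutingClient:
    """Keeps track of which node is primary, sends writes there and reads to
    any node, to avoid paying re-routing costs inside the service."""

    def __init__(self, nodes):
        self.nodes = list(nodes)      # placeholder node handles
        self.primary = self.nodes[0]  # refreshed out-of-band in a real client

    def update_primary(self, node) -> None:
        self.primary = node

    def send(self, request, may_write: bool):
        """Route based on the caller's read/write hint."""
        target = self.primary if may_write else random.choice(self.nodes)
        return target.submit(request)  # placeholder submit call
```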

douglasism commented 4 years ago

Closing, at least until I have an actionable enhancement request.