spacemeshos / poet

Spacemesh PoET service reference implementation
MIT License

Poet 2.0 architecture and design #367

Open pigmej opened 1 year ago

pigmej commented 1 year ago

To improve Poet resiliency even further and to scale better, we need to make architectural changes in Poet.

The idea currently proposed by @noamnelke looks as follows: PoET Architecture

PoET Service Architecture.excalidraw.txt

noamnelke commented 1 year ago

This proposal also includes some architectural changes to the node. I'll explain it all in more detail when I have more time (since I don't think this is feasible before genesis), but I wanted to have this placeholder out.

poszu commented 1 year ago

@pigmej @noamnelke An updated proposal of the new Poet architecture (the image has an Excalidraw scene embedded). poet-v2-arch

Short summary of the Registration service API as of now + plans:

source: https://github.com/spacemeshos/poet/blob/develop/rpc/api/v1/api.proto

Open questions

Should we stick with levelDB in the registration service?

Is it required to be able to horizontally scale instances of the Registration service?

It would require:

I think it's not worth the effort, at least not at this stage. A single registration service will be faster and should easily be enough to serve hundreds of thousands of /Submit requests within the cycle gap window.

dshulyak commented 1 year ago

i think using kafka or anything is a gross over-engineering. publishing a single message every two weeks can be done using grpc or native http/2 easily. if you want redundancy, e.g. get membership from more than one service, it can be achieved by connecting to more than one registration service.
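A minimal sketch of the redundancy idea above, assuming a worker is configured with several registration services and takes the membership from the first one that answers. `MembershipSource`, `fetchMembership` and `demoFallback` are hypothetical names for illustration only; they are not part of the poet codebase, and a real client would wrap a gRPC or HTTP/2 call instead of a plain function.

```go
package main

import (
	"errors"
	"fmt"
)

// MembershipSource is a hypothetical client-side view of one registration
// service: given a round ID it returns that round's membership root.
type MembershipSource func(round string) ([]byte, error)

// fetchMembership tries each configured service in order and returns the
// first successful answer. If every service fails, all errors are joined.
func fetchMembership(round string, sources ...MembershipSource) ([]byte, error) {
	if len(sources) == 0 {
		return nil, errors.New("no registration services configured")
	}
	var errs []error
	for i, src := range sources {
		root, err := src(round)
		if err == nil {
			return root, nil
		}
		errs = append(errs, fmt.Errorf("service %d: %w", i, err))
	}
	return nil, errors.Join(errs...)
}

// demoFallback simulates one unreachable service and one healthy service.
func demoFallback() ([]byte, error) {
	down := func(string) ([]byte, error) { return nil, errors.New("unreachable") }
	up := func(round string) ([]byte, error) { return []byte("root-" + round), nil }
	return fetchMembership("7", down, up)
}
```

This sketch only covers availability. It does not compare answers from different services; whether that is needed is a design question the thread leaves open.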

if you will insist on doing "kafka", please support simple mode for system tests and standalone modes.

poszu commented 1 year ago

@dshulyak

i think using kafka or anything is a gross over-engineering.

What is gross about it? Please explain why if you think it's not the right tool for the job.

publishing single message every two weeks can be done using grpc or native http/2 easily

We don't want to expose worker servers to the Internet at all. The workers should instead pull data from the Registration service.

if you will insist on doing "kafka", please support simple mode for system tests and standalone modes.

Yes, support for a standalone mode is planned: https://github.com/spacemeshos/poet/issues/365:

💡 The option to run them together in a single process (aka a standalone mode) could probably stay for cloud deployments, go-sm unit- and system-tests etc.

dshulyak commented 1 year ago

What is gross about it? Please explain why if you think it's not the right tool for the job.

because problem is very simple. so the requirement that everyone who wants robust poet needs to maintain kafka cluster looks gross to me.

The workers should instead pull data from the Registration service.

i don't understand the difference. instead of "kafka" it can be pulled from this registration service. if you want to keep single URL in workers this registration service can aggregate data from multiple frontends

poszu commented 1 year ago

because problem is very simple. so the requirement that everyone who wants robust poet needs to maintain kafka cluster looks gross to me.

It's a fair point. Using an MQ would complicate the deployment a little by requiring setting up the message queue server. But is it that bad? Their deployment is usually straightforward. I proposed Kafka because this is an MQ I worked with before, but perhaps there are better/simpler solutions.

if you want to keep single URL in workers this registration service can aggregate data from multiple frontends

Isn't it re-inventing an MQ?

Assuming we shoot down the idea of an MQ, what other good options do we have?

  1. A GRPC API, the registration service working as a server providing a way to:
    • pull membership root for the next round to execute,
    • post a new proof when a round is finished.
  2. Do you have any other ideas?
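
To make option 1 concrete, here is a rough sketch of how the two operations could look from the worker's side. It is an assumption-laden illustration, not the actual API: `RegistrationAPI`, `memRegistration`, `runRound` and `demoRound` are hypothetical names, rounds are identified by plain strings, and an in-memory implementation stands in for the gRPC server so the flow can be shown end to end.

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// RegistrationAPI is a hypothetical interface with the two operations
// from the list above. The worker is always the caller, so it never has
// to accept inbound connections.
type RegistrationAPI interface {
	// MembershipRoot returns the membership root for the round to execute.
	MembershipRoot(ctx context.Context, round string) ([]byte, error)
	// SubmitProof posts the proof once the round is finished.
	SubmitProof(ctx context.Context, round string, proof []byte) error
}

var errRoundNotReady = errors.New("membership root not available yet")

// memRegistration is an in-memory stand-in for the registration service.
type memRegistration struct {
	mu     sync.Mutex
	roots  map[string][]byte
	proofs map[string][]byte
}

func newMemRegistration() *memRegistration {
	return &memRegistration{roots: map[string][]byte{}, proofs: map[string][]byte{}}
}

// closeRound records the membership root once registration has closed.
func (r *memRegistration) closeRound(round string, root []byte) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.roots[round] = root
}

func (r *memRegistration) MembershipRoot(ctx context.Context, round string) ([]byte, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	root, ok := r.roots[round]
	if !ok {
		return nil, errRoundNotReady
	}
	return root, nil
}

func (r *memRegistration) SubmitProof(ctx context.Context, round string, proof []byte) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.roots[round]; !ok {
		return errRoundNotReady
	}
	r.proofs[round] = proof
	return nil
}

// runRound is the worker side: pull the root, compute a proof, post it back.
func runRound(ctx context.Context, api RegistrationAPI, round string, prove func(root []byte) []byte) error {
	root, err := api.MembershipRoot(ctx, round)
	if err != nil {
		return err
	}
	return api.SubmitProof(ctx, round, prove(root))
}

// demoRound wires the pieces together with a placeholder prove function.
func demoRound() ([]byte, error) {
	reg := newMemRegistration()
	reg.closeRound("1", []byte("root-1"))
	prove := func(root []byte) []byte { return append([]byte("proof-of-"), root...) }
	if err := runRound(context.Background(), reg, "1", prove); err != nil {
		return nil, err
	}
	return reg.proofs["1"], nil
}
```

The `prove` callback is a placeholder for the actual PoET computation, which is out of scope here.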

noamnelke commented 1 year ago

I'm working on my own proposal based on that by @poszu (mostly similar).

In the meantime, I'll say that I generally agree with @dshulyak that there must be an "easy mode" to run a PoET server, for tests but also for anyone who wants to easily run a private server for themselves.

I think we can satisfy everyone if we design PoET as a few building blocks that can be used in different ways:

Then we can use those in "standalone mode" by building some scaffolding around them that implements the API by calling into these modules from a single go executable.

For "infra mode" we'll build different scaffolding that can use a MQ to communicate between several different executables and data stores.

Both of these "modes" will be used internally - standalone in tests and infra mode for the actual PoET service that Spacemesh operates. There's some risk that bugs will exist in one version and not the other, but we can try to minimize this by keeping the scope of the separate scaffolding minimal.
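One possible shape for this split, sketched under assumptions: the shared building blocks talk to each other through a small interface, and each mode only provides its own implementation of that interface. `Transport`, `RoundInput`, `chanTransport`, `runWorker` and `runStandalone` are hypothetical names. Only the in-process (standalone) variant is shown; an infra-mode variant would implement the same interface on top of an MQ client.

```go
package main

import "fmt"

// RoundInput is what the registration side hands over to a worker.
type RoundInput struct {
	Round          string
	MembershipRoot []byte
}

// Transport is the hypothetical seam between the building blocks.
// Standalone mode uses an in-process channel; infra mode could back the
// same interface with a message queue (not shown here).
type Transport interface {
	Publish(in RoundInput) error
	Next() (RoundInput, bool) // false once the transport is closed and drained
}

// chanTransport is the standalone-mode scaffolding: a buffered channel.
type chanTransport struct{ ch chan RoundInput }

func newChanTransport(buf int) *chanTransport {
	return &chanTransport{ch: make(chan RoundInput, buf)}
}

func (t *chanTransport) Publish(in RoundInput) error {
	select {
	case t.ch <- in:
		return nil
	default:
		return fmt.Errorf("transport full, cannot publish round %s", in.Round)
	}
}

func (t *chanTransport) Next() (RoundInput, bool) {
	in, ok := <-t.ch
	return in, ok
}

func (t *chanTransport) Close() { close(t.ch) }

// runWorker is mode-agnostic: it only depends on the Transport interface.
func runWorker(t Transport, execute func(RoundInput) string) []string {
	var proofs []string
	for {
		in, ok := t.Next()
		if !ok {
			return proofs
		}
		proofs = append(proofs, execute(in))
	}
}

// runStandalone wires registration output and the worker in one process.
func runStandalone(rounds []RoundInput) ([]string, error) {
	t := newChanTransport(len(rounds))
	for _, r := range rounds {
		if err := t.Publish(r); err != nil {
			return nil, err
		}
	}
	t.Close()
	execute := func(in RoundInput) string {
		// Placeholder for the real proof computation.
		return fmt.Sprintf("proof(%s,%x)", in.Round, in.MembershipRoot)
	}
	return runWorker(t, execute), nil
}
```

Keeping `runWorker` free of any mode-specific code is one way to limit the risk mentioned above that a bug exists in one mode and not the other: only the `Transport` implementations differ.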

Does this make sense to you guys, or sound over engineered and overly complex?

dshulyak commented 1 year ago

i don't think that supporting 3 modes of operation is a good idea. if there could be one that is reliable and efficient we should be using it everywhere.

It's a fair point. Using an MQ would complicate the deployment a little by requiring setting up the message queue server. But is it that bad?

so it means that kafka-based mode will be reliable, and for everything else we will say that it may not be that reliable. so everyone who needs to help maintain poet will have to learn how to work with and debug kafka. and whoever deploys it will also need to have a basic understanding of what can go wrong.

A GRPC API, the registration service working as a server providing a way to:

this is what i have in mind. i don't think that downloading data periodically on notification implies reinventing mq.

Does this make sense to you guys, or sound over engineered and overly complex?

it does seem to me unnecessarily complex

pigmej commented 1 year ago

Here are a few requirements from my side that definitely should be considered: