Closed jm-clius closed 8 months ago
@richard-ramos @chaitanyaprem can you check that the guidelines here are clear enough, given our discussion on https://github.com/waku-org/nwaku/issues/1914. With the idea that go-waku drives Status dogfooding, I think the best would be for go-waku to also ensure that this fleet gets set up, so feel free to let me know if there's anything unclear or if you have some other suggestions!
@Ivansete-status do we have a guide/tutorial/example somewhere on how the PostgreSQL configuration is done on nwaku?
Hey @jm-clius !
We don't have an explicit doc yet, but the following repo has an example of how an nwaku node is connected to a Postgres DB, all in a docker compose file --> https://github.com/waku-org/nwaku-compose/blob/master/docker-compose.yml
DB config file: https://github.com/waku-org/nwaku-compose/blob/master/postgres_cfg/postgresql.conf
Example of the URL the nwaku node uses to connect to Postgres: https://github.com/waku-org/nwaku-compose/blob/176657f4eb9e0a16e9b81b73cada268172667385/run_node.sh#L30
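For quick orientation, a minimal sketch of the relevant part, assuming the same --store-message-db-url flag that nwaku-compose uses; the user, password, host and database name below are placeholders, not the fleet's values:

# Hedged sketch: point an nwaku store node at a Postgres instance via a DB URL.
./build/wakunode2 \
  --relay=true \
  --store=true \
  --store-message-db-url="postgres://<user>:<password>@<db-host>:5432/<db-name>"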
Thank you for these guidelines. I looked at them and they do look clear to me!
Just to confirm: will status.sharding.store have DiscV5 enabled to find peers, or will it only rely on the initial connection to the nodes returned by the status.sharding.bootstrap DNS discovery URL?
Thanks for the guidelines @jm-clius. So far, these are clear to me!
Just to confirm: status.sharding.store will have DiscV5 enabled to find peers?
Yes, I think they should have discv5 enabled.
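To make that concrete, here is a hedged sketch of how a store node could combine DNS discovery (pointing at the bootstrap node list) with discv5. The flag names are assumed from nwaku's CLI, and the enrtree URL is a placeholder:

# Hedged sketch, not the actual fleet config: DNS discovery for the initial
# bootstrap list, plus discv5 for ongoing ambient peer discovery.
./build/wakunode2 \
  --relay=true \
  --store=true \
  --dns-discovery=true \
  --dns-discovery-url="enrtree://<public-key>@<bootstrap-node-list-domain>" \
  --discv5-discovery=true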
Some questions:
- Why is this fleet called sharding and not shards, as is the wakuv2 one?
- How is this different from the wakuv2.shards fleet?
- Is this intended for public use, and if not, is it intended for development?
- Do you actually want two fleets or are you talking about two types of nodes within one fleet?
- What does "shared PostgreSQL backend" mean?
Based on the description I understand it's two separate fleets, each one of them having their own DNS discovery URL and own settings, but interconnected by having status.sharding.store use the status.sharding.bootstrap DNS discovery URL + discv5.
Why is this fleet called sharding and not shards as is the wakuv2 one?
I guess the name can be changed to match wakuv2 :thinking: no preference here
Answering:
Why is this fleet called sharding and not shards as is the wakuv2 one?
Ah, oversight/assumption. It should be shards. Actually the naming itself is a suggestion from my side, but I think status.shards.store and status.shards.bootstrap make sense.
How is this different from the wakuv2.shards fleet?
wakuv2.shards is indeed the blueprint that the Waku team uses to harden most of the concepts we need for these status.shards.* fleets. The main differences are:
- the store protocol will only be mounted on status.shards.store and light protocols only on status.shards.bootstrap (as set out above)
- status.shards.* will serve real Status communities
- status.shards.* will also configure --protected-topic, which is absent from wakuv2.shards (sketched below)
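As a hedged illustration of that last point (the topic:public-key format follows nwaku's --protected-topic documentation; the shards are the ones mentioned later in this thread and the keys are placeholders):

# One --protected-topic entry per protected shard, as <pubsub-topic>:<public-key>.
./build/wakunode2 \
  --protected-topic="/waku/2/rs/16/128:<hex-encoded-public-key>" \
  --protected-topic="/waku/2/rs/16/256:<hex-encoded-public-key>"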
Is this intended for public use, and if not, is it intended for development?
It's intended for public use in future. For now it is intended for Status Community dogfooding and must belong to Status.
Do you actually want two fleets or are you talking about two types of nodes within one fleet?
Two separate fleets with separate DNS node lists. But these fleets are closely associated in that the status.shards.store fleet must connect to the status.shards.bootstrap fleet. They will also both serve the same communities, so from the community's perspective they form a unit.
What does "shared PostgreSQL backend" mean?
My suggestion for a starting point: "one database host shared across multiple data centers with different latency"
Previously (with SQLite) each store node would store the same copy (duplicate) of the entire history locally. With PostgreSQL we have the opportunity to only store fewer copies of history in a shared database, with multiple producers/writers. Deduplication happens on write to the database. This should improve data reliability. We can of course play with redundancy parameters here and have e.g. a postgresql db per data center. I think for dogfooding a single postgresql instance shared between all store nodes is a reasonable starting point. One of the aims of wakuv2.shards is to benchmark this approach more thoroughly. cc @Ivansete-status
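To illustrate the dedup-on-write idea only (the messages table and id column appear in the query output later in this thread; the payload column, DB_URL and the ON CONFLICT clause are assumptions, not nwaku's actual schema or queries):

# Illustrative only: with a unique key on the message hash, a duplicate INSERT
# from a second store-node writer either fails or is ignored, so only one copy
# of each message is kept even with multiple producers.
psql "$DB_URL" -c "INSERT INTO messages (id, payload)
  VALUES ('2be984aa463c976553b7ab089c6708fa97509449f7359693ae98665fecb1a652', 'cmFuZG9tCg==')
  ON CONFLICT (id) DO NOTHING;"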
Thanks for answering. One thing I want to make clear from the start is that there can't be something like status.shards.store. Our fleet names have two elements, env and stage, like so: ${env}.${stage}. For example wakuv2.prod. There is nothing in any of our automations or naming conventions that allows for a 3-segment fleet name.
If You need both words in the name then it would have to be status.shards-store and status.shards-boot.
My main worry with a shared DB across different data centers is that due to higher network latency the DB will spend more time than necessary locked during writes, due to delays on various control messages.
If You need both words in the name then it would have to be status.shards-store and status.shards-boot.
I'm happy with this naming convention.
My main worry with a shared DB across different data centers
Right! Think this is a good point. I have no real intuition here and this setup would need to be properly dogfooded in any case (we have no postgresql DBs in production yet). Perhaps a better solution here is to e.g. use only two different data centers for now for status.shards-store (3 store nodes in the one, 2 in the other), with each having its own shared postgresql database (2 in total).
TL;DR - let's opt for one postgresql instance per data center.
Sidenote: 5 was a number pulled out of a hat. Could of course also do 6 store nodes, which is more easily divisible between 2 or 3 data centers.
Okay, last question: what is the deadline for this? I would like to assign this task to @yakimant to get him acquainted with infrastructure work, especially in relation to Waku fleets. But that will naturally mean work will progress slower.
I think this is up to @richard-ramos and @chaitanyaprem in terms of driving Status Community dogfooding here. It's probably possible to do some limited initial dogfooding using wakuv2.shards instead, which would give us some more time.
let's opt for one postgresql instance per data center.
I like this approach a lot. We will implement Store synchronization among data centers in the future.
If possible, we should have high availability and replication, but I think this can be added in later stages. For now it is fine to have one Postgres per datacenter.
Re the shards names, neither status.shards-store nor status.shards-boot seems to follow the pattern ${env}.${stage}. What about status-shards-store.prod or status-shards-boot.prod? (cc @jakubgs @jm-clius)
What about status-shards-store.prod or status-shards-boot.prod?
But those are not prod, so why would we call them prod?
What about status-shards-store.prod or status-shards-boot.prod?
But those are not prod, so why would we call them prod?
I set it as prod because "It's intended for public use in future".
In any case, I think we need to add a valid stage label (test, staging or prod) so that we remember in the future the kind of traffic and data the fleet is supporting.
I think we shouldn't pretend a non-prod fleet is prod because it might be prod in the future. If we stabilize this setup then we'll probably create separate new fleet(s) that apply it.
And honestly then proper naming would be more like status-store.prod and status-boot.prod. Though I still don't fully get why you guys think those should be separate fleets. We already have heterogeneous fleets that contain different types of hosts. You can see that in infra-eth-cluster, which deploys eth.staging or eth.prod and contains 3 types of hosts: boot, node, mail.
It's probably possible to do some limited initial dogfooding here using wakuv2.shards instead, which would give us some more time.
I see no problems with using that fleet if it can provide topic protection for /waku/2/rs/16/128 and /waku/2/rs/16/256, store, and discv5! Early stage dogfooding will be mostly focused on testing the behavior of status-go when sharding is applied to communities.
I agree that we shouldn't name them prod then.
I vote for: status-store.test and status-boot.test
Having that kind of segregation I think is good because:
Clear picture of what a fleet contains
That already exists in infra-eth-cluster, see the different files in group_vars.
Different fleets might have different update requirements
I don't see how that's relevant. If they work in tandem as production they are essentially part of the status main fleet, just different components of it. Think of it this way: if an application A requires a database D and a cache C, you don't deploy A, D, and C as separate fleets, you deploy them all as one fleet that contains different groups of servers: A, D, and C.
Different fleets might have different scaling requirements
Also irrelevant. You can scale them separately if you want to. See infra-eth-cluster.
Correct me if I'm wrong, but it seems to me like you are considering bootstrap and store as separate "fleets" purely because they have different utility, but I think that's wrong. I think those are tied together and are part of the same fleet.
It also doesn't make sense infrastructure config-wise, since a fleet repo like infra-status configures a single type of fleet, and that fleet can have multiple stages. But those stages are mostly the same in layout and configuration, just with a different type of usage. Considering the VERY different fleet layout, I'd think using a separate fleet repo like infra-status-shards would make more sense, since trying to square the circle of two very different types of fleets being deployed by the same repo will only make both the Terraform and Ansible configuration needlessly complex.
What do you think?
I think those are tied together and are part of the same fleet.
Agreed. I think from my perspective it's about whatever makes most sense from an infra perspective in terms of management. Indeed, these two "subfleets" are connected into a single service fleet, that acts as a unit to provide services to Status Communities, will most probably always be running the same version, be interconnected, etc.
The differences are:
- store vs bootstrap (different protocols mounted on each node type)
- store connects to bootstrap, but bootstrap also connects to bootstrap
If these differences can be managed from within a single fleet with different node "types", I think it will indeed be much cleaner!
I'd think using a separate fleet repo
Yeah, I'm viewing this as building on the wakuv2.shards example, but with more complexity. Since we are now introducing different node types, I would agree that it has diverged enough to warrant a separate repo.
We had a call with @jakubgs, @jm-clius and @Ivansete-status, here are the notes:

Stakeholders

Hosts
- boot, 6x store, 3x store-db, split evenly into 3 DCs
- status / wakuv2 fleets

Timeline
- boot, then store with sqlite, then store-db, then store connected

DB
- store nodes can write and rely on failed INSERT operation

Code organisation
- shards fleet
- infra-shards repo
- infra-template and fill in missing parts from infra-status and infra-wakuv2
- host types (boot, store, store-db) - Ansible groups, similar to the eth fleet
- test environment (Terraform workspace and Ansible inventory)
- infra-tf-multi-provider
- infra-role-nim-waku for boot and store nodes
- infra-role-postgres-ha for store-db nodes, disable replication parameter

Metrics

Later
- status fleet will be deprecated in the future, in favour of shards
- wakuv2/shards is not a priority, will be deprecated in favour of shards
- INSERT metrics

Thanks for the notes @yakimant ! A couple of comments:
- No need to deprecate wakuv2.shards in the near future. Just de-prioritize it from any maintenance.
- Re the ${env}.${stage} pattern: maybe we can start now with shards.test? I'm open to any name as long as it contains the stage.

Yes, the fleet name will include the stage name too, don't worry!
@richard-ramos, @jm-clius, quick question:
why do we need an enrtree DNS record for storage nodes if it is not referenced?
Both bootstrap and storage nodes reference the bootstrap node list, as I understand.
why do we need an enrtree DNS record for storage nodes if it is not referenced? Both bootstrap and storage nodes reference bootstrap node list as I understand.
Correct. I think it is so that Community Nodes can reference the store nodes separately from the bootstrap nodes.
Hanno adds:
My understanding from Richard is: they bootstrap against a list of nodes and they create a set of store nodes from a different list (in fact, they hard-code this store list currently, but should do a separate lookup to populate this dynamically). If the store nodes are mixed into the bootstrap list, they will automatically form part of the bootstrap process, which we'd like to avoid in order to create a better separation of interests in the fleet services.
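A hedged sketch of what that separation looks like operationally: each list is published as its own enrtree, so its root can be resolved independently (the domains below are hypothetical placeholders, not the fleet's actual records):

# Hypothetical domains: one enrtree per node list, resolvable independently.
dig TXT boot.<fleet-domain> +short    # root of the bootstrap node list
dig TXT store.<fleet-domain> +short   # root of the store node list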
@jm-clius, @Ivansete-status, please have a look and see if you can start using the fleet for testing already.
What is missing:
- postgresql.conf configuration, the default is used

@yakimant what are the DNS discovery addresses for this fleet?
nvm. Saw the README.md :)
I validated that the following hosts have both relay and store properly configured. Amazing work @yakimant !
/dns4/store-01.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAUdrQ3uwzuE4Gy4D56hX6uLKEeerJAnhKEHZ3DxF1EfT
/dns4/store-02.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm9aDJPkhGxc2SFcEACTFdZ91Q5TJjp76qZEhq9iF59x7R
/dns4/store-01.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmMELCo218hncCtTvC2Dwbej3rbyHQcR8erXNnKGei7WPZ
/dns4/store-02.gc-us-central1-a.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmJnVR7ZzFaYvciPVafUXuYGLHPzSUigqAmeNw9nJUVGeM
/dns4/store-01.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm2M7xs7cLPc3jamawkEqbr7cUJX11uvY7LxQ6WFUdUKUT
/dns4/store-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm9CQhsuwPR54q27kNj9iaQVfyRzTGKrhFmr94oD8ujU6P
To test it, I followed these steps:
1. Start an nwaku node locally, setting staticnode and storenode to the first node from the list above. With that, I established the remote relay and store peer, respectively.
i. Then, I made my local node publish a few messages by sending JSON-RPC requests to it.
ii. Then, I made a store request to my local node through JSON-RPC, and in turn my local node made a Store request to the remote node. The JSON-RPC command then returned the messages stored by the *F1EfT peer.
I repeated step 1.ii with the rest of the nodes. In all cases, the stored messages were properly received, indicating that all the nodes are relay-connected to each other and that they all properly handle Store requests.
Notes:
I ran wakunode2 with the following command: ./build/wakunode2 --config-file=cfg_node_a.txt, and I've been editing the config file to follow the above steps.
cfg_node_a.txt
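For reference, a roughly equivalent invocation with the settings passed as CLI flags instead of the config file (flag names assumed from nwaku's CLI; the multiaddr is the first one in the list above; JSON-RPC settings for port 8546 are omitted):

# Hedged equivalent of the config-file run: static relay peer + store peer
# pointing at the same fleet node, subscribed to the shard used below.
./build/wakunode2 \
  --relay=true \
  --pubsub-topic="/waku/2/rs/16/64" \
  --staticnode="/dns4/store-01.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAUdrQ3uwzuE4Gy4D56hX6uLKEeerJAnhKEHZ3DxF1EfT" \
  --storenode="/dns4/store-01.do-ams3.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAmAUdrQ3uwzuE4Gy4D56hX6uLKEeerJAnhKEHZ3DxF1EfT"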
This is the command I used to publish messages:
curl -d "{\"jsonrpc\":\"2.0\",\"id\":"$(date +%s%N)",\"method\":\"post_waku_v2_relay_v1_message\", \"params\":[\"/waku/2/rs/16/64\", {\"contentTopic\": \"jamon\", \"timestamp\":"$(date +%s%N)", \"payload\":\"cmFuZG9tCg==\"}]}" --header "Content-Type: application/json" http://localhost:8546
This is the command I used to make my local node retrieve the stored messages:
curl -d '{"jsonrpc":"2.0","id":"id","method":"get_waku_v2_store_v1_messages"}' --header "Content-Type: application/json" http://localhost:8546
cc: @jm-clius
I can see messages in db:
nim-waku=> select id from messages;
id
------------------------------------------------------------------
2be984aa463c976553b7ab089c6708fa97509449f7359693ae98665fecb1a652
2be984aa463c976553b7ab089c6708fa97509449f7359693ae98665fecb1a652
2be984aa463c976553b7ab089c6708fa97509449f7359693ae98665fecb1a652
f0f81dd5f875ebac8402617ac0c3cd0e7b0c72de4e23cacd23fc7362aacd8b77
(4 rows)
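For anyone reproducing this check, something along these lines works against the store database (host and user are placeholders; the database name matches the nim-waku=> prompt above):

# Placeholder host/user; lists the message hashes currently stored.
psql -h <store-db-host> -U <db-user> -d nim-waku -c 'SELECT id FROM messages;'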
Great, I think further dogfooding on this fleet can continue. @richard-ramos
@jm-clius shall we close the issue or wait for testing results?
@yakimant I think we can close this issue and raise any further problems/issues separately. The bulk of the work here has been done. Thanks for your excellent work here!
Requirements
- status.sharding.store: 5 nodes configured only with relay and store. These will be the store/historical message providers
- status.sharding.bootstrap: 5 nodes configured with relay, filter, lightpush, peer-exchange. These are the main bootstrap nodes and also provide services to resource-restricted nodes

Configuration common to both fleets:
- wakuv2.shards as inspiration
- Pubsub Topics
- Protected Topics

Configuration for status.sharding.bootstrap
- relay, filter, lightpush, peer-exchange (no store)
- enrtree: URL
- nim_waku_dns_disc_url points to own DNS node list

Configuration for status.sharding.store
- relay, store
- enrtree: URL
- nim_waku_dns_disc_url points to status.sharding.bootstrap DNS node list (NB)
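For illustration, a hedged sketch of what these two per-node-type configurations could translate to in nwaku CLI terms (flag names as in current nwaku; enrtree URLs, keys and database credentials are placeholders, and the protected-topic entries from the common configuration are omitted):

# status.sharding.bootstrap node: relay + light protocols, no store,
# DNS discovery pointing at its own node list.
./build/wakunode2 \
  --relay=true --filter=true --lightpush=true --peer-exchange=true \
  --store=false \
  --dns-discovery=true \
  --dns-discovery-url="enrtree://<key>@<bootstrap-node-list-domain>" \
  --discv5-discovery=true

# status.sharding.store node: relay + store backed by Postgres,
# DNS discovery pointing at the bootstrap node list (per the NB above).
./build/wakunode2 \
  --relay=true --store=true \
  --store-message-db-url="postgres://<user>:<password>@<db-host>:5432/<db-name>" \
  --dns-discovery=true \
  --dns-discovery-url="enrtree://<key>@<bootstrap-node-list-domain>" \
  --discv5-discovery=true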