status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Research alternatives to Hetzner for testnet hosting #132

Closed: jakubgs closed this issue 1 year ago

jakubgs commented 2 years ago

Since Hetzner hates crypto mining and states so in their policy:

Therefore the following actions are prohibited:

  • Operating applications that are used to mine crypto currencies

https://www.hetzner.com/legal/dedicated-server/

We will need to migrate our hosts to an alternative server provider soon. To do that we need to find a suitable alternative.

The general requirements involve:

We'll need about 21 hosts with similar specs.

jakubgs commented 1 year ago

Sync progress is kinda uneven:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3473549","sync_distance":"1616969","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1498199","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3463351","sync_distance":"1627167","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"2694979","sync_distance":"2395539","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*                                                        
23G /data/beacon-node-prater-libp2p
23G /data/beacon-node-prater-stable
24G /data/beacon-node-prater-testing
64G /data/beacon-node-prater-unstable
4.0K    /data/era
16K /data/lost+found
132G    total
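
For reference, c here is presumably a local shell alias for curl; spelled out, the same check would be roughly:

# plain-curl equivalent of the alias used above (assumption: "c" wraps curl with silent output)
for port in $(seq 9300 9303); do
  curl -s "http://localhost:${port}/eth/v1/node/syncing" | jq -c .
done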
jakubgs commented 1 year ago

There's something weird happening:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3040637","sync_distance":"2056727","is_syncing":true,"is_optimistic":true}}

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3041869","sync_distance":"2055520","is_syncing":true,"is_optimistic":true}}

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3042125","sync_distance":"2055269","is_syncing":true,"is_optimistic":true}}

It seems like the sync_distance is increasing, and not decreasing. What the hell?
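
A quick way to confirm the trend is to sample one node's sync distance over time, for example (a sketch, assuming curl is available and port 9303 is one of the affected nodes):

# print a UTC timestamp and the current sync_distance once a minute
while true; do
  printf '%s ' "$(date -u +%T)"
  curl -s http://localhost:9303/eth/v1/node/syncing | jq -r '.data.sync_distance'
  sleep 60
done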

jakubgs commented 1 year ago

For some reason disk I/O has been going down and down:

image

Which seems to indicate that we're syncing slower and slower.

jakubgs commented 1 year ago

But network traffic is steady:

image

jakubgs commented 1 year ago

Seems like Fanout health has gotten worse, and the pattern matches the I/O drop:

image

jakubgs commented 1 year ago

Honestly, it seems stuck:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1511541","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1511541","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592321","sync_distance":"1511539","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3355177","sync_distance":"1748683","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*                                                        
23G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
24G /data/beacon-node-prater-testing
66G /data/beacon-node-prater-unstable
4.0K    /data/era
16K /data/lost+found
135G    total

image

Not sure what to do about that.

jakubgs commented 1 year ago

Since it seems like the nodes other than unstable are stuck, I've restarted testing after removing its SQLite DB:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1519492","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"711","sync_distance":"5111100","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3945341","sync_distance":"1166470","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1519492","is_syncing":true,"is_optimistic":false}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*                                                        
24G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
3.8G    /data/beacon-node-prater-testing
82G /data/beacon-node-prater-unstable
4.0K    /data/era
16K /data/lost+found
133G    total
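
For the record, the wipe-and-restart above amounted to roughly the following (a sketch; the systemd service name and the db/ subdirectory are assumptions, not copied from the actual playbook):

# stop the node, drop its on-disk database, and start it again so it re-syncs from scratch
sudo systemctl stop beacon-node-prater-testing      # service name assumed
sudo rm -rf /data/beacon-node-prater-testing/db     # database location assumed
sudo systemctl start beacon-node-prater-testing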
jakubgs commented 1 year ago

I found this in the stable node's logs:

{
  "lvl": "INF",
  "ts": "2023-03-03 13:23:24.000+00:00",
  "msg": "Slot start",
  "topics": "beacnde",
  "slot": 5111817,
  "epoch": 159744,
  "sync": "04d04h22m (70.19%) 4.1865slots/s (QQQQQQQUQQ:3598911)",
  "peers": 306,
  "head": "b0317895:3592319",
  "finalized": "112257:b5bf95d9",
  "delay": "352us289ns"
}

The sync field shows ~4.19 slots/s at 70.19%, so stable is progressing, just slowly. I'm leaving stable and libp2p as is.
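
Since the Slot start entries are one JSON object per line, the sync progress can be pulled straight out of the logs with grep and jq, roughly like this (the log path is an assumption, by analogy with the mainnet hosts later in this thread):

grep 'Slot start' /var/log/service/beacon-node-prater-stable/service.log \
  | jq -r '.sync' | tail -n1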

jakubgs commented 1 year ago

The Geth node sync seems absurdly fast:

image

But right now it's doing this:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % tail -n3 /var/log/docker/geth-goerli-03-node/docker.log
INFO [03-06|10:08:22.536] Syncing: chain download in progress      synced=100.00% chain=65.26GiB  headers=8,559,384@2.87GiB    bodies=8,559,319@54.00GiB   receipts=8,559,319@8.39GiB    eta=2.363s
INFO [03-06|10:08:30.542] Syncing: chain download in progress      synced=100.00% chain=65.26GiB  headers=8,559,384@2.87GiB    bodies=8,559,319@54.00GiB   receipts=8,559,319@8.39GiB    eta=2.363s
INFO [03-06|10:08:38.546] Syncing: chain download in progress      synced=100.00% chain=65.26GiB  headers=8,559,384@2.87GiB    bodies=8,559,319@54.00GiB   receipts=8,559,319@8.39GiB    eta=2.363s

Which seems like it's close, but not quite there. It seems kinda stuck, since it shows eta=2.363s over and over again.

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % grep 'eta=2.36' /var/log/docker/geth-goerli-03-node/docker.log | wc -l
66
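
Another way to check whether Geth is actually stuck is to ask it directly over JSON-RPC; eth_syncing is a standard method, though the HTTP RPC port and whether it is exposed on this host are assumptions:

# query Geth's sync status over its HTTP JSON-RPC endpoint (port 8545 assumed)
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545 | jq .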
jakubgs commented 1 year ago

That's possibly because the unstable beacon node isn't fully synced yet:

{"data":{"head_slot":"3592319","sync_distance":"1540138","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"1744763","sync_distance":"3387694","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"5076117","sync_distance":"56340","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1540138","is_syncing":true,"is_optimistic":false}}
jakubgs commented 1 year ago

We can also see that the CPU usage has gone down:

image

I assume it will go further down once all nodes are synced.

jakubgs commented 1 year ago

I'm tempted to just say the host is good and order the 20 hosts we need to migrate.

jakubgs commented 1 year ago

Looks like unstable is fully synced, and so is its Geth node:

admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1546922","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"2251622","sync_distance":"2887619","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"5139241","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3766675","sync_distance":"1372566","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/* /docker/*
29G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
23G /data/beacon-node-prater-testing
157G    /data/beacon-node-prater-unstable
18G /data/era
16K /data/lost+found
9.6M    /docker/geth-goerli-01
11M /docker/geth-goerli-02
256G    /docker/geth-goerli-03
70G /docker/geth-goerli-04
3.0M    /docker/log
16K /docker/lost+found
573G    total

And the CPU load is now mostly coming from stable and libp2p which are still syncing.

jakubgs commented 1 year ago

I have asked support what happens if we need a bigger SSD in the future:

We have in stock 800GB and 1.6TB Enterprise SSD only. We can add also 3.84TB SSD but the delivery time is 10 to 20 days. You can upgrade the storage capacity but will have to wait the delivery time.

So there is an option for bigger SSDs, we just have to ask them to purchase them in advance.

jakubgs commented 1 year ago

It took the unstable node ~12 days to fully sync and start receiving attestations, from 2023-02-22 to 2023-03-06:

image

Which is not terrible. I think we can go ahead with the order of 20 servers.

jakubgs commented 1 year ago

According to Zahary the low mesh health is a red flag:

image

It should be looking like this: linux-01.he-eu-hel1.nimbus.prater

image

jakubgs commented 1 year ago

If we look at the host's packet loss in comparison to a Hetzner host, it's fine. Just a bit of loss on the 6th:

linux-01.ih-eu-mda1.nimbus.prater

image

linux-01.he-eu-hel1.nimbus.prater

image

jakubgs commented 1 year ago

Network latency looks fine too:

linux-01.ih-eu-mda1.nimbus.prater

image

linux-01.he-eu-hel1.nimbus.prater

image


The Innova Hosting network seems less consistent, but has lower latency overall.

jakubgs commented 1 year ago

And if we look at latency to 1.1.1.1 itself, it's actually better than from the Hetzner host:

linux-01.ih-eu-mda1.nimbus.prater

image

linux-01.he-eu-hel1.nimbus.prater

image


By a whole ~15 milliseconds.

jakubgs commented 1 year ago

The main difference between hosts is the CPU:

And indeed, the Hetzner host has 2 more cores and they run at a higher frequency. And we can see some cores being saturated at times:

image

But that's coming from the nodes that are still syncing, not from ones that are already synced, like the unstable node:

image

So I don't think this issue is caused by the CPU, but I'm not 100% sure.

jakubgs commented 1 year ago

Apparently the issue is that for some reason other nodes are giving us a bad score.

@Menduist will investigate why exactly we are getting a bad score.

Menduist commented 1 year ago

It should be looking like this: linux-01.he-eu-hel1.nimbus.prater

Just for reference, when comparing mesh health, you must compare synced nodes with a similar network, validator count & --subscribe-all-subnets setting. And for some reason, people are using the public API on this node with their VC, which falsifies the validator count.

beacon-node-prater-libp2p@linux-01.he-eu-hel1.nimbus.prater (0 validators, no subscribe-all-subnets, but restarted less often)
image

beacon-node-prater-unstable@linux-01.ih-eu-mda1.nimbus.prater (0 validators, no subscribe-all-subnets)
image

The synced part of the second graph does seem more chaotic than the first node's.

Menduist commented 1 year ago

Started to look into this; the flickering topics are the light client ones, not sure why yet. AFAICT, the light client topics don't have any REJECT rule, so it's not a case of us sending invalid stuff.

I opened a PR to add a metric that will tell us if the new nodes are missing light client messages: https://github.com/status-im/nimbus-eth2/pull/4745

When connecting a node on my laptop to beacon-node-prater-unstable@linux-01.ih-eu-mda1.nimbus.prater, they stay in the mesh without issue.

Menduist commented 1 year ago

Btw, the node on my laptop also has issues keeping peers in the light client topics, so it doesn't seem specific to the server.

Menduist commented 1 year ago

I'll continue looking into this at some point, but TL;DR: it's not a blocker for switching providers, it's just random depending on who we are connected to. @jakubgs

jakubgs commented 1 year ago

I have contacted their support to discuss costs and payment options:

I have discussed the server with the team and it looks like we are good to go with a purchase of 20 servers like that. Now the questions are:

  • What would be the monthly price?
  • What would be the quarterly price?
  • What would be the annual price?
  • Would it be possible to pay via bank transfer annually?
jakubgs commented 1 year ago

Looks like the best they can offer right now is a 5% discount if we pay for a year:

The price for one month is 173 Euro/month.

If you pay for one year I can make a 5% discount.

PS: I will have to check with our support team if we have 20 servers available for activation. If not, I will tell you how many servers we have available now and the delivery time for the quantity that we do not have in stock, but usually it is about 10 days.

That's a bit low if you ask me. And I don't worry about the delay as I will be away all of next week.

jakubgs commented 1 year ago

I've asked them about the final amount and preferred crypto:

Ok, so my understanding is that a bulk order of 20 servers with an annual payment plan, paid in crypto, with a total 10% discount will cost:

173 * 20 * 12 * 0.9
37368.0 EUR

Is this correct? Also, what cryptocurrencies are you willing to accept?

And I've invited deivids@status.im to our org so he can discuss the payment details with them.

jakubgs commented 1 year ago

Got a response about crypto they accept:

The only crypto that can be processed automatically on our website is Bitcoin, but we do accept and process the transaction manually for USDT and ETH.

And for us USDT will work out fine. I asked for the invoice.

Menduist commented 1 year ago

BTW, we should test a node with --subscribe-all-subnets=true to see if the CPU keeps up, since these CPUs are apparently less powerful?
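
For context, that is a plain beacon node CLI flag, so on a node run by hand it would look roughly like this (a sketch; the fleet nodes themselves are configured through the Ansible roles in this repo, and the data dir here is just an example):

# minimal manual invocation with the flag under discussion (paths/network are examples)
build/nimbus_beacon_node \
  --network=prater \
  --data-dir=/data/beacon-node-prater-unstable \
  --subscribe-all-subnets=true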

jakubgs commented 1 year ago

I enabled it on all 4 nodes and this is what we get:

image

It appears to work fine, with about a 3-fold increase in average load, but not going beyond 7 while we have 10 cores.

Should be fine. But let's take a look at the node graphs too.

jakubgs commented 1 year ago

We get more disk I/O, but not that much more:

image

jakubgs commented 1 year ago

These graphs look worse now, but that's probably temporary:

image

jakubgs commented 1 year ago

Looks like stable is recovering:

image

unstable looks good already:

image

jakubgs commented 1 year ago

Got a response from InnovaHosting about status of our order:

We did not receive all the components yet, we will receive them in a few days.

We can now activate a bit less than half of your order, and in a few days we can activate the rest of the order.

Should we activate the available servers now and the rest of them in a few days, or do we wait to activate them all together?

I told them the sooner the better, so we can start the setup.

jakubgs commented 1 year ago

We finally received some of the servers:

image

I will deploy them for nimbus.mainnet fleet, since that's the one that's low on disk space.

jakubgs commented 1 year ago

For some reason they are giving us hosts with different CPUs and different numbers of them:

 > a ih-eu-mda1 --become -a 'dmidecode -t processor | grep Version' 
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
    Version: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
linux-01.ih-eu-mda1.nimbus.prater | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
    Version:                                                 
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz  

This is most probably detrimental to our purpose of testing software in consistent environments. I will ask their support about this.

jakubgs commented 1 year ago

I bootstrapped the first three hosts and deployed the old setup with 6 nodes on each host:

But I want to hold off on deploying nodes until I hear from support about the weird CPU layout. I think we'd rather have a homogeneous setup with the same hardware on each host to avoid tiny differences that could affect our ability to consistently compare behavior of nodes between hosts.

jakubgs commented 1 year ago

Their response about mixed CPU layouts is:

For us it is hard to provide 20 identical servers with the same CPU, same RAM, etc. We may provide 2-3 different configs, but with CPU models close to each other, with 8-10 CPU cores and 2.4-3.0GHz frequency.

If you need 20 absolutely identical servers, then we have to agree on a configuration; after that we will try to free up/purchase that exact configuration, but it will take some time and will depend on availability, as not all components may be available from our suppliers.

No, we didn't charge anything additional for the double CPU, but if you don't need dual CPUs, then we may provide servers with only 1 CPU installed.

I've asked them to install just a single Xeon E5-2690 v2 in all hosts for now.

jakubgs commented 1 year ago

Their support has reconfigured the hosts to now have only a single Xeon E5-2690 v2:

 > a ih-eu-mda1 --become -a 'dmidecode -t processor | grep "Version:  Intel"'
linux-01.ih-eu-mda1.nimbus.prater | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz      
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
    Version:  Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz   

And also a 5th host has been provisioned for us.

jakubgs commented 1 year ago

I added the 4th and 5th mainnet hosts and reverted temporary node layout changes:

jakubgs commented 1 year ago

Since we're going to be discontinuing or re-using the Hetzner hosts, I've made a list of what we currently have:

| ID | Hostname | Model | CPU | Extra |
|---|---|---|---|---|
| 1424980 | goerli-01.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
| 1485912 | goerli-02.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
| 1485914 | goerli-03.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
| 1660565 | linux-01.he-eu-hel1.nimbus.sepolia | AX41 | Ryzen 5 3600 6-Core | |
| 1787515 | linux-01.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
| 1787528 | linux-02.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
| 1787530 | linux-03.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
| 1787532 | linux-04.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
| 1787547 | linux-05.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | 1.92 TB NVMe SSD |
| 1787573 | linux-06.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | 1.92 TB NVMe SSD |
| 1517896 | metal-01.he-eu-hel1.nimbus.eth1 | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1674260 | metal-01.he-eu-hel1.nimbus.fluffy | AX41 | Ryzen 5 3600 6-Core | |
| 1674261 | metal-02.he-eu-hel1.nimbus.fluffy | AX41 | Ryzen 5 3600 6-Core | |
| 1551432 | metal-01.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1551433 | metal-02.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1551434 | metal-03.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1551436 | metal-04.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1551437 | metal-05.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1551438 | metal-06.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1890187 | metal-07.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
| 1666168 | windows-01.he-eu-hel1.nimbus.prater | AX41 | Ryzen 5 3600 6-Core | Windows Server 2019 |

As we can see, we have 6 Prater testnet AX61 hosts with a Ryzen 9 3900 12-Core, which would probably work well in our CI.

jakubgs commented 1 year ago

There's something wrong with the unstable branch. It doesn't want to sync from a trusted node properly:

admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"1003256","sync_distance":"5252945","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6256200","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256201","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256200","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256004","sync_distance":"197","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6252672","sync_distance":"3529","is_syncing":true,"is_optimistic":false}}

I synced ERA files from our old hosts and as we can see stable-02, testing-01, and testing-02 have synced fine using a trusted node API endpoint. I have left stable-01 to sync fully without using a trusted node, but both unstable nodes are having a hard time syncing.

We can see the same thing on linux-02:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"968867","sync_distance":"5287335","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6256201","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256201","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256202","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6252789","sync_distance":"3413","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6252725","sync_distance":"3477","is_syncing":true,"is_optimistic":false}}

They seem to be going backwards.
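
For reference, the trusted node sync mentioned above is done with the beacon node's trustedNodeSync subcommand, roughly like this (a sketch; the data dir and the trusted endpoint URL are placeholders, not the actual fleet configuration):

# one-off checkpoint sync from a trusted beacon API endpoint before starting the node normally
build/nimbus_beacon_node trustedNodeSync \
  --network=mainnet \
  --data-dir=/data/beacon-node-mainnet-stable-02 \
  --trusted-node-url=http://trusted-node.example.org:5052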

jakubgs commented 1 year ago

Some Hetzner unstable nodes have the same issue where sync_distance is growing instead of shrinking:

admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5431","is_syncing":true,"is_optimistic":false}}
admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5434","is_syncing":true,"is_optimistic":false}}
admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5435","is_syncing":true,"is_optimistic":false}}
jakubgs commented 1 year ago

@tersec suggested trying the --sync-light-client=off flag, based on the possible impact of this PR being merged last week:

jakubgs commented 1 year ago

Got another response from support:

From our supplier we are going to purchase the following CPU models:

  • INTEL XEON E5-2643 v4 3.40GHz 20MB 6-CORE CPU, 20 pieces
  • INTEL XEON E5-2667 v3 3.20GHz 20MB 8-CORE CPU, 8 pieces

I want to know if these CPUs will be good for you, because at the moment there is a shortage of the E5-2690 v2; we have only 4 CPUs left as of today.

Seems to me like E5-2667 v3 is a better option based on this:

image

https://www.cpubenchmark.net/compare/2811vs2441vs2057/Intel-Xeon-E5-2643-v4-vs-Intel-Xeon-E5-2667-v3-vs-Intel-Xeon-E5-2690-v2

But I'm worried it has fewer threads than the E5-2690 v2. Maybe we could use the E5-2690 v2 for Prater and the E5-2667 v3 for mainnet.

jakubgs commented 1 year ago

I tried with --sync-light-client=off but we are still gaining sync distance:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done            
{"data":{"head_slot":"6256544","sync_distance":"223","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"254","is_syncing":true,"is_optimistic":false}}

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done            
{"data":{"head_slot":"6256544","sync_distance":"228","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"259","is_syncing":true,"is_optimistic":false}}

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6256544","sync_distance":"231","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"262","is_syncing":true,"is_optimistic":false}}
jakubgs commented 1 year ago

If we check the Slot start log messages we see that not much in them is changing:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % grep 'Slot start' /var/log/service/beacon-node-mainnet-unstable-02/service.log | grep '"finalized":"195514:c6856d5f"' | wc -l
107

But the slot number keeps growing:

admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % grep 'Slot start' /var/log/service/beacon-node-mainnet-unstable-02/service.log | jq -c '{ slot, epoch, head, finalized }' | sort -u | tail -n5
{"slot":6256939,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256940,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256941,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256942,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256943,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
jakubgs commented 1 year ago

After talking with @tersec we've decided to do a dumb bisect of commits. We'll use the current state of testing/stable as a starting point, which gives us a difference of 75 commits. This should not require too many tests.
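
The candidate commits can be listed with plain git, e.g. (a sketch; the commit range is taken from the first and last entries of the table below, assuming the earlier commit is an ancestor of the later one):

# list the commits between the first and last tested commits (run next to a nimbus-eth2 checkout)
git -C nimbus-eth2 log --oneline c9eb89e9..1d3e8382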

Here's the layout I'm going with for the first round, and the results:

| Date | Host | Node | Commit | Result |
|---|---|---|---|---|
| 2023-03-22 | linux-04 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/c9eb89e9 | :heavy_check_mark: |
| 2023-03-30 | linux-04 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/b4d731a1 | :heavy_check_mark: |
| 2023-04-09 | linux-01 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/b7d08d0a | :heavy_check_mark: |
| 2023-04-11 | linux-02 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/c3d043c0 | :heavy_check_mark: |
| 2023-04-16 | linux-01 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/57623af3 | :heavy_check_mark: |
| 2023-04-17 | linux-03 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/4df851f4 | :heavy_check_mark: |
| 2023-04-17 | linux-02 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/b5115215 | :x: |
| 2023-04-18 | linux-03 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/1d3e8382 | :x: |
jakubgs commented 1 year ago

The issue started somewhere between https://github.com/status-im/nimbus-eth2/commit/4df851f4 and https://github.com/status-im/nimbus-eth2/commit/b5115215 (inclusive).