Sync progress is kinda uneven:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3473549","sync_distance":"1616969","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1498199","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3463351","sync_distance":"1627167","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"2694979","sync_distance":"2395539","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
23G /data/beacon-node-prater-libp2p
23G /data/beacon-node-prater-stable
24G /data/beacon-node-prater-testing
64G /data/beacon-node-prater-unstable
4.0K /data/era
16K /data/lost+found
132G total
There's something weird happening:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505045","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3040637","sync_distance":"2056727","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505070","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3041869","sync_distance":"2055520","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1505075","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3042125","sync_distance":"2055269","is_syncing":true,"is_optimistic":true}}
It seems like the `sync_distance` is increasing, and not decreasing. What the hell?
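Most likely there is nothing mysterious about the metric itself: per the beacon API, `sync_distance` is (roughly) the gap between the current wall-clock slot and the node's head, so when the head stalls the distance grows by one every 12-second slot. A quick sketch to confirm (the Prater genesis timestamp below is an assumption on my part, double-check it):

```bash
# Sketch: compare the wall-clock slot against the node's head slot.
# GENESIS is assumed to be the Prater genesis timestamp; verify before trusting the numbers.
GENESIS=1616508000
SLOT_SECONDS=12
wall_slot=$(( ( $(date +%s) - GENESIS ) / SLOT_SECONDS ))
head_slot=$(curl -s "http://localhost:9300/eth/v1/node/syncing" | jq -r .data.head_slot)
echo "wall-clock slot: ${wall_slot}, head: ${head_slot}, distance: $(( wall_slot - head_slot ))"
```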
For some reason disk I/O has been going down and down:
Which seems to indicate that we're syncing slower and slower.
But network traffic is steady:
Seems like Fanout health has gotten worse, and the pattern matches the I/O drop:
Honestly, it seems stuck:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1511541","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592319","sync_distance":"1511541","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3592321","sync_distance":"1511539","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"3355177","sync_distance":"1748683","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
23G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
24G /data/beacon-node-prater-testing
66G /data/beacon-node-prater-unstable
4.0K /data/era
16K /data/lost+found
135G total
Not sure what to do about that.
Since it seems like the nodes other than `unstable` are stuck, I've restarted `testing` after removing its SQLite DB.
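A rough sketch of what that amounts to (the service name and DB sub-path are assumptions based on the data directories above, not taken from this host):

```bash
# Sketch: stop the node, drop its SQLite DB, start it again.
# Service name and db/ sub-directory are assumed, not verified.
sudo systemctl stop beacon-node-prater-testing
sudo rm -rf /data/beacon-node-prater-testing/db
sudo systemctl start beacon-node-prater-testing
```

Sync status and disk usage after the restart: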
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1519492","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"711","sync_distance":"5111100","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3945341","sync_distance":"1166470","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1519492","is_syncing":true,"is_optimistic":false}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/*
24G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
3.8G /data/beacon-node-prater-testing
82G /data/beacon-node-prater-unstable
4.0K /data/era
16K /data/lost+found
133G total
I found this in the `stable` node logs:
{
"lvl": "INF",
"ts": "2023-03-03 13:23:24.000+00:00",
"msg": "Slot start",
"topics": "beacnde",
"slot": 5111817,
"epoch": 159744,
"sync": "04d04h22m (70.19%) 4.1865slots/s (QQQQQQQUQQ:3598911)",
"peers": 306,
"head": "b0317895:3592319",
"finalized": "112257:b5bf95d9",
"delay": "352us289ns"
}
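The `sync` field packs the ETA, percentage, slots/s rate, and per-worker status into one string, so the `Slot start` lines are a convenient way to track sync speed over time. A minimal sketch (the prater log path is an assumption modeled on the mainnet paths quoted later in this thread):

```bash
# Sketch: pull timestamp, slot, and sync summary from the JSON service logs.
# Log path is assumed; adjust to wherever the prater services actually log.
grep 'Slot start' /var/log/service/beacon-node-prater-stable/service.log \
  | jq -r '[.ts, .slot, .sync] | @tsv' | tail -n 5
```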
So I'm leaving `stable` and `libp2p` as is.
The Geth node sync seems absurdly fast:
But right now it's doing this:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % tail -n3 /var/log/docker/geth-goerli-03-node/docker.log
INFO [03-06|10:08:22.536] Syncing: chain download in progress synced=100.00% chain=65.26GiB headers=8,559,384@2.87GiB bodies=8,559,319@54.00GiB receipts=8,559,319@8.39GiB eta=2.363s
INFO [03-06|10:08:30.542] Syncing: chain download in progress synced=100.00% chain=65.26GiB headers=8,559,384@2.87GiB bodies=8,559,319@54.00GiB receipts=8,559,319@8.39GiB eta=2.363s
INFO [03-06|10:08:38.546] Syncing: chain download in progress synced=100.00% chain=65.26GiB headers=8,559,384@2.87GiB bodies=8,559,319@54.00GiB receipts=8,559,319@8.39GiB eta=2.363s
Which seems like it's close, but not quite there. It seems kinda stuck, since it shows `eta=2.363s` over and over again:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % grep 'eta=2.36' /var/log/docker/geth-goerli-03-node/docker.log | wc -l
66
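Another way to check whether Geth itself considers the sync stuck is the standard `eth_syncing` JSON-RPC call (the port here is the default 8545 and is an assumption about how this container is exposed):

```bash
# Sketch: query geth's sync status directly over JSON-RPC (port assumed).
curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
  http://localhost:8545 | jq
```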
That's possibly because the `unstable` beacon node isn't fully synced yet:
{"data":{"head_slot":"3592319","sync_distance":"1540138","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"1744763","sync_distance":"3387694","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"5076117","sync_distance":"56340","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"3592319","sync_distance":"1540138","is_syncing":true,"is_optimistic":false}}
We can also see that the CPU usage has gone down:
I assume it will go further down once all nodes are synced.
I'm tempted to just say the host is good and order the 20 hosts we need for the migration.
Looks like `unstable` is fully synced, and so is its Geth node:
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % for port in $(seq 9300 9303); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"3592319","sync_distance":"1546922","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"2251622","sync_distance":"2887619","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"5139241","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"3766675","sync_distance":"1372566","is_syncing":true,"is_optimistic":true}}
admin@linux-01.ih-eu-mda1.nimbus.prater:~ % sudo du -hsc /data/* /docker/*
29G /data/beacon-node-prater-libp2p
24G /data/beacon-node-prater-stable
23G /data/beacon-node-prater-testing
157G /data/beacon-node-prater-unstable
18G /data/era
16K /data/lost+found
9.6M /docker/geth-goerli-01
11M /docker/geth-goerli-02
256G /docker/geth-goerli-03
70G /docker/geth-goerli-04
3.0M /docker/log
16K /docker/lost+found
573G total
And the CPU load is now mostly coming from `stable` and `libp2p`, which are still syncing.
I have asked support what happens if we need a bigger SSD in the future:
We have in stock 800GB and 1.6TB Enterprise SSDs only. We can also add a 3.84TB SSD, but the delivery time is 10 to 20 days. You can upgrade the storage capacity but will have to wait for the delivery time.
So there is an option for bigger SSDs, we just have to ask them to purchase them in advance.
It took the `unstable` node ~12 days to fully sync and start receiving attestations, from 2023-02-22 to 2023-03-06:
Which is not terrible. I think we can go ahead with the order of 20 servers.
According to Zahary the low mesh health is a red flag:
It should be looking like this: linux-01.he-eu-hel1.nimbus.prater
If we look at host packet loss in comparison to a Hetzner host, it's fine. Just a bit of loss on the 6th:
linux-01.ih-eu-mda1.nimbus.prater
linux-01.he-eu-hel1.nimbus.prater
Network latency looks fine too:
linux-01.ih-eu-mda1.nimbus.prater
linux-01.he-eu-hel1.nimbus.prater
The Innova Hosting network seems less consistent, but has lower latency overall.
And if we look at 1.1.1.1 itself, it's actually better than Hetzner:
linux-01.ih-eu-mda1.nimbus.prater
linux-01.he-eu-hel1.nimbus.prater
By a whole ~15 milliseconds.
The main difference between hosts is the CPU:
And indeed, the Hetzner host has 2 more cores and they have higher frequency. And we can see some cores being saturated at times:
But that's coming from the nodes that are still syncing, not from ones that are already synced, like the `unstable` node:
So I don't think this issue is caused by the CPU, but not 100% sure.
Apparently the issue is that for some reason other nodes are giving us a bad score.
@Menduist will investigate why exactly we are getting a bad score.
It should be looking like this: linux-01.he-eu-hel1.nimbus.prater
Just for reference, when comparing mesh health, the nodes must be synced and have a similar network, validator count & `--subscribe-all-subnets` setting. And for some reason, people are using the public API on this node with their VC, which falsifies the validator count.
`beacon-node-prater-libp2p@linux-01.he-eu-hel1.nimbus.prater` (0 validators, no `subscribe-all-subnets`, but restarted less often)
`beacon-node-prater-unstable@linux-01.ih-eu-mda1.nimbus.prater` (0 validators, no `subscribe-all-subnets`)
The synced part of the second graph does seem more chaotic than the first node's.
Started to look into this; the flickering topics are the light client ones, not sure why yet.
AFAICT, the light client topics don't have any `REJECT` rule, so it's not a case of us sending invalid stuff.
I opened a PR to add a metric that will tell us if the new nodes are missing light client messages: https://github.com/status-im/nimbus-eth2/pull/4745
When connecting a node on my laptop to `beacon-node-prater-unstable@linux-01.ih-eu-mda1.nimbus.prater`, they stay in the mesh without issue.
Btw, the node on my laptop also has issues keeping peers in the light client topics, so it doesn't seem specific to the server.
I'll continue looking into this at some point, but TLDR: not a blocker to switch provider, it's just random depending on who we are connected to @jakubgs
I have contacted their support to discuss costs and payment options:
I have discussed the server with the team and it looks like we are good to go with a purchase of 20 servers like that. Now the questions are:
- What would be the monthly price?
- What would be the quarterly price?
- What would be the annual price?
- Would it be possible to pay via bank transfer annually?
Looks like the best they can offer right now is 5% discount if we pay for a year:
The price for one month is 173 Euro/month.
If you pay for one year I can make a 5% discount.
PS: I will have to check with our support team if we have 20 servers available for activation. If not, I will tell you how many servers we have available now and the delivery time for the quantity that we do not have in stock, but usually it is about 10 days.
That's a bit low if you ask me. And I don't worry about the delay as I will be away all of next week.
I've asked them about the final amount and preferred crypto:
Ok, so my understanding is that a bulk order of 20 servers with an annual payment plan using crypto, with a total 10% discount, will cost:
173 * 20 * 12 * 0.9 = 37368.0 EUR
Is this correct? Also, what cryptocurrencies are you willing to accept?
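For the record, the arithmetic checks out:

```bash
echo '173 * 20 * 12 * 0.9' | bc   # 37368.0 EUR
```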
And I've invited deivids@status.im to our org so he can discuss the payment details with them.
Got a response about crypto they accept:
The only crypto that can be processed automatically on our website is Bitcoin, but we do accept and process the transaction manually for USDT and ETH.
And for us USDT will work out fine. I asked for the invoice.
BTW, we should test a node with `--subscribe-all-subnets=true` to see if the CPU follows along, since they are apparently less powerful?
I enabled it on all 4 nodes and this is what we get:
It appears to work fine with about a 3-fold increase in average load, but not going beyond 7 when we have 10 cores.
Should be fine. But let's take a look at the node graphs too.
We get more disk I/O but not that much:
These graphs look worse now, but that might just be temporary:
Looks like `stable` is recovering:
`unstable` looks good already:
Got a response from InnovaHosting about status of our order:
We did not receive all the components yet, we will receive them in a few days.
We can now activate less than half of your order, and in a few days we can activate the rest of the order.
Should we activate the available servers now and the rest of them in a few days, or do we wait and activate all together?
I told them the sooner the better, so we can start the setup.
We finally received some of the servers:
I will deploy them to the `nimbus.mainnet` fleet, since that's the one that's low on disk space.
For some reason they are giving us hosts with different CPUs and different numbers of them:
> a ih-eu-mda1 --become -a 'dmidecode -t processor | grep Version'
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
Version: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
linux-01.ih-eu-mda1.nimbus.prater | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Version:
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
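To compare core counts and not just model strings, something along these lines should work (assuming `a` is a wrapper around `ansible`, as the command above suggests):

```bash
# Sketch: assumes `a` wraps `ansible <pattern> ...` with a shell-capable module.
a ih-eu-mda1 --become -a 'nproc'
a ih-eu-mda1 --become -a 'lscpu | grep -E "Model name|^CPU\(s\)"'
```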
This is most probably detrimental to our purpose of testing software on consistent environments. I will ask their support about this.
I bootstrapped the first three hosts and deployed the old setup with 6 nodes on each host:
But I want to hold off on deploying nodes until I hear from support about the weird CPU layout. I think we'd rather have a homogeneous setup with the same hardware on each host, to avoid small differences that could affect our ability to consistently compare the behavior of nodes between hosts.
Their response about mixed CPU layouts is:
For us it is hard to provide 20 identical servers with the same CPU, same RAM, etc. We may provide 2-3 different configs, but with CPU models close to each other, with 8-10 CPU cores and 2.4-3.0GHz frequency.
If you need 20 absolutely identical servers, then we have to agree on a configuration; after that we will try to free up/purchase the exact configuration, but it will take some time and will depend on availability, as not all components may be available from our suppliers.
No, we didn't charge anything additional for the double CPU, but if you don't need dual CPU, then we may provide the server with only 1 CPU installed.
I've asked them to install just a single Xeon E5-2690 v2 in all hosts for now.
Their support has reconfigured the hosts to now have only a single Xeon E5-2690 v2:
> a ih-eu-mda1 --become -a 'dmidecode -t processor | grep "Version: Intel"'
linux-01.ih-eu-mda1.nimbus.prater | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Version: Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
And also a 5th host has been provisioned for us.
I added the 4th and 5th mainnet hosts and reverted temporary node layout changes:
Since we're going to be discontinuing or re-using the Hetzner hosts I've made a list of what we currently have:
ID | Hostname | Model | CPU | Extra |
---|---|---|---|---|
1424980 | goerli-01.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
1485912 | goerli-02.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
1485914 | goerli-03.he-eu-hel1.nimbus.geth | AX41 | Ryzen 5 3600 6-Core | 1 TB NVMe SSD |
1660565 | linux-01.he-eu-hel1.nimbus.sepolia | AX41 | Ryzen 5 3600 6-Core | |
1787515 | linux-01.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
1787528 | linux-02.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
1787530 | linux-03.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
1787532 | linux-04.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | |
1787547 | linux-05.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | 1.92 TB NVMe SSD |
1787573 | linux-06.he-eu-hel1.nimbus.prater | AX61 | Ryzen 9 3900 12-Core | 1.92 TB NVMe SSD |
1517896 | metal-01.he-eu-hel1.nimbus.eth1 | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1674260 | metal-01.he-eu-hel1.nimbus.fluffy | AX41 | Ryzen 5 3600 6-Core | |
1674261 | metal-02.he-eu-hel1.nimbus.fluffy | AX41 | Ryzen 5 3600 6-Core | |
1551432 | metal-01.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1551433 | metal-02.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1551434 | metal-03.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1551436 | metal-04.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1551437 | metal-05.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1551438 | metal-06.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1890187 | metal-07.he-eu-hel1.nimbus.mainnet | AX41 | Ryzen 5 3600 6-Core | 2 TB NVMe SSD |
1666168 | windows-01.he-eu-hel1.nimbus.prater | AX41 | Ryzen 5 3600 6-Core | Windows Server 2019 |
As we can see we have 6 Prater testnet AX61 hosts with Ryzen 9 3900 12-Core which would probably work well in our CI.
There's something wrong with the `unstable` branch. It doesn't want to sync from a trusted node properly:
admin@linux-01.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"1003256","sync_distance":"5252945","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6256200","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256201","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256200","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256004","sync_distance":"197","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6252672","sync_distance":"3529","is_syncing":true,"is_optimistic":false}}
I synced ERA files from our old hosts, and as we can see `stable-02`, `testing-01`, and `testing-02` have synced fine using a trusted node API endpoint. I have left `stable-01` to sync fully without using a trusted node, but both `unstable` nodes are having a hard time syncing.
We can see the same thing on `linux-02`:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9300 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"968867","sync_distance":"5287335","is_syncing":true,"is_optimistic":true}}
{"data":{"head_slot":"6256201","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256201","sync_distance":"1","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6256202","sync_distance":"0","is_syncing":false,"is_optimistic":false}}
{"data":{"head_slot":"6252789","sync_distance":"3413","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6252725","sync_distance":"3477","is_syncing":true,"is_optimistic":false}}
They seem to be going backwards.
Some Hetzner `unstable` nodes have the same issue where `sync_distance` is growing instead of shrinking:
admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5431","is_syncing":true,"is_optimistic":false}}
admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5434","is_syncing":true,"is_optimistic":false}}
admin@metal-02.he-eu-hel1.nimbus.mainnet:~ % c 0:9304/eth/v1/node/syncing | jq -c
{"data":{"head_slot":"6250839","sync_distance":"5435","is_syncing":true,"is_optimistic":false}}
@tersec suggested trying the `--sync-light-client=off` flag, based on the possible impact of this PR that was merged last week:
Got another response from support:
From our supplier we are going to purchase the following CPU models:
- INTEL XEON E5-2643 v4 3.40GHz 20MB 6-CORE CPU: 20 pieces
- INTEL XEON E5-2667 v3 3.20GHz 20MB 8-CORE CPU: 8 pieces
I want to know if this CPU will be good for you, because at the moment there is a CPU shortage of the E5-2690 v2; we have only 4 CPUs left for today.
Seems to me like the E5-2667 v3 is the better option based on this:
But I'm worried it has fewer threads than the E5-2690 v2. Maybe we could use the E5-2690 v2 for Prater and the E5-2667 v3 for mainnet.
I tried with `--sync-light-client=off`, but we are still gaining sync distance:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6256544","sync_distance":"223","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"254","is_syncing":true,"is_optimistic":false}}
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6256544","sync_distance":"228","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"259","is_syncing":true,"is_optimistic":false}}
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % for port in $(seq 9304 9305); do c 0:$port/eth/v1/node/syncing | jq -c; done
{"data":{"head_slot":"6256544","sync_distance":"231","is_syncing":true,"is_optimistic":false}}
{"data":{"head_slot":"6256513","sync_distance":"262","is_syncing":true,"is_optimistic":false}}
If we check the `Slot start` log messages, we see that not much in them is changing:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % grep 'Slot start' /var/log/service/beacon-node-mainnet-unstable-02/service.log | grep '"finalized":"195514:c6856d5f"' | wc -l
107
But `slot` is growing:
admin@linux-02.ih-eu-mda1.nimbus.mainnet:~ % grep 'Slot start' /var/log/service/beacon-node-mainnet-unstable-02/service.log | jq -c '{ slot, epoch, head, finalized }' | sort -u | tail -n5
{"slot":6256939,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256940,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256941,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256942,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
{"slot":6256943,"epoch":195529,"head":"e6a490a9:6256513","finalized":"195514:c6856d5f"}
After talking with @tersec we've decided to do a dumb bisect of commits. We'll use the current state of `testing`/`stable` as a starting point, which gives us a difference of 75 commits. Since a bisect halves the range each round, that's on the order of log2(75) ≈ 7 verdicts, so this should not require too many tests.
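`git bisect` can do the bookkeeping for this, even though each verdict comes from deploying the commit to a fleet node and watching whether it keeps up; a rough sketch with placeholder bounds:

```bash
# Sketch: <good-commit> is the current testing/stable state, <bad-commit> the unstable tip.
git bisect start
git bisect bad  <bad-commit>
git bisect good <good-commit>
# git checks out a midpoint commit; build it, deploy it to a node, observe,
# then record the verdict with `git bisect good` or `git bisect bad`.
```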
Here's the layout I'm going with for the first round, and the results:
Date | Host | Node | Commit | Result |
---|---|---|---|---|
2023-03-22 | linux-04 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/c9eb89e9 | :heavy_check_mark: |
2023-03-30 | linux-04 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/b4d731a1 | :heavy_check_mark: |
2023-04-09 | linux-01 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/b7d08d0a | :heavy_check_mark: |
2023-04-11 | linux-02 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/c3d043c0 | :heavy_check_mark: |
2023-04-16 | linux-01 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/57623af3 | :heavy_check_mark: |
2023-04-17 | linux-03 | unstable-01 | https://github.com/status-im/nimbus-eth2/commit/4df851f4 | :heavy_check_mark: |
2023-04-17 | linux-02 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/b5115215 | :x: |
2023-04-18 | linux-03 | unstable-02 | https://github.com/status-im/nimbus-eth2/commit/1d3e8382 | :x: |
The issue started somewhere between https://github.com/status-im/nimbus-eth2/commit/4df851f4 and https://github.com/status-im/nimbus-eth2/commit/b5115215 (inclusive).
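To enumerate the suspect commits in that window:

```bash
# Commits reachable from b5115215 but not from 4df851f4, i.e. the suspects.
git log --oneline 4df851f4..b5115215
```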
Since Hetzner hates crypto mining and states so in their policy:
https://www.hetzner.com/legal/dedicated-server/
We will need to migrate our hosts to an alternative server provider soon. To do that we need to find a suitable alternative.
The general requirements involve:
We'll need about 21 hosts with similar specs.