solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Some vote account pubkeys and/or validator account pubkeys appear to have intrinsically worse performance than others #21450

Closed Marakaya closed 9 months ago

Marakaya commented 2 years ago

Problem

My key GtgtQLfqKjn3gaHuH7Fw64n49vr2DrYHiJAsSTNNscAE had been running on a 9900K server since May of this year. When I discovered that my node votes worse than others, I moved it to a 5950 server in the same data center (Hetzner). The number of credits for my node did not increase, so I canceled that server and moved to a 3900 in the same data center; the credits did not increase there either. The last experiment was a key exchange with a friend who earns on average 7000-8000 more credits than me every epoch and is located in another data center, in Poland. His key is 4GuRZCrg6oChXATfMWTDJ9GNjjc4qKyxzow3UYjvhuqq. The exchange took place in epoch 248. At least 5 people in our community, the Solana RU-tech Telegram group, have faced a similar problem. I will ask them to describe their problem in this issue as well, along with their keys.

Key - GtgtQLfqKjn3gaHuH7Fw64n49vr2DrYHiJAsSTNNscAE with vote account 81iuTYDaeJ71XFGkPXNUuQ8gNvHQfBxpU7Vj2hzJ9Q4e

Key 4GuRZCrg6oChXATfMWTDJ9GNjjc4qKyxzow3UYjvhuqq with vote account AQCqPa2bquEEGW1Z57XmgbfyCAqUgv9yHxSRnCDaU8oQ

Difference between keys in credits:

246 epoch: 294658 - 289803 = 4855
247 epoch: 280442 - 271825 = 8617
248 epoch: 284756 - 275907 = 8849
249 epoch: 283171 - 275767 = 7404
250 epoch: 294789 - 287803 = 6986
251 epoch: 298210 - 290613 = 7597
252 epoch: 296590 - 290535 = 6055
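For anyone who wants to recheck the arithmetic, here is a minimal Python sketch that recomputes the per-epoch gaps from the credit figures quoted above and adds the mean gap and the gap as a percentage. The names `CREDITS` and `credit_gaps` are made up for this illustration, and since the post does not say which key holds which figure, the columns are simply labeled "higher" and "lower". Note that the subtraction for epoch 251 comes out to 7597.

```python
# Credits per epoch as quoted in the post: (epoch, higher figure, lower figure).
# Which key each column belongs to is not stated in the post.
CREDITS = [
    (246, 294658, 289803),
    (247, 280442, 271825),
    (248, 284756, 275907),
    (249, 283171, 275767),
    (250, 294789, 287803),
    (251, 298210, 290613),
    (252, 296590, 290535),
]


def credit_gaps(rows):
    """Return (epoch, absolute gap, gap as % of the higher figure) for each row."""
    return [(epoch, hi - lo, 100.0 * (hi - lo) / hi) for epoch, hi, lo in rows]


gaps = credit_gaps(CREDITS)
mean_gap = sum(gap for _, gap, _ in gaps) / len(gaps)

for epoch, gap, pct in gaps:
    print(f"epoch {epoch}: gap {gap} credits ({pct:.2f}%)")
print(f"mean gap: {mean_gap:.0f} credits per epoch")
```

The same function can be pointed at any other pair of keys by swapping in their per-epoch credit figures.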

Proposed Solution

I do not have the necessary competencies to solve this problem, but I know that in version 1.9 of Solana the team found a solution to it (if I understood correctly). If there is a solution, is it possible to speed up its rollout on the testnet?

cattivik66 commented 2 years ago

I can confirm the problem reported. I was running on testnet 3 different servers:

I made some investigations with two keys: AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q and 8mHseFfqx64WTUFEF2rqAFxwTxsM6bJsPJuHoMFguy2o.

Initially I thought for a long time that the issues had something to do with firewalls, hardware or a wrong configuration. So I tried many times swapping the keys between all the servers and applying some changes. The key 8mHseFfqx64WTUFEF2rqAFxwTxsM6bJsPJuHoMFguy2o always outperformed the key AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q.

I was working on a server to try to get better results, and when I saw an improvement, I replaced the key I was using for troubleshooting (8mHseFfqx64WTUFEF2rqAFxwTxsM6bJsPJuHoMFguy2o) with the final key I wanted to use (AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q). Once the key was exchanged, I found that the votes dropped immediately, and that the troubleshooting key (8mHseFfqx64WTUFEF2rqAFxwTxsM6bJsPJuHoMFguy2o), now placed on the server where the other key had been, suddenly generated more credits.

Something strange can be seen also when comparing performance across different validators. If we compare AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q on https://metrics.stakeconomy.com/d/f2b2HcaGz/solana-community-validator-dashboard?orgId=1&var-pubkey=AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q&var-server=ts03-testnet&var-inter=1m&var-netif=All&from=now-24h&to=now&refresh=1m with 2P3YH9psWAAM6QQgA8NaQnKHQ973cKNqTSFFCNYE4gjk on https://metrics.stakeconomy.com/d/f2b2HcaGz/solana-community-validator-dashboard?orgId=1&var-pubkey=2P3YH9psWAAM6QQgA8NaQnKHQ973cKNqTSFFCNYE4gjk&var-server=kobzoha-testnet&var-inter=1m&var-netif=All&from=now-6h&to=now&refresh=1m

you can see that, even though the second key is running on a much worse server (CPU usage above 80%, 64 GB of RAM), it generates 2% more credits.

In this epoch the key AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q is at around position 2700. You can compare it with 8JEabVuVHztGdX55zFYUssDCYH3gpdktWCHJsK7NReqb (https://metrics.stakeconomy.com/d/f2b2HcaGz/solana-community-validator-dashboard?orgId=1&var-pubkey=8JEabVuVHztGdX55zFYUssDCYH3gpdktWCHJsK7NReqb&var-server=triangle13&var-inter=1m&var-netif=All&from=now-24h&to=now&refresh=1m), which is currently at position 1: it has very similar hardware and very similar stats, but the key AtkFWxD8dEkjSoNHBwMFMMv84WY2ore5XDWWjvLfBZ3q is generating 2.1% fewer credits than that one.

You can see the difference in credit generation on this graph: solana_key_credits. Both servers have CPU usage at around 42%, 128 GB of RAM and IOWAIT around 0, and neither rebooted during the current epoch. Yet if you sort all servers by credits, the two keys are around 2700 positions apart, a difference of ~4.2% in votes.

ghost commented 2 years ago

I also confirm this problem. My key is 52gX6aMESU8visvPwFVZBCxrPSkA86fvV43KkLUK56xR. For several epochs I was on Hetzner (nvme ssd / 128 / 3900X), at position 2300-2400 by skip. I moved to Ikoula to avoid concentration (nvme ssd / 64! / 3900X): no effect, still position 2300-2400. Now I have moved to IONOS (nvme ssd / 128 / Epyc 7302P): no effect, credits in the same place. Everywhere I use the same configuration files, accounts in ramdisk, snapshots in ramdisk, the same operating system (Ubuntu 20) and kernel, and the system is optimized according to the recommendations of doc.solana.com. On the other hand, my wife has the key wegaXwEgNQ2CQVvZtLEPLVvGtV5gotwCxMGpGm4xshu. She also moved through all these servers (by transferring the keys), and her credits position, originally in the range of 300-500, has stayed there. At the same time, for short periods it gets into the top 100 in terms of credits. This cannot be explained by chance. The data centers are located in Germany, France and the USA (two locations).

More information. My current datacenter, 8560-US-Abington, hosts another testnet validator, 8JEabVuVHztGdX55zFYUssDCYH3gpdktWCHJsK7NReqb. I don't know exactly what configuration it has, but there are only two options: either an Epyc 7302P / 128 like mine, or a Ryzen 3900X / 128 or less. At the moment it is in 2nd place in terms of credits and, while I have been watching it, has never dropped below 10th place, and even that was apparently while its software was being reinstalled.

cattivik66 commented 2 years ago

This is an incredible discovery that, if it is true, which apparently it is hereby proven based on these posts, some sort of voodoo curse is being put on these low performance keys by the great Solana tiki gods in the sky hovering over Solana Beach? come on folks, this is a seriously funny post. I've been programming and playing star trek on mainframes since 1975 and i thought i had heard all the IT jokers, but this one is hilarious. thanks for the laughs. maybe the computer you were testing it with was getting vibrated weirdly by a pack of elephants stampeding in india that set off a tsunami that vibrated the undersea fiber cables worldwide because it was in perfect resonance, and that delayed the packet transmissions.

This is a very "useful" post, thanks for your contribution. For the future please try to be more constructive.

bkapolicefund commented 2 years ago

sorry but i pretty much gave up on doing any constructive work for Solana a long time ago because of just this type of jabberwocky, like this forum here and the posts in it, as there is no "laboratory grade test bed" available to determine whether things like this are scientifically provable. In science, in order to perform a test, you need to have total control of the environment for a test to be conducted and prove ANYTHING. otherwise it's all jabberwocky. So you need a lab with 100 PCs all running the exact same version, having the exact same log entries and the exact same hardware, and they must be started and run on the exact same network. Then load a PC with test key number 1, the one showing a normal skip rate, and run the test for exactly one day. Then reset all 101 PCs the same way as the day before starting the test, load test key number 2, the one showing a high skip rate, and run the test for exactly one day. Now if you cannot do the test like this, then even wasting time jabberwockying about it is just that - hot air to fill your balloon to fly in...