solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

~1% of nodes are repairing almost every shred in almost every slot #33166

Open behzadnouri opened 1 year ago

behzadnouri commented 1 year ago

Problem

Based on the shred_insert_is_full metric from mainnet-beta, it seems that ~1% of nodes are repairing almost every shred in almost every slot.
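
For reference, a minimal sketch of the per-slot classification this metric suggests, assuming a slot counts as "fully repaired" when (nearly) all of its data shreds arrived via repair or erasure recovery rather than turbine. The field names mirror the shred_insert_is_full datapoints queried later in this thread; the exact threshold is an assumption, not the query actually used.

```rust
// Hedged sketch, not the exact query/threshold used in this investigation.
struct ShredInsertIsFull {
    last_index: u64,    // index of the last data shred in the slot (0-based)
    num_repaired: u64,  // data shreds inserted from repair responses
    num_recovered: u64, // data shreds recovered from erasure coding
}

fn is_fully_repaired(d: &ShredInsertIsFull) -> bool {
    // A slot with last_index N has N + 1 data shreds; if repair plus recovery
    // accounts for all of them, turbine effectively delivered nothing.
    d.num_repaired + d.num_recovered >= d.last_index + 1
}

fn main() {
    let sample = ShredInsertIsFull { last_index: 431, num_repaired: 400, num_recovered: 32 };
    println!("fully repaired: {}", is_fully_repaired(&sample));
}
```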

Proposed Solution


In the time interval 2023-09-06T13:59:28Z to 2023-09-06T14:59:27Z (7907 slots), the offending nodes seem to be:

node | number of slots fully repaired
8c4dZiN5MQQKVRjL3fpMx4cHbuU91XSjWKLTq4w2pneW    7907
EXAG1GJgsQKfML6DwdJbJ4hpotyDYhH1ZGuDhdNZZTn4    7907
5MC2bsQWTbKSCG6aZhTTGwCk7kzkoLH5YjgZFkn3vN4c    7907
Aowy8a3Yih7h4fVuBJ9m4zTSTR5RbutGihVgztkDQLPe    7907
3UrNqi8XYh5bh8CE4JF1256fLKX8eenrAXQJ4Y75Do6c    7907
HUjU6MUEtpzj8PpLEqFPLGC5VynkWfHkXZpskEfM6GdY    7907
BmjKDyDbPgkt8Pz7F5tr2GNFKqfg7PA9wPJjevcFyJ9Y    7907
EbuSoLTjhTFZbujGCvhtFgvaMbXjAVQPL3bUpGD62D4M    7907
A85mL9EUKnq8r52WpyHMLuZw77bv4buZrbAu8cx8BHjg    7907
DZ2omywE3SYm2SQXMWwGbcVUTWcjjZYA9RJxq7h25rif    7907
77RUN7CWAtQCVpdBcSkNU7jJaKtxy6rCcss2y9QBiMMB    7907
HTTod6mktmDvM5B1fHdyf8uUTaAL9HLTY9MaLRgCDrPd    7907
CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p    7907
C4vi5tE3tTpBLMheM55t7NRinp9fvMywqaiQhr5YUSy8    7907
AE88LqgCpV7i3GD3JcW9FLGA48N32UsuSagtiPVzHAtH    7907
2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg    7907
87T8k8EXRPxsTpF8TXGSWVSf4Bheaz4GEy73BPwm2CwH    7907
53qAF5HupqQMWa6Dy1Rsd6c3FDGrTo1hec5itK4YZGov    7906
9xkVy66XvA5VtMdsVHmqkBefoeQh7Hb4sJhMMhfJUULD    7906
6XkchfsH5LYpPpd9uPDyoKQtYZXPgK3h94nWjJZdwtdc    7905
7AVXHqRUU7xZDkQvGikrsTbjBeAYcKHz4ArRfECkz71E    7904
Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD    7902
9XV9hf2oBdJdrcPaLWLALpnCRu9t5Cozd4hJWMZnF4iu    7884
dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP     7811
Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn    7798
Eye7jGJ4XypieFsVGBwm8VijbM5qPRJBU7nxGNwpukvT    7750
6zAsEYWbtV4skvZGM8Vddd2XYECYatodyxq3asjRyJTY    6986
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u    6190
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF    6107
4oenxERWkYjZfmoTeRPRgXmMF9kkqUzgKSPjxL6gP67u    6067

Note that the majority of the above nodes are staked; in particular:

Identity                                      Vote Account                            Commission  Last Vote        Root Slot     Skip Rate  Credits  Version            Active Stake
8c4dZiN5MQQKVRjL3fpMx4cHbuU91XSjWKLTq4w2pneW  G2b1Pi9kf3KFX73V7kD3PdPxYL3iJxc9WubMa6Az1kWs   10%  215952431 ( -4)  215952400 ( -4)    -      365423  1.14.23     66268.738184708 SOL (0.02%)
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF  576M4qCLi4H3yzcvADMMjNA8bDrJuiYZz2XCXMxGvaPe   10%  215952435 (  0)  215952404 (  0)   6.25%   364401  1.14.24     67200.146615542 SOL (0.02%)
Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn  DDCWKg1piRArvzrfu3QC9bPKANz71LpYRuQaSJ2FnNQy    8%  215952433 ( -2)  215952402 ( -2)   2.08%   372696   1.16.9     67640.318607768 SOL (0.02%)
2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg  7QFPC5o5dmJaKwntE8bu2ohvT91y9eouXAMRMM1Y7VRP   10%  215952433 ( -2)  215952402 ( -2)   0.00%   373212  1.14.23     68507.047162100 SOL (0.02%)
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u  A2XnPCfYMt8inTUX3DhbVZ6ZW7pmXquLKJ1knGupjaKS   10%  215952432 ( -3)  215952401 ( -3)   8.93%   363133  1.14.24     68699.255295438 SOL (0.02%)
Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD  2tucttroqFNXsrYeMBQ8LfzKNfgwT2rHBzAF6RzbbHEp    0%  215952430 ( -5)  215952399 ( -5)  68.42%   372392  1.14.24    200318.059829372 SOL (0.05%)

cc @jbiseda @bw-solana

jbiseda commented 1 year ago

4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF is interesting. It was relying solely on repair, then operated normally for ~3 days before relying on repair again.

[screenshot]

SELECT mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired", mean("last_index") AS "mean_last_index" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF' GROUP BY time(:interval:) FILL(null)

Looks like this node upgraded/downgraded at the inflection points.

| time | validator-new.cluster_type | validator-new.id | validator-new.version |
| ---- | -------------------------- | ---------------- | --------------------- |
| 09/02/2023 07:10:05 | | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.16.10 (src:382b3f16; feat:3712769919, client:SolanaLabs) |
| 09/05/2023 16:21:26 | | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 (src:605da4a3; feat:1879391783) |

Incoming repair traffic is also reduced during the time we rely on repair. Perhaps ports changed and were not reflected in gossip?

[screenshot]

SELECT mean("total_requests") AS "mean_total_requests" FROM "mainnet-beta"."autogen"."serve_repair-requests_received" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF' GROUP BY time(:interval:) FILL(null)

It appears that this node was started with --no-os-network-stats-reporting.

bw-solana commented 1 year ago

Looking at 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF over the last hour, I'm not seeing any evidence that turbine shreds are landing at all. Every successful blockstore insertion came via repair. We are seeing some shreds fail to land because the shred already exists, and these could in theory be coming from turbine, but I doubt it.

The reason I doubt this is that you would expect at least one turbine data shred to make it in before repair. Metrics also indicate that zero coding shreds are getting inserted, which makes sense given we don't repair coding shreds.

[screenshot]

It also appears that shred_fetch.shred_count is stuck at zero while shred_fetch_repair.shred_count accounts for everything going into blockstore:

[screenshot]

jbiseda commented 1 year ago

Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn looks similar in that it stopped receiving turbine shreds following a restart:

[screenshot]

SELECT mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired", mean("last_index") AS "mean_last_index" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)

Restart at:

time                 id                                            version
08/26/2023 11:19:25  Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn  1.16.9 (src:2c20d87b; feat:3712769919, client:SolanaLabs)

SELECT "cluster_type", "id", "version" FROM "mainnet-beta"."autogen"."validator-new" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn'

Network traffic:

[screenshot]

SELECT mean("in_datagrams_delta") AS "mean_in_datagrams_delta", mean("out_datagrams_delta") AS "mean_out_datagrams_delta" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)

Interestingly, after the restart we don't see recv errors.

[screenshot]

SELECT mean("in_errors_delta") AS "mean_in_errors_delta", mean("rcvbuf_errors_delta") AS "mean_rcvbuf_errors_delta", mean("in_csum_errors_delta") AS "mean_in_csum_errors_delta", mean("no_ports_delta") AS "mean_no_ports_delta", mean("sndbuf_errors") AS "mean_sndbuf_errors" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)

The raw error count also went down, but not to 0 as would be expected for a machine restart. Could this happen with NIC removal?

[screenshot]

SELECT mean("in_errors_delta") AS "mean_in_errors_delta", mean("rcvbuf_errors_delta") AS "mean_rcvbuf_errors_delta", mean("in_csum_errors_delta") AS "mean_in_csum_errors_delta", mean("no_ports_delta") AS "mean_no_ports_delta", mean("sndbuf_errors") AS "mean_sndbuf_errors", mean("in_errors") AS "mean_in_errors", mean("rcvbuf_errors") AS "mean_rcvbuf_errors" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)
jbiseda commented 1 year ago

dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP also shows a change in repair behavior around a validator restart:

[screenshot]

SELECT mean("last_index") AS "mean_last_index", mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)

SELECT mean("shred_count") AS "mean_shred_count" FROM "mainnet-beta"."autogen"."shred_fetch" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)

SELECT mean("total_requests") AS "mean_total_requests" FROM "mainnet-beta"."autogen"."serve_repair-requests_received" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)
time validator-new.id validator-new.version
08/25/2023 12:49:58 dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs)
08/28/2023 07:25:03 dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs)
08/29/2023 04:53:33 dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs)
08/29/2023 09:58:53 dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP 1.14.24 (src:devbuild; feat:1879391783)
SELECT "id", "version" FROM "mainnet-beta"."autogen"."validator-new" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP'

Prior to the first restart on 8/25 the validator was receiving some turbine shreds and some incoming repair requests. Following the restart no incoming turbine shreds or repair requests are seen.

behzadnouri commented 1 year ago

Looking at the gossip spy data, I see a bunch of the above nodes broadcasting localhost as their IP address (a quick routability check is sketched after the table below):

IP Address        |Age(ms)| Node identifier                              | Version |Gossip|TPUvote| TPU  |TPUfwd| TVU  |TVUfwd|Repair|ServeR|ShredVer
------------------+-------+----------------------------------------------+---------+------+-------+------+------+------+------+------+------+--------
109.194.51.129    |   234 | 4oenxERWkYjZfmoTeRPRgXmMF9kkqUzgKSPjxL6gP67u | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.137    |    68 | G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.177    |   546 | AE88LqgCpV7i3GD3JcW9FLGA48N32UsuSagtiPVzHAtH | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.63     |    82 | Eye7jGJ4XypieFsVGBwm8VijbM5qPRJBU7nxGNwpukvT | 1.14.20 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.64     |  1029 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.78     |  1216 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.88     |    62 | 9xkVy66XvA5VtMdsVHmqkBefoeQh7Hb4sJhMMhfJUULD | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.91     |    66 | EbuSoLTjhTFZbujGCvhtFgvaMbXjAVQPL3bUpGD62D4M | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
127.0.0.1         |   132 | BmjKDyDbPgkt8Pz7F5tr2GNFKqfg7PA9wPJjevcFyJ9Y | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |   194 | 77RUN7CWAtQCVpdBcSkNU7jJaKtxy6rCcss2y9QBiMMB | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1         |   292 | 7AVXHqRUU7xZDkQvGikrsTbjBeAYcKHz4ArRfECkz71E | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |   653 | 6XkchfsH5LYpPpd9uPDyoKQtYZXPgK3h94nWjJZdwtdc | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |   992 | 8c4dZiN5MQQKVRjL3fpMx4cHbuU91XSjWKLTq4w2pneW | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |  1086 | Aowy8a3Yih7h4fVuBJ9m4zTSTR5RbutGihVgztkDQLPe | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1         |  1244 | DZ2omywE3SYm2SQXMWwGbcVUTWcjjZYA9RJxq7h25rif | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |  2070 | C4vi5tE3tTpBLMheM55t7NRinp9fvMywqaiQhr5YUSy8 | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1         |  2114 | 5MC2bsQWTbKSCG6aZhTTGwCk7kzkoLH5YjgZFkn3vN4c | 1.14.23 | 9031 |  9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1         |  2464 | 53qAF5HupqQMWa6Dy1Rsd6c3FDGrTo1hec5itK4YZGov | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1         | 11082 | EXAG1GJgsQKfML6DwdJbJ4hpotyDYhH1ZGuDhdNZZTn4 | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1         | 20517 | 3UrNqi8XYh5bh8CE4JF1256fLKX8eenrAXQJ4Y75Do6c | 1.14.24 | 8001 |  9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
129.213.40.161    |   178 | 87T8k8EXRPxsTpF8TXGSWVSf4Bheaz4GEy73BPwm2CwH | 1.14.20 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
146.19.24.227     |     0 | 6zAsEYWbtV4skvZGM8Vddd2XYECYatodyxq3asjRyJTY | 1.16.9  | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
185.6.13.180      |  8352 | HTTod6mktmDvM5B1fHdyf8uUTaAL9HLTY9MaLRgCDrPd | 1.14.17 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.189      |  2274 | 9XV9hf2oBdJdrcPaLWLALpnCRu9t5Cozd4hJWMZnF4iu | 1.14.17 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
217.110.201.116   |  1514 | Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn | 1.16.9  | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
35.75.180.30      |     6 | Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD | 1.14.24 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
5.199.170.85      |   537 | CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p | 1.14.24 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
66.45.238.30      |  1508 | 2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg | 1.14.23 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
69.166.3.194      |  2010 | HUjU6MUEtpzj8PpLEqFPLGC5VynkWfHkXZpskEfM6GdY | 1.14.26 | 8001 |  8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
69.166.3.195      |  1520 | A85mL9EUKnq8r52WpyHMLuZw77bv4buZrbAu8cx8BHjg | 1.14.26 | 8001 |  8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
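
A minimal sketch of the sanity check this suggests (not part of the validator codebase): a node whose gossip contact info advertises a loopback or otherwise non-routable address cannot be reached by turbine, which would explain its blocks arriving only via repair. The excluded ranges below are an assumption for illustration.

```rust
use std::net::IpAddr;

// Rough routability check for an address advertised in gossip.
fn is_publicly_routable(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => {
            !(v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified())
        }
        IpAddr::V6(v6) => !(v6.is_loopback() || v6.is_unspecified()),
    }
}

fn main() {
    for ip in ["127.0.0.1", "109.194.51.64"] {
        let addr: IpAddr = ip.parse().unwrap();
        println!("{ip}: publicly routable = {}", is_publicly_routable(addr));
    }
}
```
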
behzadnouri commented 1 year ago
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF

switch between turbine and repair quite frequently without a node restart. I checked some logs, and they do indeed send repair requests from the same IP address, so it is very weird that they stop receiving shreds from turbine. There is also no net-stats-validator metric for either of the nodes. :thinking:

IP Address        |Age(ms)| Node identifier                              | Version |Gossip|TPUvote| TPU  |TPUfwd| TVU  |TVUfwd|Repair|ServeR|ShredVer
109.194.51.137    |    68 | G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.64     |  1029 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177

Identity                                      Vote Account                            Commission  Last Vote        Root Slot     Skip Rate  Credits  Version            Active Stake
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u  A2XnPCfYMt8inTUX3DhbVZ6ZW7pmXquLKJ1knGupjaKS   10%  216131323 ( -3)  216131292 ( -3)   0.00%   127398  1.14.24     68646.522416858 SOL (0.02%)
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF  576M4qCLi4H3yzcvADMMjNA8bDrJuiYZz2XCXMxGvaPe   10%  216131319 ( -7)  216131288 ( -7)   0.00%   127524  1.14.24     67146.939143975 SOL (0.02%)

[screenshots: G9qsqKn shred-fetch, 4vX1fgJnj shred-fetch]

bw-solana commented 1 year ago
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF

switch between turbine and repair quite frequently without a node restart. I checked some logs, and they do indeed send repair requests from the same IP address, so it is very weird that they stop receiving shreds from turbine. There is also no net-stats-validator metric for either of the nodes. 🤔

Interesting... throwing out some possibilities:

- Node gets dropped from turbine tree somehow
- IP being used in turbine is stale/incorrect

One interesting thing I'm observing is a bunch of QUIC connection activity at the start of the period where nodes begin receiving turbine shreds:

[screenshot]

I believe we're still using UDP for Turbine, right? How about repair?

behzadnouri commented 1 year ago

Node gets dropped from turbine tree somehow

For staked nodes we always include them in the tree even if we don't have their gossip contact-info. So this is unlikely since these nodes are staked, unless they were staked very recently and are not yet in the epoch stakes.
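
A minimal sketch of that idea, using placeholder types rather than the actual cluster_nodes implementation: staked identities stay in the set of turbine-tree candidates even when gossip has no contact info for them, so a missing gossip entry alone should not drop a staked node from the tree.

```rust
use std::collections::{HashMap, HashSet};

type Pubkey = String;      // placeholder for solana_sdk::pubkey::Pubkey
type ContactInfo = String; // placeholder for the gossip contact info (TVU address etc.)

// Staked identities remain turbine-tree candidates even when gossip has no
// contact info for them; unstaked nodes are only known through gossip.
fn turbine_tree_candidates(
    epoch_stakes: &HashMap<Pubkey, u64>,
    gossip_peers: &HashMap<Pubkey, ContactInfo>,
) -> Vec<(Pubkey, u64, Option<ContactInfo>)> {
    let mut keys: HashSet<&Pubkey> = epoch_stakes.keys().collect();
    keys.extend(gossip_peers.keys());
    keys.into_iter()
        .map(|key| {
            let stake = epoch_stakes.get(key).copied().unwrap_or(0);
            // A staked node with no contact info stays in the tree; it simply
            // cannot be sent shreds until its gossip entry shows up.
            (key.clone(), stake, gossip_peers.get(key).cloned())
        })
        .collect()
}

fn main() {
    let stakes = HashMap::from([("staked-but-not-in-gossip".to_string(), 1_000u64)]);
    let gossip = HashMap::from([("gossip-only".to_string(), "1.2.3.4:8000".to_string())]);
    for (node, stake, contact) in turbine_tree_candidates(&stakes, &gossip) {
        println!("{node}: stake={stake} contact={contact:?}");
    }
}
```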

IP being used in turbine is stale/incorrect

From a 3rd validator's logs, I can see it is sending repair responses to the same IP address, so it seems like the IP address is OK. There might be some issue at the TVU port.

[2023-09-07T14:09:08.895757045Z INFO  solana_streamer::streamer] streamer send Repair hosts: count:1000 [(109.194.51.64, SendStats { bytes: 36210, count: 30 }), (34.142.67.255, SendStats { bytes: 35003, count: 29 }), (109.194.51.137, SendStats { bytes: 33796, count: 28 }), (135.181.240.199, SendStats { bytes: 33796, count: 28 }), (109.194.51.78, SendStats { bytes: 32589, count: 27 })]
[2023-09-07T14:15:07.437366168Z INFO  solana_streamer::streamer] streamer send Repair hosts: count:1000 [(109.194.51.137, SendStats { bytes: 21726, count: 18 }), (45.250.255.105, SendStats { bytes: 18105, count: 15 }), (208.91.107.14, SendStats { bytes: 18105, count: 15 }), (35.75.180.30, SendStats { bytes: 22933, count: 19 }), (202.8.9.108, SendStats { bytes: 20519, count: 17 })]

I believe we're still using UDP for Turbine, right? How about repair?

yeah, it is all UDP. No quic code on v1.14 or v1.16 branch.

bw-solana commented 1 year ago

Really interesting data point that I don't know what to make of... The periods in which 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF and G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u seem to be receiving turbine shreds are correlated with their leader slots. This also explains why we see the QUIC activity jump around these periods.

This means the next time we should see 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF start receiving turbine shreds again is around slot 216226940, and for G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u around slot 216179224.

bw-solana commented 1 year ago
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF

switch between turbine and repair quite frequently without a node restart.

Looks like both of these nodes are in 59713-RU-Kursk

behzadnouri commented 1 year ago

I sent some random packets to the TVU and repair sockets of the nodes below:

IP Address        |Age(ms)| Node identifier                              | Version |Gossip|TPUvote| TPU  |TPUfwd| TVU  |TVUfwd|Repair|ServeR|ShredVer
------------------+-------+----------------------------------------------+---------+------+-------+------+------+------+------+------+------+--------
109.194.51.78     |   225 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP  | 1.14.27 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
129.213.40.161    |   226 | 87T8k8EXRPxsTpF8TXGSWVSf4Bheaz4GEy73BPwm2CwH | 1.14.20 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.180      |   116 | HTTod6mktmDvM5B1fHdyf8uUTaAL9HLTY9MaLRgCDrPd | 1.14.17 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.189      |  3926 | 9XV9hf2oBdJdrcPaLWLALpnCRu9t5Cozd4hJWMZnF4iu | 1.14.17 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
217.110.201.116   |   502 | Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn | 1.14.27 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
35.75.180.30      |   550 | Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD | 1.14.24 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
5.199.170.85      |   199 | CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p | 1.14.27 | 8000 |  8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
66.45.238.30      |   152 | 2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg | 1.14.27 | 8001 |  8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
69.166.3.194      |   192 | HUjU6MUEtpzj8PpLEqFPLGC5VynkWfHkXZpskEfM6GdY | 1.14.26 | 8001 |  8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
69.166.3.195      |    51 | A85mL9EUKnq8r52WpyHMLuZw77bv4buZrbAu8cx8BHjg | 1.14.26 | 8001 |  8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177

and none show up in the metrics. I tested two other good nodes as a control group, and those indeed showed a clear spike in the metrics.

Quite odd that even the metrics from their repair socket do not show a spike from the random packets, even though they are able to receive repair responses on that socket.
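
For reference, a minimal sketch of this kind of probe; the target IP (a TEST-NET-1 placeholder) and the 8000/8007 ports are placeholders, not any of the nodes above. The idea is to fire junk UDP datagrams at the advertised TVU and repair ports and then watch whether the node's shred_fetch / serve_repair metrics tick up.

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Placeholder target: NOT one of the nodes listed above.
    let target_ip = "192.0.2.1";
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    let junk = [0u8; 1228]; // roughly shred-sized payload
    for port in [8000u16, 8007] {
        // placeholder TVU / repair ports, mirroring the gossip table layout
        for _ in 0..10 {
            socket.send_to(&junk, (target_ip, port))?;
        }
    }
    println!("sent probes; now watch the node's shred_fetch / serve_repair metrics");
    Ok(())
}
```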

jbiseda commented 1 year ago

Quite odd that even the metrics from their repair socket do not show a spike from the random packets, even though they are able to receive repair responses on that socket.

This could be due to NAT: incoming packets from a remote address are only permitted following an outgoing packet to that address.
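
A minimal illustration of that behavior with std::net and a placeholder peer address: the outbound request is what creates the NAT/conntrack mapping, so the matching response can come back in, while unsolicited traffic (like turbine shreds pushed by a node we never contacted) is dropped before it reaches the socket.

```rust
use std::net::UdpSocket;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    socket.set_read_timeout(Some(Duration::from_secs(2)))?;

    // Outbound repair request: this outgoing datagram is what creates the
    // NAT/conntrack mapping (the "pinhole") for the peer address.
    socket.send_to(b"repair-request", "192.0.2.1:8008")?; // placeholder peer

    // A reply from that same peer can now traverse the NAT back to this socket.
    let mut buf = [0u8; 1500];
    match socket.recv_from(&mut buf) {
        Ok((n, from)) => println!("got {n} bytes back from {from}"),
        Err(e) => println!("no reply: {e}"),
    }

    // An unsolicited datagram from a peer we never sent to (e.g. a turbine
    // shred pushed by an upstream node) has no matching mapping and is
    // dropped by the NAT before it ever reaches this socket.
    Ok(())
}
```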

behzadnouri commented 1 year ago

Two of the nodes:

CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p
Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn

have identified issues in their firewall configuration and have updated it to open the respective ports/protocols.

Frixoe commented 1 year ago

This is CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p.

We realized that we hadn't updated our firewall config to open the following ports:

8000-8020/udp
8001/tcp
8000/tcp

Apologies for the trouble.