behzadnouri opened this issue 1 year ago
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF is interesting. It was relying solely on repair, then operated normally for ~3 days before relying on repair again.
```sql
SELECT mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired", mean("last_index") AS "mean_last_index" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF' GROUP BY time(:interval:) FILL(null)
```
Looks like this node upgraded and then downgraded at the inflection points:
| time | validator-new.id | validator-new.version |
| ---- | ---------------- | --------------------- |
| 09/02/2023 07:10:05 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.16.10 (src:382b3f16; feat:3712769919, client:SolanaLabs) |
| 09/05/2023 16:21:26 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 (src:605da4a3; feat:1879391783) |
Incoming repair traffic is also reduced during the periods when the node relies on repair. Perhaps its ports changed and were not reflected in gossip?
```sql
SELECT mean("total_requests") AS "mean_total_requests" FROM "mainnet-beta"."autogen"."serve_repair-requests_received" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF' GROUP BY time(:interval:) FILL(null)
```
It appears that this node was started with `--no-os-network-stats-reporting`.
Looking at 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF over the last hour, I'm not seeing any evidence that turbine shreds are landing at all. Every successful blockstore insertion came via repair. We are seeing some shreds fail to land because the shred already exists, and these could in theory be coming from turbine, but I doubt it.
The reason I doubt this is that you would expect at least one turbine data shred to make it before repair does. Metrics also indicate that zero coding shreds are getting inserted, which makes sense given that we don't repair coding shreds; any inserted coding shred would have had to arrive via turbine.
It also appears that `shred_fetch.shred_count` is stuck at zero while `shred_fetch_repair.shred_count` accounts for everything going into blockstore.
Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn looks similar in that it stopped receiving turbine shreds following a restart:
```sql
SELECT mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired", mean("last_index") AS "mean_last_index" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)
```
Restart at:

| time | validator-new.id | validator-new.version |
| ---- | ---------------- | --------------------- |
| 08/26/2023 11:19:25 | Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn | 1.16.9 (src:2c20d87b; feat:3712769919, client:SolanaLabs) |
```sql
SELECT "cluster_type", "id", "version" FROM "mainnet-beta"."autogen"."validator-new" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn'
```
Network traffic:
```sql
SELECT mean("in_datagrams_delta") AS "mean_in_datagrams_delta", mean("out_datagrams_delta") AS "mean_out_datagrams_delta" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)
```
Interestingly, after the restart we don't see recv errors.
```sql
SELECT mean("in_errors_delta") AS "mean_in_errors_delta", mean("rcvbuf_errors_delta") AS "mean_rcvbuf_errors_delta", mean("in_csum_errors_delta") AS "mean_in_csum_errors_delta", mean("no_ports_delta") AS "mean_no_ports_delta", mean("sndbuf_errors") AS "mean_sndbuf_errors" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)
```
The raw error count also went down, but not to 0 as would be expected for a machine restart. Could this happen with NIC removal?
```sql
SELECT mean("in_errors_delta") AS "mean_in_errors_delta", mean("rcvbuf_errors_delta") AS "mean_rcvbuf_errors_delta", mean("in_csum_errors_delta") AS "mean_in_csum_errors_delta", mean("no_ports_delta") AS "mean_no_ports_delta", mean("sndbuf_errors") AS "mean_sndbuf_errors", mean("in_errors") AS "mean_in_errors", mean("rcvbuf_errors") AS "mean_rcvbuf_errors" FROM "mainnet-beta"."autogen"."net-stats-validator" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn' GROUP BY time(:interval:) FILL(null)
```
dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP also shows a change in repair behavior around its validator restarts:
```sql
SELECT mean("last_index") AS "mean_last_index", mean("num_recovered") AS "mean_num_recovered", mean("num_repaired") AS "mean_num_repaired" FROM "mainnet-beta"."autogen"."shred_insert_is_full" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)

SELECT mean("shred_count") AS "mean_shred_count" FROM "mainnet-beta"."autogen"."shred_fetch" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)

SELECT mean("total_requests") AS "mean_total_requests" FROM "mainnet-beta"."autogen"."serve_repair-requests_received" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP' GROUP BY time(:interval:) FILL(null)
```
| time | validator-new.id | validator-new.version |
| ---- | ---------------- | --------------------- |
| 08/25/2023 12:49:58 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs) |
| 08/28/2023 07:25:03 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs) |
| 08/29/2023 04:53:33 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.16.9 (src:00000000; feat:3712769919, client:SolanaLabs) |
| 08/29/2023 09:58:53 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.14.24 (src:devbuild; feat:1879391783) |
```sql
SELECT "id", "version" FROM "mainnet-beta"."autogen"."validator-new" WHERE time > :dashboardTime: AND time < :upperDashboardTime: AND "host_id"='dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP'
```
Prior to the first restart on 8/25, the validator was receiving some turbine shreds and some incoming repair requests. Following the restart, no incoming turbine shreds or repair requests are seen.
Looking at the gossip spy data, I see a bunch of the above nodes broadcasting localhost as their IP address:
```
IP Address |Age(ms)| Node identifier | Version |Gossip|TPUvote| TPU |TPUfwd| TVU |TVUfwd|Repair|ServeR|ShredVer
------------------+-------+----------------------------------------------+---------+------+-------+------+------+------+------+------+------+--------
109.194.51.129 | 234 | 4oenxERWkYjZfmoTeRPRgXmMF9kkqUzgKSPjxL6gP67u | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.137 | 68 | G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.177 | 546 | AE88LqgCpV7i3GD3JcW9FLGA48N32UsuSagtiPVzHAtH | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.63 | 82 | Eye7jGJ4XypieFsVGBwm8VijbM5qPRJBU7nxGNwpukvT | 1.14.20 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.64 | 1029 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.78 | 1216 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.88 | 62 | 9xkVy66XvA5VtMdsVHmqkBefoeQh7Hb4sJhMMhfJUULD | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.91 | 66 | EbuSoLTjhTFZbujGCvhtFgvaMbXjAVQPL3bUpGD62D4M | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
127.0.0.1 | 132 | BmjKDyDbPgkt8Pz7F5tr2GNFKqfg7PA9wPJjevcFyJ9Y | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 194 | 77RUN7CWAtQCVpdBcSkNU7jJaKtxy6rCcss2y9QBiMMB | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1 | 292 | 7AVXHqRUU7xZDkQvGikrsTbjBeAYcKHz4ArRfECkz71E | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 653 | 6XkchfsH5LYpPpd9uPDyoKQtYZXPgK3h94nWjJZdwtdc | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 992 | 8c4dZiN5MQQKVRjL3fpMx4cHbuU91XSjWKLTq4w2pneW | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 1086 | Aowy8a3Yih7h4fVuBJ9m4zTSTR5RbutGihVgztkDQLPe | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1 | 1244 | DZ2omywE3SYm2SQXMWwGbcVUTWcjjZYA9RJxq7h25rif | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 2070 | C4vi5tE3tTpBLMheM55t7NRinp9fvMywqaiQhr5YUSy8 | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1 | 2114 | 5MC2bsQWTbKSCG6aZhTTGwCk7kzkoLH5YjgZFkn3vN4c | 1.14.23 | 9031 | 9035 | 9033 | 9034 | 9030 | 9032 | 9037 | 9038 | 56177
127.0.0.1 | 2464 | 53qAF5HupqQMWa6Dy1Rsd6c3FDGrTo1hec5itK4YZGov | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1 | 11082 | EXAG1GJgsQKfML6DwdJbJ4hpotyDYhH1ZGuDhdNZZTn4 | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
127.0.0.1 | 20517 | 3UrNqi8XYh5bh8CE4JF1256fLKX8eenrAXQJ4Y75Do6c | 1.14.24 | 8001 | 9054 | 8003 | 9053 | 9050 | 9051 | 9056 | 9057 | 56177
129.213.40.161 | 178 | 87T8k8EXRPxsTpF8TXGSWVSf4Bheaz4GEy73BPwm2CwH | 1.14.20 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
146.19.24.227 | 0 | 6zAsEYWbtV4skvZGM8Vddd2XYECYatodyxq3asjRyJTY | 1.16.9 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
185.6.13.180 | 8352 | HTTod6mktmDvM5B1fHdyf8uUTaAL9HLTY9MaLRgCDrPd | 1.14.17 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.189 | 2274 | 9XV9hf2oBdJdrcPaLWLALpnCRu9t5Cozd4hJWMZnF4iu | 1.14.17 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
217.110.201.116 | 1514 | Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn | 1.16.9 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
35.75.180.30 | 6 | Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD | 1.14.24 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
5.199.170.85 | 537 | CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p | 1.14.24 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
66.45.238.30 | 1508 | 2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg | 1.14.23 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
69.166.3.194 | 2010 | HUjU6MUEtpzj8PpLEqFPLGC5VynkWfHkXZpskEfM6GdY | 1.14.26 | 8001 | 8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
69.166.3.195 | 1520 | A85mL9EUKnq8r52WpyHMLuZw77bv4buZrbAu8cx8BHjg | 1.14.26 | 8001 | 8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
```
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u and 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF switch between turbine and repair quite frequently without a node restart. I checked some logs and they do send repair requests from the same IP address, so it is very weird that they stop receiving shreds from turbine. Also, there is no `net-stats-validator` metric for either of the nodes. 🤔
```
IP Address |Age(ms)| Node identifier | Version |Gossip|TPUvote| TPU |TPUfwd| TVU |TVUfwd|Repair|ServeR|ShredVer
------------------+-------+----------------------------------------------+---------+------+-------+------+------+------+------+------+------+--------
109.194.51.137 | 68 | G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
109.194.51.64 | 1029 | 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF | 1.14.24 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
```
```
Identity Vote Account Commission Last Vote Root Slot Skip Rate Credits Version Active Stake
G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u A2XnPCfYMt8inTUX3DhbVZ6ZW7pmXquLKJ1knGupjaKS 10% 216131323 ( -3) 216131292 ( -3) 0.00% 127398 1.14.24 68646.522416858 SOL (0.02%)
4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF 576M4qCLi4H3yzcvADMMjNA8bDrJuiYZz2XCXMxGvaPe 10% 216131319 ( -7) 216131288 ( -7) 0.00% 127524 1.14.24 67146.939143975 SOL (0.02%)
```
> G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u and 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF switch between turbine and repair quite frequently without a node restart. I checked some logs and they do send repair requests from the same IP address, so it is very weird that they stop receiving shreds from turbine. Also, there is no `net-stats-validator` metric for either of the nodes. 🤔
Interesting... throwing out some possibilities:

- Node gets dropped from turbine tree somehow
- IP being used in turbine is stale/incorrect
One interesting thing I'm observing is a bunch of QUIC connection activity at the start of the period where nodes begin receiving turbine shreds.
I believe we're still using UDP for Turbine, right? How about repair?
> Node gets dropped from turbine tree somehow
For staked nodes we always include them in the tree even if we don't have their gossip contact-info. So this is unlikely, since these nodes are staked, unless they were staked very recently and are not yet in the epoch-stakes.
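A rough sketch of that inclusion logic (simplified, hypothetical types; not the actual turbine code): staked nodes enter the retransmit set from the epoch stakes regardless of whether gossip has a ContactInfo for them, while unstaked nodes only make it in via gossip.

```rust
use std::collections::HashMap;

// Hypothetical simplified node record for illustration.
struct Node {
    pubkey: String,
    stake: u64,
    tvu_addr: Option<String>, // None when gossip has no ContactInfo
}

// Staked nodes are always part of the tree, even with no known address;
// unstaked nodes are only included if gossip knows how to reach them.
fn retransmit_nodes(
    epoch_stakes: &HashMap<String, u64>,
    gossip_addrs: &HashMap<String, String>,
) -> Vec<Node> {
    let mut nodes: Vec<Node> = epoch_stakes
        .iter()
        .map(|(pubkey, &stake)| Node {
            pubkey: pubkey.clone(),
            stake,
            tvu_addr: gossip_addrs.get(pubkey).cloned(),
        })
        .collect();
    for (pubkey, addr) in gossip_addrs {
        if !epoch_stakes.contains_key(pubkey) {
            nodes.push(Node {
                pubkey: pubkey.clone(),
                stake: 0,
                tvu_addr: Some(addr.clone()),
            });
        }
    }
    nodes
}
```

Under this scheme, the only way a staked node falls out of the tree is if its stake is not yet reflected in the epoch stakes.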
> IP being used in turbine is stale/incorrect
From a third validator's logs, I can see they are sending repair responses to the same IP address, so the IP address seems fine. Might be some issue at the TVU port.
```
[2023-09-07T14:09:08.895757045Z INFO solana_streamer::streamer] streamer send Repair hosts: count:1000 [(109.194.51.64, SendStats { bytes: 36210, count: 30 }), (34.142.67.255, SendStats { bytes: 35003, count: 29 }), (109.194.51.137, SendStats { bytes: 33796, count: 28 }), (135.181.240.199, SendStats { bytes: 33796, count: 28 }), (109.194.51.78, SendStats { bytes: 32589, count: 27 })]
[2023-09-07T14:15:07.437366168Z INFO solana_streamer::streamer] streamer send Repair hosts: count:1000 [(109.194.51.137, SendStats { bytes: 21726, count: 18 }), (45.250.255.105, SendStats { bytes: 18105, count: 15 }), (208.91.107.14, SendStats { bytes: 18105, count: 15 }), (35.75.180.30, SendStats { bytes: 22933, count: 19 }), (202.8.9.108, SendStats { bytes: 20519, count: 17 })]
```
> I believe we're still using UDP for Turbine, right? How about repair?
Yeah, it is all UDP. No QUIC code on the v1.14 or v1.16 branches.
Really interesting data point that I don't know what to make of... The periods when 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF and G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u seem to be receiving turbine shreds are correlated with their leader slots. This also explains why we see the QUIC activity jump around these periods.

This means the next time we should see 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF start receiving turbine shreds again is around slot 216226940, and G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u around slot 216179224.
> G9qsqKnvR77FAmcLtcrkMrFvcDGoQ7LQB5xaXc7mD83u and 4vX1fgJnjYbYeWJxTDipTUGh93vxxooTUp4Kxvv5KSUF switch between turbine and repair quite frequently without a node restart.
Looks like both of these nodes are in 59713-RU-Kursk.
I sent some random packets to the TVU socket and repair socket of the nodes below:
```
IP Address |Age(ms)| Node identifier | Version |Gossip|TPUvote| TPU |TPUfwd| TVU |TVUfwd|Repair|ServeR|ShredVer
------------------+-------+----------------------------------------------+---------+------+-------+------+------+------+------+------+------+--------
109.194.51.78 | 225 | dkisqBjujxMvG9HkmWK4aooockCCRcnYZEdR2fmN8TP | 1.14.27 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
129.213.40.161 | 226 | 87T8k8EXRPxsTpF8TXGSWVSf4Bheaz4GEy73BPwm2CwH | 1.14.20 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.180 | 116 | HTTod6mktmDvM5B1fHdyf8uUTaAL9HLTY9MaLRgCDrPd | 1.14.17 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
185.6.13.189 | 3926 | 9XV9hf2oBdJdrcPaLWLALpnCRu9t5Cozd4hJWMZnF4iu | 1.14.17 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
217.110.201.116 | 502 | Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn | 1.14.27 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
35.75.180.30 | 550 | Fc6NNdS2j3EmrWbU6Uqt6wsKB5ef72NjaWfNxKYbULGD | 1.14.24 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
5.199.170.85 | 199 | CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p | 1.14.27 | 8000 | 8005 | 8003 | 8004 | 8001 | 8002 | 8007 | 8008 | 56177
66.45.238.30 | 152 | 2NCzepxVangimxxDGLBiMYFfPS5ioRKr4aij6Rf668rg | 1.14.27 | 8001 | 8005 | 8003 | 8004 | 8000 | 8002 | 8007 | 8008 | 56177
69.166.3.194 | 192 | HUjU6MUEtpzj8PpLEqFPLGC5VynkWfHkXZpskEfM6GdY | 1.14.26 | 8001 | 8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
69.166.3.195 | 51 | A85mL9EUKnq8r52WpyHMLuZw77bv4buZrbAu8cx8BHjg | 1.14.26 | 8001 | 8006 | 8004 | 8005 | 8002 | 8003 | 8008 | 8009 | 56177
```
None of these packets show up in the metrics. I tested two other good nodes as a control group, and those indeed showed a clear spike in the metrics.
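For reference, a minimal sketch of such a probe (the target address and ports are taken from the gossip table above; the payload is arbitrary since all we care about is whether the packets register in the node's fetch-stage metrics):

```rust
use std::net::UdpSocket;

fn main() -> std::io::Result<()> {
    // Bind an ephemeral local port to send from.
    let socket = UdpSocket::bind("0.0.0.0:0")?;
    let payload = [0u8; 128]; // junk bytes; contents don't matter
    // TVU (8000) and Repair (8007) ports of 109.194.51.78 per gossip.
    for port in [8000u16, 8007] {
        for _ in 0..100 {
            socket.send_to(&payload, ("109.194.51.78", port))?;
        }
    }
    Ok(())
}
```

On a healthy node these datagrams should show up as a spike in the fetch-stage packet counts, even though they will fail shred sanitization and get discarded.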
Quite odd that even the metrics from their repair socket do not show a spike from the random packets, even though they are able to receive repair responses on that socket.
> Quite odd that even the metrics from their repair socket do not show a spike from the random packets, even though they are able to receive repair responses on that socket.
This could be due to NAT: incoming packets from a remote address are only permitted following an outgoing packet to that address.
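A toy model of that stateful behavior (illustrative only; the peer addresses are examples from the tables above): the node's own outgoing repair requests open pinholes that let the responses back in, while unsolicited turbine shreds, or our random probe packets, never match an entry and get dropped.

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Remembers which peers we have sent packets to; inbound traffic is
/// only admitted from those peers. (Real NATs track full 5-tuples and
/// expire entries, but the effect is the same.)
#[derive(Default)]
struct StatefulFilter {
    pinholes: HashSet<SocketAddr>,
}

impl StatefulFilter {
    fn on_outbound(&mut self, dst: SocketAddr) {
        self.pinholes.insert(dst);
    }
    fn allows_inbound(&self, src: SocketAddr) -> bool {
        self.pinholes.contains(&src)
    }
}

fn main() {
    let mut filter = StatefulFilter::default();
    let repair_peer: SocketAddr = "34.142.67.255:8008".parse().unwrap();
    let turbine_peer: SocketAddr = "5.199.170.85:8000".parse().unwrap();

    filter.on_outbound(repair_peer); // node sent a repair request
    assert!(filter.allows_inbound(repair_peer)); // response admitted
    assert!(!filter.allows_inbound(turbine_peer)); // unsolicited shred dropped
}
```

This would explain why repair responses land while turbine shreds and the probe packets do not.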
Two of the nodes, CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p and Bxps8LSWxZVx618iK8rHvnCD5f9u7cTYuXz5sLsAGuSn, have identified issues in their firewall configuration and have updated the firewall to open the respective ports/protocols.
This is CmGiehaqWfEZwyhmN9rckNtnVMeZtsjz1obusNWGyj4p.
We realized that we hadn't updated our firewall config to open the following ports:
- 8000-8020/udp
- 8001/tcp
- 8000/tcp
Apologies for the trouble.
Problem

Based on the `shred_insert_is_full` metric from mainnet-beta, it seems like ~1% of nodes are repairing almost every shred in every slot.

Proposed Solution

`ContactInfo` is not propagated in gossip, but why?

In the time interval (7907 slots), the offending nodes seem to be:

Note that the majority of the above nodes are staked, in particular
cc @jbiseda @bw-solana