solana-labs / solana

Web-Scale Blockchain for fast, secure, scalable, decentralized apps and marketplaces.
https://solanalabs.com
Apache License 2.0

Dropped vote from mismatch hash #11008

Closed: sakridge closed this issue 4 years ago

sakridge commented 4 years ago

Problem

On tds, dropped votes are being seen due to hash mismatches:

Jul 11 16:54:51 sol-testnet-01 solana-validator[37126]: [2020-07-11T16:54:51.646392973Z WARN  solana_vote_program::vote_state] Cssyq8cRFBHc9VQFLVPR3Awehyb7YNiMW1U7PeREbr5H dropped vote Vote { slots: [22481541], hash: 8j4WXKN56TLAcCU4u5WhtRHasfjzcs8N3TCpV2WBLU1p, timestamp: None } failed to match hash 8j4WXKN56TLAcCU4u5WhtRHasfjzcs8N3TCpV2WBLU1p 414B3Ji7EgFKXfSwpSYcS42TFoiyjkzRDCAhR41pnMnm

Likely due to async accounts clean.

Proposed Solution

Find the likely race condition in accounts_db.

sakridge commented 4 years ago

dropped-transactions

Hmm... I see some weird spots in the past few days.

mvines commented 4 years ago

Hey @ryoqun, can you please investigate this a little?

ryoqun commented 4 years ago

Investigation result:

It seems this isn't a software bug or race condition on our side, but a particular validator's operational issue.

I'm contacting the validator on discord.

Observations:

A validator with identity key FdC2FZ4geXeEh4pTPXWrNH4nqsriB3xD3GME9HPD8CVR (discord handle: trap098#0032) periodically submits vote transactions with bad hashes. The bad hashes diverge from the entire rest of the cluster, so this validator is the only one voting on a different ledger state. "Periodically" here means the validator votes correct hashes most of the rest of the time, for some reason. Once the mismatching starts, it continues for around 30-40 minutes, as observed across the cluster (see charts). It also happens at round-numbered wall-clock times, so maybe it's caused by some scheduled (cron?) validator management practised by the validator's operator? Or just attempted attacks? :) It has also been happening for quite some time (at least, I observed these error patterns from 7/15 onwards).

For example, the last 3 instances of mismatched votes were these:

1:

Our validator for RPC/API service observed this: 2020-07-22T00:31-2020-07-22T00:32 (UTC)

Cluster-wide mismatched vote count: Screenshot from 2020-07-22 15-56-17 (JST)

2:

Our validator for RPC/API service observed this: 2020-07-21T17:21-2020-07-21T17:24 (UTC)

Cluster-wide mismatched vote count: Screenshot from 2020-07-22 15-57-27 (JST)

3:

Our validator for RPC/API service observed this: 2020-07-21T12:01-2020-07-21T12:03 (UTC)

Cluster-wide mismatched vote count: Screenshot from 2020-07-22 15-57-58 (JST)

And there were no other obvious symptoms related to mismatched vote hashes besides the above. So I'm concluding there are currently no known race conditions on v1.2/testnet (tds).

ryoqun commented 4 years ago

@sakridge @mvines : fyi ^^

atomlab commented 4 years ago

This is my key: FdC2FZ4geXeEh4pTPXWrNH4nqsriB3xD3GME9HPD8CVR. How can I help?

Once the mismatching starts, it continues for around 30-40 minutes, as observed across the cluster (see charts). It also happens at round-numbered wall-clock times, so maybe it's caused by some scheduled (cron?) validator management practised by the validator's operator? Or just attempted attacks? :)

I have a cron job which restarts the validator when it falls behind.

2020-07-22T00:31-2020-07-22T00:32 (UTC)
2020-07-21T17:21-2020-07-21T17:24 (UTC)
2020-07-21T12:01-2020-07-21T12:03 (UTC)

Logs:

solana-validator.log.gz Jul 21 23:00:00 - Jul 22 02:00:00 (UTC) (78M)
solana-validator.log.gz Jul 21 16:00:00 - Jul 21 18:00:00 (UTC) (65M)
solana-validator.log.gz Jul 21 11:00:00 - Jul 21 13:00:00 (UTC) (60M)

Metrics Screenshot 2020-07-22 at 11 53 44

Screenshot 2020-07-22 at 12 16 08

Scheduled task

/etc/systemd/system/solana_restart_hook.service

[Unit]
Description=Solana restart hook

[Service]
ExecStart=/home/solana/solana_slots_per_10_min.py

[Install]
WantedBy=default.target

/etc/systemd/system/solana_restart_hook.timer

[Unit]
Description=Timer for solana_restart_hook

[Timer]
OnBootSec=30min
OnUnitActiveSec=30min
OnCalendar=*:0/30

[Install]
WantedBy=timers.target

Script

#!/usr/bin/python3

# Restart hook: query Prometheus/VictoriaMetrics for the validator's
# root-slot progress over the last 10 minutes and restart the validator
# service if no progress was made.

import requests as r
import json
import subprocess as sub

payload = {'query': 'delta(solana_validator_root_slot{pubkey="FqpvstKqfYi9wcm68Cctwnb5bu9C1agPoRAJwUQAUs83"}[10m])'}
try:
    req = r.get("http://prometheus:8428/api/v1/query_range", params=payload)
except Exception as err:
    print(err)
    exit(1)

try:
    # Take the first value of the first series in the result.
    result = req.json()['data']['result'][0]['values'][0][1]
    print(result)
except Exception as err:
    print(err)
    exit(1)

# The metric value comes back as a string and may be fractional,
# so convert via float() rather than int().
if float(result) < 1:
    print("Slots/10m=0")
    print("solana-validator restart")
    sub.check_output(['systemctl', 'restart', 'solana-validator'])
    exit(1)

exit(0)

ryoqun commented 4 years ago

@atomlab Thanks for providing all that info. I'm now certain about why your validator goes delinquent sometimes.

Jul 22 00:31:08 sol-testnet-01 solana-validator[7429]: [2020-07-22T00:31:08.183230894Z WARN  solana_runtime::native_loader] Failed to load: DlOpen { desc: "/usr/local/bin/deps/libsolana_bpf_loader_program.so: cannot open shared object file: No such file or directory" }

This is the root cause of the mismatched vote hash errors, which affect only you.

It means your validator fails to load some required library files and nevertheless continues to process transactions. The validator then produces a different transaction execution outcome, because it is running with partially disabled functionality compared to the others.

The reason this occurs intermittently is that most transactions don't need the missing library. When restarting from a different, newer snapshot, the validator no longer sees the particular transactions (problematic only for you), doesn't need to process them, and resumes service. After a while, when a new problematic transaction is submitted to the testnet, your validator again fails to process it, votes with mismatched hashes, and starts to go delinquent.

How did you install solana? I'm not sure, but solana-install should generally place the various files at the correct paths... (CC: @mvines @danpaul000)

Well, I've stumbled on this myself before (a failed dlopen and a validator that never stops, just keeps voting with mismatched hashes).

@jackcmay How about making this condition a panic! and checking that all required files are loadable at validator start-up? Because this causes mismatched vote hashes, I think this is certainly not something a validator should keep running through with merely WARN messages.
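
For illustration, here is a minimal sketch (not the actual native_loader code; the function name, library list, and path are hypothetical) of the kind of fail-fast start-up check being proposed: verify every native program library up front and panic! instead of warning and limping along.

use std::path::Path;

// Hypothetical start-up check (sketch only): verify that every native
// program library the validator will later try to dlopen is present,
// and abort instead of continuing with partially disabled functionality.
fn assert_native_programs_present(deps_dir: &Path, programs: &[&str]) {
    for program in programs {
        let path = deps_dir.join(format!("lib{}.so", program));
        if !path.exists() {
            // A missing library guarantees mismatched bank hashes later,
            // so failing fast here beats a mere WARN in the log.
            panic!("required native program library not found: {}", path.display());
        }
    }
}

fn main() {
    // Example with the library that was missing in this issue.
    assert_native_programs_present(
        Path::new("/usr/local/bin/deps"),
        &["solana_bpf_loader_program"],
    );
    println!("all native program libraries are present");
}

Where exactly such a check would live (validator start-up vs. the loader itself) is an implementation detail.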

CriesofCarrots commented 4 years ago

@t-nelson, the dropped vote testnet.solana.com logs you posted to Discord (https://discord.com/channels/428295358100013066/560503042458517505/735687533643563122) look similar. Related?

atomlab commented 4 years ago

@ryoqun

How did you install solana?

I just copied the binary files:

tar -xvf solana-release-x86_64-unknown-linux-gnu.tar.bz2
cp solana-release/bin/solana* /usr/local/bin/

UPDATE:

I have just copied the lib:

mkdir /usr/local/bin/deps
cp ./solana-release/bin/deps/libsolana_bpf_loader_program.so /usr/local/bin/deps/

and restarted solana-validator.

mvines commented 4 years ago

@jackcmay How about making this condition a panic! and checking that all required files are loadable at validator start-up? Because this causes mismatched vote hashes, I think this is certainly not something a validator should keep running through with merely WARN messages.

Yes, I agree @ryoqun. That deserves to be a panic!, as there's no way the validator process will be able to operate correctly.

t-nelson commented 4 years ago

@t-nelson, the dropped vote testnet.solana.com logs you posted to Discord look similar. Related?

Yep! :100:

@ryoqun testnet.solana.com is currently dead with these flooding the log from several (all?) validators. I'm just digging into it now. Will report any findings here

jackcmay commented 4 years ago

Yeah, panic! sounds good. I also have an outstanding change to mainline the bpf loader, but it's hung up on a few points.

ryoqun commented 4 years ago

@atomlab It looks like there are still some dropped vote errors coming from your validator, perhaps?

Can you check your metrics chart for dropped votes, and your recent log files, for messages like this:

Jul 22 00:31:08 sol-testnet-01 solana-validator[7429]: [2020-07-22T00:31:08.183230894Z WARN solana_runtime::native_loader] Failed to load: DlOpen { desc: "/usr/local/bin/deps/libsolana_bpf_loader_program.so: cannot open shared object file: No such file or directory" }

atomlab commented 4 years ago

@atomlab It looks like there are still some dropped vote errors coming from your validator, perhaps?

Can you check your metrics chart for dropped votes, and your recent log files, for messages like this:

Jul 22 00:31:08 sol-testnet-01 solana-validator[7429]: [2020-07-22T00:31:08.183230894Z WARN solana_runtime::native_loader] Failed to load: DlOpen { desc: "/usr/local/bin/deps/libsolana_bpf_loader_program.so: cannot open shared object file: No such file or directory" }

For the last five days my node has worked without any delinquency or restarts (it restarted only once, when it was updated to 1.2.13).

I grepped the last two days of logs. I don't see any errors connected with the deps libs.

root@sol-testnet-01:~# grep libsolana /var/log/syslog*
root@sol-testnet-01:~#
root@sol-testnet-01:~# grep "Failed to load" /var/log/syslog*
root@sol-testnet-01:~#
root@sol-testnet-01:~# grep "dropped vote Vote"  /var/log/syslog* | wc -l
354
root@sol-testnet-01:~#

I can share the whole logs for the last two or three days.

Files and libs

root@sol-testnet-01:~# tree /usr/local/bin/
/usr/local/bin/
├── deps
│   ├── libsolana_bpf_loader_program.rlib
│   ├── libsolana_bpf_loader_program.so
│   ├── libsolana_btc_spv_program.rlib
│   ├── libsolana_btc_spv_program.so
│   ├── libsolana_budget_program.rlib
│   ├── libsolana_budget_program.so
│   ├── libsolana_config_program.rlib
│   ├── libsolana_config_program.so
│   ├── libsolana_exchange_program.rlib
│   ├── libsolana_exchange_program.so
│   ├── libsolana_failure_program.rlib
│   ├── libsolana_failure_program.so
│   ├── libsolana_noop_program.rlib
│   ├── libsolana_noop_program.so
│   ├── libsolana_stake_program.rlib
│   ├── libsolana_stake_program.so
│   ├── libsolana_storage_program.rlib
│   ├── libsolana_storage_program.so
│   ├── libsolana_vote_program.rlib
│   └── libsolana_vote_program.so
├── loki-canary
├── node_exporter
├── promtail
├── solana
├── solana-bench-exchange
├── solana-bench-tps
├── solana-dos
├── solana_exporter
├── solana-faucet
├── solana-genesis
├── solana-gossip
├── solana-install
├── solana-install-init
├── solana-keygen
├── solana-ledger-tool
├── solana-log-analyzer
├── solana-net-shaper
├── solana-stake-accounts
├── solana-stake-monitor
├── solana-stake-o-matic
├── solana-sys-tuner
├── solana-tokens
├── solana-validator
└── solana-watchtower

1 directory, 45 files
root@sol-testnet-01:/opt# ldd /usr/local/bin/solana-validator
    linux-vdso.so.1 (0x00007ffda88bd000)
    libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffb43b1c000)
    libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffb43918000)
    libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffb436f9000)
    libssl.so.1.1 => /usr/lib/x86_64-linux-gnu/libssl.so.1.1 (0x00007ffb4346c000)
    libcrypto.so.1.1 => /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1 (0x00007ffb42fa1000)
    libudev.so.1 => /lib/x86_64-linux-gnu/libudev.so.1 (0x00007ffb42d83000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb42992000)
    /lib64/ld-linux-x86-64.so.2 (0x00007ffb45762000)
    librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ffb4278a000)
    libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffb42572000)
    libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffb421d4000)

Last three days

Screenshot 2020-07-28 at 13 02 19

ryoqun commented 4 years ago

@atomlab Thanks so much for such a detailed report. Hmm, then I think there should be no problem on your side. I'll take a further look. Maybe I'm missing something...

ryoqun commented 4 years ago

Hmm, the log at our api node is rather cluttered, as it seems there are other validators tripping on dlopen failures...

The only remaining unaddressed burst of mismatched vote logs begins with:

[2020-07-26T13:17:29.265278405Z WARN  solana_vote_program::vote_state] Db2V7nPHc4sPHne87nYXPGn8Kv8rMsiWCAjgAXmpqcpC dropped vote Vote { slots: [26012256, 26012257, 26012258, 26012259], hash: AKRQKQ4qyAveV>
[2020-07-26T13:17:29.265279686Z WARN  solana_vote_program::vote_state] B9gJJ4vMLJvnb5geZjU9PqhkyHX4jESMYajfcALQgRry dropped vote Vote { slots: [26012256, 26012257, 26012258, 26012259], hash: AKRQKQ4qyAveV>
[2020-07-26T13:17:29.265720730Z WARN  solana_vote_program::vote_state] 2D1oCLRK6geGhV5RyZ52JD9Qzqt311AEH1XrTjZdzbRh dropped vote Vote { slots: [26012256, 26012257, 26012258, 26012259], hash: AKRQKQ4qyAveV>
[2020-07-26T13:17:29.265625259Z WARN  solana_vote_program::vote_state] HfxzFiP19ymtxHagP4Zpga2zVo6ZgivpK6VkBKDowHRr dropped vote Vote { slots: [26012256, 26012257, 26012258, 26012259], hash: AKRQKQ4qyAveV>
[2020-07-26T13:17:29.265596073Z WARN  solana_vote_program::vote_state] Ft5fbkqNa76vnsjYNwjDZUXoTWpP7VYm3mtsaQckQADN dropped vote Vote { slots: [26012256, 26012257, 26012258, 26012259], hash: AKRQKQ4qyAveV>

UPDATE: Well, these are probably due to this: https://discord.com/channels/428295358100013066/670512312339398668/736939913697558629

 Víctor | melea: in this new epoch , validator not update the last, are delinquents
[10:37 PM]
Víctor | melea: Just in case :trumpet:
[10:38 PM]
Víctor | melea: skipped slots is crazy now

ryoqun commented 4 years ago

Also, indeed I couldn't see any mismatched votes coming from FdC2FZ4geXeEh4pTPXWrNH4nqsriB3xD3GME9HPD8CVR. I had just misread the metrics...

@t-nelson, the dropped vote testnet.solana.com logs you posted to Discord look similar. Related?

Yep! 100

@ryoqun testnet.solana.com is currently dead with these flooding the log from several (all?) validators. I'm just digging into it now. Will report any findings here

@t-nelson Did you find something?

If there is nothing, I think we can close this for now; there are no unidentified consensus bugs at the moment.

ryoqun commented 4 years ago

failed-match-hash2.log

For the record, I'm leaving the log here; it's what justified my conclusion above. :)

ryoqun commented 4 years ago

FYI @mvines (for status update)

t-nelson commented 4 years ago

@ryoqun nothing stood out to me. I've snipped out some log (10k lines before and 1k lines after the last root) from the original, which is 3.3GB :astonished:. We also still have the ledger at ~/ledger-old-20200723 if that's of interest

ryoqun commented 4 years ago

@ryoqun nothing stood out to me. I've snipped out some log (10k lines before and 1k lines after the last root) from the original, which is 3.3GB :astonished:. We also still have the ledger at ~/ledger-old-20200723 if that's of interest

@t-nelson I gave it a shot. In short, I spotted the reason for the mismatched votes on 7/22-23 on t.s.c, and I'm sure there is no need to address it. So I'm closing this. For the investigation, I compared t.s.c with our api node:

t.s.c node:

    216.24.140.155  me|  5311 | 5D1fNXzvv5NjV1ysLjirC4WY92RNsVH18vjmcszZd8on |1.2.12 24dc6680| 8001 | 8004 | 8005 | 8002 | 8003 | 8007 | 8008 | 8899 | 8900 | 1579

api node:

    34.105.44.244   me|  2283 | ta1Uvfb7W5BRPrdGnhP9RmeCGKzBySGM1hTE4rBRy6T  |1.2.12 bc47c806| 8001 | 8004 | 8005 | 8002 | 8003 | 8007 | 8008 | 8899 | 8900 | 1579

The ultimate reason is that even though inflation was intended to be disabled (#11072), it was effectively enabled, via a buggy inflation implementation, on the majority of nodes, while it was not on t.s.c. t.s.c was running the correct inflation implementation, with inflation actually disabled.

In other words, the correct behavior was boycotted by the rest, ironically. ;)

That's because t.s.c was running slightly newer, correct code than the rest. Now we're all running the same, newer code, so there should be no problem for either future testnet or future mainnet-beta.

More details:

At the very start of the new epoch (72, coming from epoch 71), t.s.c diverged from the rest:

DIVERGED (the ignored minority: t.s.c, with inflation disabled; note the few updated_accounts and no capitalization increase):

[2020-07-22T21:15:18.059112578Z INFO  solana_runtime::bank] bank frozen: 25148255 hash: 2nFXrbJ9fxxUv1NvZKqBm1Aq7Zrnzvi69voomqxCutMZ accounts_delta: 8qM7FENZDSkf3VxV8jT3EFke7CPQ4AAEF7eNZBfJX26S signature_count: 210 last_blockhash: Fs9cmakU9RMNcKH7xz5hYeQnTsDbap42H6x74r5oRF8u capitalization: 2494293386742964911
[2020-07-22T21:15:18.059140662Z INFO  solana_runtime::bank] accounts hash slot: 25148255 stats: BankHashStats { num_updated_accounts: 356, num_removed_accounts: 3, num_lamports_stored: 11664467026693169154, total_data_len: 797514, num_executable_accounts: 0 }
[2020-07-22T21:15:24.074462036Z INFO  solana_runtime::bank] bank frozen: 25148256 hash: DnjWty7TUWCwVzbz3iVudn15aaiCWmaj13D1rzr9rFSd accounts_delta: 91c62xXccDQS8MuFuZYXq9EHWct6GcSjpLGUrQKuzHPe signature_count: 0 last_blockhash: 2hcBxKuueVCRfsrVXq51rFNUFdYqonmNtESLLUuQyCug capitalization: 2494293386742964911
[2020-07-22T21:15:24.074494968Z INFO  solana_runtime::bank] accounts hash slot: 25148256 stats: BankHashStats { num_updated_accounts: 8, num_removed_accounts: 0, num_lamports_stored: 8, total_data_len: 174070, num_executable_accounts: 1 }

NOT DIVERGED (the overwhelming majority, including our api node, with buggy inflation effectively enabled; note the large updated_accounts and the large capitalization increase):

[2020-07-22T21:15:18.732640518Z INFO  solana_runtime::bank] bank frozen: 25148255 hash: 2nFXrbJ9fxxUv1NvZKqBm1Aq7Zrnzvi69voomqxCutMZ accounts_delta: 8qM7FENZDSkf3VxV8jT3EFke7CPQ4AAEF7eNZBfJX26S signature_count: 210 last_blockhash: Fs9cmakU9RMNcKH7xz5hYeQnTsDbap42H6x74r5oRF8u capitalization: 2494293386742964911
[2020-07-22T21:15:18.732670718Z INFO  solana_runtime::bank] accounts hash slot: 25148255 stats: BankHashStats { num_updated_accounts: 356, num_removed_accounts: 3, num_lamports_stored: 11664467026693169154, total_data_len: 797514, num_executable_accounts: 0 }
[2020-07-22T21:15:28.303396046Z INFO  solana_runtime::bank] bank frozen: 25148256 hash: ETHkQiAky8NNvwAjJmQHnMic3u1SdXLsgACt7BGVfbZX accounts_delta: AEdpCnMDWxi3QCjCX5qwDc9kkCTZJnYHAFy6Pzscs6yC signature_count: 0 last_blockhash: 2hcBxKuueVCRfsrVXq51rFNUFdYqonmNtESLLUuQyCug capitalization: 3715221769377918309
[2020-07-22T21:15:28.303456364Z INFO  solana_runtime::bank] accounts hash slot: 25148256 stats: BankHashStats { num_updated_accounts: 156954, num_removed_accounts: 0, num_lamports_stored: 7482325777811518660, total_data_len: 318271140, num_executable_accounts: 1 }

So, why wasn't inflation disabled correctly?

The culprit is here: https://github.com/solana-labs/solana/blob/14baa511f004c18311fb4c806cfaa28d47048fd4/runtime/src/bank.rs#L793-L795

When inflation becomes disabled, the freshly computed validator_point_value is always 0, which fails the .is_normal() check (0 is not a normal float). So the sysvar's (= rewards in the code) validator_point_value is used instead as a fallback, and that usually holds a non-zero value from when the cluster had inflation enabled. In other words, this was a bit of a footgun... Fortunately, we no longer use the sysvar at all anymore (note the comment):

https://github.com/solana-labs/solana/blob/e07c00710a69ae69d60ae539339eb0f53e950452/runtime/src/bank.rs#L703-L709
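
A minimal sketch of the footgun pattern described above (names and structure are illustrative, not the exact bank.rs code): the computed point value of 0.0 fails f64::is_normal(), so the stale, non-zero value left in the rewards sysvar wins.

// Illustrative fallback logic: if the freshly computed point value is not
// a "normal" float (0.0 is not), fall back to whatever the rewards sysvar
// last held, which is non-zero on a previously inflation-enabled cluster.
fn check_point_value(computed: f64, sysvar_value: f64) -> f64 {
    if !computed.is_normal() {
        sysvar_value
    } else {
        computed
    }
}

fn main() {
    // Inflation disabled: the computed value is 0.0, yet a stale sysvar
    // value sneaks back in, effectively re-enabling rewards payouts.
    assert_eq!(check_point_value(0.0, 0.0002), 0.0002);
    // Inflation enabled: the computed value is normal and wins.
    assert_eq!(check_point_value(0.0005, 0.0002), 0.0005);
    println!("footgun reproduced");
}

As noted above, the current code sidesteps this entirely by no longer consulting the sysvar.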

t-nelson commented 4 years ago

@ryoqun Thanks for the analysis! Perhaps too quick to close though :upside_down_face:

I think this is what halted my v1.1 -> v1.2 rolling upgrade test. I'm still investigating, but the cluster just so happened to cross an epoch boundary with roughly half of the validators on v1.1 and half on v1.2. Each half stopped making roots in short order, rejecting each other's votes on mismatched hashes. I'm going to check the rewards sysvar from each and :crossed_fingers: there's a delta.

UPDATE:

Validator:

SysvarRewards111111111111111111111111111111:
  - balance: 0.000000001 SOL
  - owner: 'Sysvar1111111111111111111111111111111111111'
  - executable: false
  - data: 'NMgTXzDaJA21JJEHjeJij1'
  - data_len: 16

BSV:

SysvarRewards111111111111111111111111111111:
  - balance: 0.000000001 SOL
  - owner: 'Sysvar1111111111111111111111111111111111111'
  - executable: false
  - data: 'NGQq1TYA83r5aEHD8dyZEo'
  - data_len: 16
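
The data fields of the two dumps differ, which is exactly the delta being looked for. Purely for illustration, here is a hedged decoding sketch, assuming the data strings are base58-encoded 16-byte Rewards sysvar contents laid out as two little-endian f64 fields with validator_point_value first (that layout is an assumption, not something shown in the dumps); it uses the bs58 crate.

// Hypothetical decoding sketch (assumes base58 data and a two-f64 layout).
fn decode_rewards(data_b58: &str) -> (f64, f64) {
    let bytes = bs58::decode(data_b58).into_vec().expect("valid base58");
    assert_eq!(bytes.len(), 16, "expected 16 bytes of sysvar data");
    let first = f64::from_le_bytes(bytes[0..8].try_into().unwrap());
    let second = f64::from_le_bytes(bytes[8..16].try_into().unwrap());
    (first, second)
}

fn main() {
    // Data fields copied from the two account dumps above.
    let validator = decode_rewards("NMgTXzDaJA21JJEHjeJij1");
    let bsv = decode_rewards("NGQq1TYA83r5aEHD8dyZEo");
    println!("validator rewards sysvar: {:?}", validator);
    println!("bsv       rewards sysvar: {:?}", bsv);
}

Decoded or not, the point stands: the two halves disagree on the rewards sysvar contents, which is enough to diverge the bank hash at the epoch boundary.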

:tada: :tada: :tada:

Gonna rerun the test with longer epochs. Should do the trick

UPDATE 2:

It worked! :trumpet: