near / nearcore-support

This repository is used to keep track of nearcore issues reported by validator community and infrastructure partners

[Node Issue]: Garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111 #9

Open MichaelLLC opened 2 weeks ago

MichaelLLC commented 2 weeks ago

Contact Details

michael@impulse.expert

What happened?

I downloaded a snapshot for the archive node. I then launched the archive node, and the node logs show this warning repeatedly:

WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
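
A note on the value in this warning: nearcore prints hashes in base58, and a string of 32 `1` characters decodes to 32 zero bytes. That suggests (an assumption, not something confirmed in this thread) that GC is looking up an unset or default hash rather than a real block header. A minimal sketch to verify the decoding, using a hand-written decoder and the hypothetical helper name `b58decode`:

```python
# Bitcoin-style base58 alphabet; nearcore uses this encoding when printing hashes.
B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58decode(text: str) -> bytes:
    """Decode a base58 string into raw bytes (illustrative helper, not a nearcore API)."""
    num = 0
    for ch in text:
        num = num * 58 + B58_ALPHABET.index(ch)
    # Convert the integer to big-endian bytes (empty when num == 0).
    body = num.to_bytes((num.bit_length() + 7) // 8, "big")
    # Each leading '1' stands for one leading zero byte.
    leading_zeros = len(text) - len(text.lstrip("1"))
    return b"\x00" * leading_zeros + body


missing = "11111111111111111111111111111111"  # value copied from the warning
raw = b58decode(missing)
print(len(raw), raw == bytes(32))
```

If the comparison holds, the header being requested is the all-zero hash, which may be worth mentioning when reporting the issue.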

Version

2.1.1 without SHA-NI

Node type

Split Storage Archival

Are you a validator?

Relevant log output

2024-08-27T22:27:53.471265Z  INFO neard: version="trunk" build="2.1.1-no-shani" latest_protocol=71
2024-08-27T22:27:53.471655Z  INFO config: Validating Config, extracted from config.json...
2024-08-27T22:27:53.473934Z  WARN genesis: Skipped genesis validation
2024-08-27T22:27:53.473958Z  INFO config: Validating Genesis config and records. This could take a few minutes...
2024-08-27T22:27:53.474366Z  INFO config: All validations have passed!
2024-08-27T22:27:53.477098Z  INFO neard: Changing the config "/root/.near/log_config.json". config=LogConfig { rust_log: None, verbose_module: None, opentelemetry: None }
2024-08-27T22:27:53.477253Z  INFO config: Validating Config, extracted from config.json...
2024-08-27T22:27:53.477272Z  INFO neard: No validator key /root/.near/validator_key.json.
2024-08-27T22:27:53.477327Z  INFO near_o11y::reload: Updated the logging layer according to `log_config.json`
2024-08-27T22:27:53.477355Z  INFO db_opener: Opening NodeStorage path="/root/.near/hot-data" cold_path="/root/.near/cold-data"
2024-08-27T22:27:53.477382Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.495098Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.495124Z  INFO db_opener: The database exists. path=/root/.near/hot-data
2024-08-27T22:27:53.495136Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.548265Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.548293Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.551772Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.551788Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.554771Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.554779Z  INFO db_opener: The database exists. path=/root/.near/cold-data
2024-08-27T22:27:53.554784Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.650782Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.650808Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.654321Z  INFO db: Closed a RocksDB instance. num_instances=0
2024-08-27T22:27:53.654332Z  INFO db: Opened a new RocksDB instance. num_instances=1
2024-08-27T22:27:53.697501Z  INFO db: Opened a new RocksDB instance. num_instances=2
2024-08-27T22:27:53.824520Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:53.841823Z  INFO config: Mutable config field 'expected_shutdown' remains the same: None
2024-08-27T22:27:53.841836Z  INFO config: Mutable config field 'resharding_config' remains the same: ReshardingConfig { batch_size: 500.0 KB, batch_delay: Duration { seconds: 0, nanoseconds: 100000000 }, retry_delay: Duration { seconds: 10, nanoseconds: 0 }, initial_delay: Duration { seconds: 0, nanoseconds: 0 }, max_poll_time: Duration { seconds: 7200, nanoseconds: 0 } }
2024-08-27T22:27:53.841843Z  INFO config: Mutable config field 'produce_chunk_add_transactions_time_limit' remains the same: None
2024-08-27T22:27:53.841848Z  INFO config: Updated ClientConfig
2024-08-27T22:27:53.844214Z  INFO stats: # 9820210 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s NaN bps 0 gas/s CPU: 0%, Mem: 240 MB
2024-08-27T22:27:54.825946Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:55.827307Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:56.828598Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:57.829930Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:58.831306Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:27:59.831536Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:00.832790Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:01.834087Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:02.835408Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:03.836656Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:03.846729Z  INFO stats: # 9820210 Waiting for peers 0 peers ⬇ 0 B/s ⬆ 0 B/s 0.00 bps 0 gas/s CPU: 2%, Mem: 284 MB
2024-08-27T22:28:04.837943Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
2024-08-27T22:28:05.839184Z  WARN garbage collection: Error in gc: DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111
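
To quantify the warnings above rather than eyeballing them, here is a rough sketch that counts the GC warnings in a log excerpt and measures how long they have been repeating. The function name `summarize_gc_warnings` is made up for illustration, and the regular expression assumes the exact line layout shown in this log.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-08-27T22:27:53.824520Z  WARN garbage collection: Error in gc: <message>
GC_WARNING = re.compile(r"^(\S+)\s+WARN garbage collection: Error in gc: (.+)$")


def summarize_gc_warnings(log_text: str) -> dict:
    """Count GC warnings, collect distinct messages, and measure the time span."""
    stamps = []
    messages = set()
    for line in log_text.splitlines():
        match = GC_WARNING.match(line.strip())
        if match:
            # fromisoformat() on older Python versions does not accept a trailing "Z".
            stamps.append(datetime.fromisoformat(match.group(1).replace("Z", "+00:00")))
            messages.add(match.group(2))
    span = (stamps[-1] - stamps[0]).total_seconds() if stamps else 0.0
    return {
        "count": len(stamps),
        "distinct_messages": sorted(messages),
        "span_seconds": span,
    }


sample = (
    "2024-08-27T22:27:53.824520Z  WARN garbage collection: Error in gc: "
    "DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111\n"
    "2024-08-27T22:27:53.841848Z  INFO config: Updated ClientConfig\n"
    "2024-08-27T22:27:54.825946Z  WARN garbage collection: Error in gc: "
    "DB Not Found Error: BLOCK HEADER: 11111111111111111111111111111111\n"
)
print(summarize_gc_warnings(sample))
```

Running it over the full log should show a single distinct message repeating roughly once per second, which matches the excerpt above.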

Node head info

version="trunk" build="2.1.1-no-shani" latest_protocol=71

Node upgrade history

I started working immediately with version 2.1.1

DB reset history

I restarted the node as soon as I saw a large list of errors.

telezhnaya commented 2 weeks ago

I downloaded a snapshot for the archive node.

How did you do this? Which instructions were you following? Have you downloaded the genesis config as well?

MichaelLLC commented 2 weeks ago

I deployed an archival NEAR node following the instructions here:

https://near-nodes.io/rpc/run-rpc-node-without-nearup#mainnet
https://near-nodes.io/archival/split-storage-archival

MichaelLLC commented 2 weeks ago

This is my node config:

{
  "genesis_file": "genesis.json",
  "genesis_records_file": null,
  "validator_key_file": "validator_key.json",
  "node_key_file": "node_key.json",
  "rpc": {
    "addr": "0.0.0.0:15030",
    "prometheus_addr": null,
    "cors_allowed_origins": [
      "*"
    ],
    "polling_config": {
      "polling_interval": {
        "secs": 0,
        "nanos": 500000000
      },
      "polling_timeout": {
        "secs": 10,
        "nanos": 0
      }
    },
    "limits_config": {
      "json_payload_max_size": 10485760
    },
    "enable_debug_rpc": false,
    "experimental_debug_pages_src_path": null
  },
  "telemetry": {
    "endpoints": [
      "https://explorer.mainnet.near.org/api/nodes",
      "https://telemetry.nearone.org/nodes/mainnet"
    ],
    "reporting_interval": {
      "secs": 10,
      "nanos": 0
    }
  },
  "network": {
    "addr": "0.0.0.0:24567",
    "boot_nodes": "ed25519:86EtEy7epneKyrcJwSWP7zsisTkfDRH5CFVszt4qiQYw@35.195.32.249:24567,ed25519:BFB78VTDBBfCY4jCP99zWxhXUcFAZqR22oSx2KEr8UM1@35.229.222.235:24567,ed25519:Cw1YyiX9cybvz3yZcbYdG7oDV6D7Eihdfc8eM1e1KKoh@35.195.27.104:24567,ed25519:33g3PZRdDvzdRpRpFRZLyscJdbMxUA3j3Rf2ktSYwwF8@34.94.132.112:24567,ed25519:CDQFcD9bHUWdc31rDfRi4ZrJczxg8derCzybcac142tK@35.196.209.192:24567",
    "whitelist_nodes": "",
    "max_num_peers": 40,
    "minimum_outbound_peers": 5,
    "ideal_connections_lo": 30,
    "ideal_connections_hi": 35,
    "peer_recent_time_window": {
      "secs": 600,
      "nanos": 0
    },
    "safe_set_size": 20,
    "archival_peer_connections_lower_bound": 10,
    "handshake_timeout": {
      "secs": 20,
      "nanos": 0
    },
    "skip_sync_wait": false,
    "ban_window": {
      "secs": 10800,
      "nanos": 0
    },
    "blacklist": [],
    "ttl_account_id_router": {
      "secs": 3600,
      "nanos": 0
    },
    "peer_stats_period": {
      "secs": 5,
      "nanos": 0
    },
    "monitor_peers_max_period": {
      "secs": 60,
      "nanos": 0
    },
    "peer_states_cache_size": 1000,
    "peer_expiration_duration": {
      "secs": 604800,
      "nanos": 0
    },
    "public_addrs": [],
    "allow_private_ip_in_public_addrs": false,
    "trusted_stun_servers": [
      "stun.l.google.com:19302",
      "stun1.l.google.com:19302",
      "stun2.l.google.com:19302",
      "stun3.l.google.com:19302",
      "stun4.l.google.com:19302"
    ],
    "experimental": {
      "inbound_disabled": false,
      "connect_only_to_boot_nodes": false,
      "skip_sending_tombstones_seconds": 0,
      "tier1_enable_inbound": true,
      "tier1_enable_outbound": true,
      "tier1_connect_interval": {
        "secs": 60,
        "nanos": 0
      },
      "tier1_new_connections_per_attempt": 50
    }
  },
  "consensus": {
    "min_num_peers": 3,
    "block_production_tracking_delay": {
      "secs": 0,
      "nanos": 100000000
    },
    "min_block_production_delay": {
      "secs": 1,
      "nanos": 300000000
    },
    "max_block_production_delay": {
      "secs": 3,
      "nanos": 0
    },
    "max_block_wait_delay": {
      "secs": 6,
      "nanos": 0
    },
    "produce_empty_blocks": true,
    "block_fetch_horizon": 50,
    "block_header_fetch_horizon": 50,
    "catchup_step_period": {
      "secs": 0,
      "nanos": 100000000
    },
    "chunk_request_retry_period": {
      "secs": 0,
      "nanos": 400000000
    },
    "header_sync_initial_timeout": {
      "secs": 10,
      "nanos": 0
    },
    "header_sync_progress_timeout": {
      "secs": 2,
      "nanos": 0
    },
    "header_sync_stall_ban_timeout": {
      "secs": 120,
      "nanos": 0
    },
    "state_sync_timeout": {
      "secs": 60,
      "nanos": 0
    },
    "header_sync_expected_height_per_second": 10,
    "sync_check_period": {
      "secs": 10,
      "nanos": 0
    },
    "sync_step_period": {
      "secs": 0,
      "nanos": 10000000
    },
    "doomslug_step_period": {
      "secs": 0,
      "nanos": 100000000
    },
    "sync_height_threshold": 1
  },
  "tracked_accounts": [],
  "tracked_shards": [0],
  "log_summary_style": "colored",
  "log_summary_period": {
    "secs": 10,
    "nanos": 0
  },
  "enable_multiline_logging": false,
  "gc_blocks_limit": 2,
  "gc_fork_clean_step": 100,
  "gc_num_epochs_to_keep": 5,
  "view_client_threads": 4,
  "epoch_sync_enabled": false,
  "view_client_throttle_period": {
    "secs": 30,
    "nanos": 0
  },
  "trie_viewer_state_size_limit": 50000,
  "store": {
    "path": "hot-data",
    "enable_statistics": false,
    "enable_statistics_export": true,
    "max_open_files": 10000,
    "col_state_cache_size": 3221225472,
    "col_flat_state_cache_size": 134217728,
    "block_size": 16384,
    "trie_cache": {
      "default_max_bytes": 500000000,
      "per_shard_max_bytes": {
        "s1.v1": 50000000,
        "s3.v1": 3000000000,
        "s1.v2": 50000000,
        "s2.v2": 3000000000,
        "s4.v2": 3000000000,
        "s1.v3": 50000000,
        "s2.v3": 1500000000,
        "s3.v3": 1500000000,
        "s5.v3": 3000000000
      },
      "shard_cache_deletions_queue_capacity": 100000
    },
    "view_trie_cache": {
      "default_max_bytes": 50000000,
      "per_shard_max_bytes": {},
      "shard_cache_deletions_queue_capacity": 100000
    },
    "enable_receipt_prefetching": true,
    "sweat_prefetch_receivers": [
      "token.sweat",
      "vfinal.token.sweat.testnet"
    ],
    "sweat_prefetch_senders": [
      "oracle.sweat",
      "sweat_the_oracle.testnet"
    ],
    "claim_sweat_prefetch_config": [
      {
        "receiver": "claim.sweat",
        "sender": "token.sweat",
        "method_name": "record_batch_for_hold"
      },
      {
        "receiver": "claim.sweat",
        "sender": "",
        "method_name": "claim"
      }
    ],
    "load_mem_tries_for_shards": [],
    "load_mem_tries_for_tracked_shards": false,
    "state_snapshot_config": {
      "state_snapshot_type": "ForReshardingOnly"
    },
    "state_snapshot_enabled": false
  },
  "state_sync_enabled": true,
  "state_sync": {
    "sync": {
      "ExternalStorage": {
        "location": {
          "GCS": {
            "bucket": "state-parts"
          }
        },
        "num_concurrent_requests": 25,
        "num_concurrent_requests_during_catchup": 5
      }
    }
  },
  "transaction_pool_size_limit": 100000000,
  "archive": true,
  "save_trie_changes": true,
  "cold_store": {
    "path": "cold-data",
    "enable_statistics": false,
    "enable_statistics_export": true,
    "max_open_files": 10000,
    "col_state_cache_size": 3221225472,
    "col_flat_state_cache_size": 134217728,
    "block_size": 16384,
    "trie_cache": {
      "default_max_bytes": 500000000,
      "per_shard_max_bytes": {
        "s1.v1": 50000000,
        "s3.v1": 3000000000,
        "s1.v2": 50000000,
        "s2.v2": 3000000000,
        "s4.v2": 3000000000,
        "s1.v3": 50000000,
        "s2.v3": 1500000000,
        "s3.v3": 1500000000,
        "s5.v3": 3000000000
      },
      "shard_cache_deletions_queue_capacity": 100000
    },
    "view_trie_cache": {
      "default_max_bytes": 50000000,
      "per_shard_max_bytes": {},
      "shard_cache_deletions_queue_capacity": 100000
    },
    "enable_receipt_prefetching": true,
    "sweat_prefetch_receivers": [
      "token.sweat",
      "vfinal.token.sweat.testnet"
    ],
    "sweat_prefetch_senders": [
      "oracle.sweat",
      "sweat_the_oracle.testnet"
    ],
    "claim_sweat_prefetch_config": [
      {
        "receiver": "claim.sweat",
        "sender": "token.sweat",
        "method_name": "record_batch_for_hold"
      },
      {
        "receiver": "claim.sweat",
        "sender": "",
        "method_name": "claim"
      }
    ],
    "load_mem_tries_for_shards": [],
    "load_mem_tries_for_tracked_shards": false,
    "state_snapshot_config": {
      "state_snapshot_type": "ForReshardingOnly"
    },
    "state_snapshot_enabled": false
  },
  "split_storage": {
    "enable_split_storage_view_client": true
  }
}
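
Since the config is long, a small sketch that pulls out only the fields most relevant to a split-storage archival setup may help when comparing nodes. The helper name `archival_summary` is hypothetical, and the choice of fields is an assumption about what matters here, not an official checklist.

```python
import json


def archival_summary(config: dict) -> dict:
    """Extract the archival and GC related fields from a parsed config.json."""
    return {
        "archive": config.get("archive"),
        "hot_path": (config.get("store") or {}).get("path"),
        "cold_path": (config.get("cold_store") or {}).get("path"),
        "split_storage": config.get("split_storage"),
        "gc_num_epochs_to_keep": config.get("gc_num_epochs_to_keep"),
        "tracked_shards": config.get("tracked_shards"),
    }


# Excerpt of the config above; in practice, load the real file with json.load().
excerpt = json.loads("""
{
  "tracked_shards": [0],
  "gc_num_epochs_to_keep": 5,
  "store": {"path": "hot-data"},
  "archive": true,
  "cold_store": {"path": "cold-data"},
  "split_storage": {"enable_split_storage_view_client": true}
}
""")
print(json.dumps(archival_summary(excerpt), indent=2))
```

For the config posted here this reports `archive` as true, with `hot-data` and `cold-data` as the two store paths, which agrees with the paths in the startup log.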

maximdogonov commented 2 weeks ago

Up!

MichaelLLC commented 2 weeks ago

The problem reproduces only on version 2.1.1 built without SHA-NI support, using this commit: https://github.com/near/nearcore/commit/6c39e409db95f888863c834bb482ae1cea7e5403

MichaelLLC commented 2 weeks ago

I built it this way on the recommendation of the team:

Dear node operators,

The 2.1.1 release enforces that CPUs have sha-ni extensions. 
Upcoming features will require the performance boost from sha-ni. 

If you don’t have sha-ni support, your node will crash loop and logs will repeat right after the DB is open. If you check dmesg, you will see the message Illegal instruction (core dumped). At this point your options are to either change your node type with your cloud provider to something that supports sha-ni or continue running on unsupported hardware at your own risk.

If you decide to keep running without sha-ni, you need to recompile the neard binary with this commit on top of the 2.1.1 branch.
https://github.com/near/nearcore/commit/6c39e409db95f888863c834bb482ae1cea7e5403
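
To check whether a Linux host advertises SHA-NI before deciding between the two options above, one approach is to look for the `sha_ni` flag in `/proc/cpuinfo`. The sketch below assumes an x86 Linux machine; the helper name `has_sha_ni` is made up for illustration and is not part of neard.

```python
def has_sha_ni(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo lists the sha_ni feature."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "sha_ni" in value.split():
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            print("SHA-NI supported:", has_sha_ni(cpuinfo.read()))
    except OSError:
        # /proc/cpuinfo is Linux-specific; other systems need a different check.
        print("Could not read /proc/cpuinfo on this system")
```

Note that inside a virtual machine the flag reflects what the hypervisor exposes, so the result can differ from the physical CPU's capabilities.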

telezhnaya commented 1 week ago

Have you tried waiting a bit longer? See https://github.com/near/nearcore/issues/11936