Sortition DB at /root/stacks-blockchain/data/mainnet/burnchain/sortition does not support epoch 2.0

pleaseshutup commented 1 year ago

Seems to be running for quite a while then evetually I end up here:

here's the log output

stacks-blockchain      | Dumping core for pid 1
stacks-blockchain      | INFO [1673114851.498911] [testnet/stacks-node/src/main.rs:82] [main] stacks-node 2.05.0.6.0 (master:a71871a, release build, linux [x86_64])
stacks-blockchain      | INFO [1673114851.498929] [testnet/stacks-node/src/main.rs:140] [main] Loading config at path /src/stacks-blockchain/Config.toml
stacks-blockchain      | INFO [1673114851.500408] [testnet/stacks-node/src/event_dispatcher.rs:963] [main] Registering event observer at: stacks-blockchain-api:3700
stacks-blockchain      | ERRO [1673114851.521303] [src/chainstate/coordinator/mod.rs:927] [main] Sortition DB at /root/stacks-blockchain/data/mainnet/burnchain/sortition does not support epoch 2.0
stacks-blockchain      | WARN [1673114851.521319] [src/chainstate/coordinator/mod.rs:965] [main] Unable to migrate chainstate DBs to the latest schemas in the current epoch
stacks-blockchain      | ERRO [1673114851.521323] [testnet/stacks-node/src/run_loop/neon.rs:364] [main] FATAL: chainstate database(s) are not compatible with the current system epoch
stacks-blockchain      | ERRO [1673114851.521328] [testnet/stacks-node/src/main.rs:58] [main] Process abort due to thread panic: panicked at 'explicit panic', testnet/stacks-node/src/run_loop/neon.rs:367:17
stacks-blockchain      | ERRO [1673114851.574159] [testnet/stacks-node/src/main.rs:60] [main] Panic backtrace:    0: stacks_node::main::{{closure}}
stacks-blockchain      |              at src/testnet/stacks-node/src/main.rs:59:18
stacks-blockchain      |    1: std::panicking::rust_panic_with_hook
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:702:17
stacks-blockchain      |    2: std::panicking::begin_panic_handler::{{closure}}
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:586:13
stacks-blockchain      |    3: std::sys_common::backtrace::__rust_end_short_backtrace
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:138:18
stacks-blockchain      |    4: rust_begin_unwind
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
stacks-blockchain      |    5: core::panicking::panic_fmt
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
stacks-blockchain      |    6: core::panicking::panic
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:48:5
stacks-blockchain      |    7: stacks_node::run_loop::neon::RunLoop::instantiate_burnchain_state
stacks-blockchain      |              at src/testnet/stacks-node/src/run_loop/neon.rs:367:17
stacks-blockchain      |       stacks_node::run_loop::neon::RunLoop::start
stacks-blockchain      |              at src/testnet/stacks-node/src/run_loop/neon.rs:571:13
stacks-blockchain      |    8: stacks_node::main
stacks-blockchain      |              at src/testnet/stacks-node/src/main.rs:212:9
stacks-blockchain      |    9: core::ops::function::FnOnce::call_once
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:248:5
stacks-blockchain      |       std::sys_common::backtrace::__rust_begin_short_backtrace
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/sys_common/backtrace.rs:122:18
stacks-blockchain      |   10: std::rt::lang_start::{{closure}}
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:166:18
stacks-blockchain      |   11: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:283:13
stacks-blockchain      |       std::panicking::try::do_call
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
stacks-blockchain      |       std::panicking::try
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
stacks-blockchain      |       std::panic::catch_unwind
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
stacks-blockchain      |       std::rt::lang_start_internal::{{closure}}
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:48
stacks-blockchain      |       std::panicking::try::do_call
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
stacks-blockchain      |       std::panicking::try
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
stacks-blockchain      |       std::panic::catch_unwind
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
stacks-blockchain      |       std::rt::lang_start_internal
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:20
stacks-blockchain      |   12: main
stacks-blockchain      | 
stacks-blockchain      | Dumping core for pid 1

wileyj commented 1 year ago

The reason for this error (core-devs are aware of it) is that the initial sync was interrupted (service restart, crash due ot resource exhaustion etc). when the service attempted to restart in that type of state (interrupted iniital header sync), the above error is what you'll see.

the only resolution for now is to stop the service(s) and remove the existing chainstate then resync from genesis

pleaseshutup commented 1 year ago

I've done this process a few times now ./manage.sh -n mainnet -a reset

takes quite a while (days) to sync up. Then if something interrupts it has to start over.

Then i get here:

stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:20.963Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:26.070Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:31.175Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:36.282Z"}
stacks-blockchain      |       std::panic::catch_unwind
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
stacks-blockchain      |       std::rt::lang_start_internal::{{closure}}
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:48
stacks-blockchain      |       std::panicking::try::do_call
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492:40
stacks-blockchain      |       std::panicking::try
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:456:19
stacks-blockchain      |       std::panic::catch_unwind
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panic.rs:137:14
stacks-blockchain      |       std::rt::lang_start_internal
stacks-blockchain      |              at rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/rt.rs:148:20
stacks-blockchain      |   12: main
stacks-blockchain      | 
stacks-blockchain      | Dumping core for pid 1
postgres               | 
postgres               | PostgreSQL Database directory appears to contain a database; Skipping initialization
postgres               | 
postgres               | 2023-01-08 00:55:54.744 UTC [1] LOG:  starting PostgreSQL 14.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 12.2.1_git20220924-r4) 12.2.1 20220924, 64-bit
postgres               | 2023-01-08 00:55:54.744 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
postgres               | 2023-01-08 00:55:54.744 UTC [1] LOG:  listening on IPv6 address "::", port 5432
postgres               | 2023-01-08 00:55:54.749 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
postgres               | 2023-01-08 00:55:54.755 UTC [22] LOG:  database system was shut down at 2023-01-07 18:40:13 UTC
postgres               | 2023-01-08 00:55:54.760 UTC [1] LOG:  database system is ready to accept connections
postgres               | 2023-01-09 18:48:08.280 UTC [1] LOG:  received fast shutdown request
postgres               | 2023-01-09 18:48:08.288 UTC [1] LOG:  aborting any active transactions
postgres               | 2023-01-09 18:48:08.289 UTC [75535] FATAL:  terminating connection due to administrator command
postgres               | 2023-01-09 18:48:08.289 UTC [75497] FATAL:  terminating connection due to administrator command
postgres               | 2023-01-09 18:48:08.289 UTC [75497] CONTEXT:  SQL statement "CREATE TEMP TABLE pg_temp_5.pg_temp_16748_2 AS SELECT mv.ctid AS tid, newdata.*::pg_temp_5.pg_temp_16748 AS newdata FROM public.nft_custody_unanchored mv FULL JOIN pg_temp_5.pg_temp_16748 newdata ON (newdata.asset_identifier OPERATOR(pg_catalog.=) mv.asset_identifier AND newdata.value OPERATOR(pg_catalog.=) mv.value AND newdata.* OPERATOR(pg_catalog.*=) mv.*) WHERE newdata.* IS NULL OR mv.* IS NULL ORDER BY tid"
postgres               | 2023-01-09 18:48:08.289 UTC [75497] STATEMENT:  REFRESH MATERIALIZED VIEW CONCURRENTLY "nft_custody_unanchored"
postgres               | 2023-01-09 18:48:08.289 UTC [75548] FATAL:  terminating connection due to administrator command
postgres               | 2023-01-09 18:48:08.289 UTC [35] FATAL:  terminating connection due to administrator command
postgres               | 2023-01-09 18:48:08.289 UTC [31] FATAL:  terminating connection due to administrator command
postgres               | 2023-01-09 18:48:08.290 UTC [1] LOG:  background worker "logical replication launcher" (PID 28) exited with exit code 1
postgres               | 2023-01-09 18:48:08.308 UTC [23] LOG:  shutting down
postgres               | 2023-01-09 18:48:08.408 UTC [75563] FATAL:  the database system is shutting down
postgres               | 2023-01-09 18:48:08.410 UTC [75564] FATAL:  the database system is shutting down
postgres               | 2023-01-09 18:48:08.473 UTC [1] LOG:  database system is shut down

stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:41.390Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:46.495Z"}

stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:51.602Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:02:56.711Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:01.818Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:06.926Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:12.031Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:17.137Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:22.243Z"}
stacks-blockchain exited with code 1
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:27.350Z"}
stacks-blockchain-api  | {"level":"error","message":"Pg connection failed: Error: getaddrinfo EAI_AGAIN postgres, retrying..","timestamp":"2023-01-10T17:03:32.456Z"}

wileyj commented 1 year ago

i'll need some specific details on the host OS (including system resources), configuration file you're using and more of the logs . the services shutting down is not a good sign, but can happen if you're trying to run on an underpowered system (in which case, init will start killing processes so the host doesn't crash).

wileyj commented 1 year ago

The API logs specifically are not what I'm expecting - is there anything before the Pg connection error lines?

pleaseshutup commented 1 year ago

It's ubuntu 22 on a r5.8xlarge instance I'm back to the error: Sortition DB at /root/stacks-blockchain/data/mainnet/burnchain/sortition does not support epoch 2.0 as originally reported.

So I think what is happening is that the instance needs ~3+ days to re-sync from genesis after i do: ./manage.sh -n mainnet -a reset

Once I am in this state there is no way to get it to resume from where it left off? Seems that I will be unable to sync this on a spot instance or a server that isn't perfectly reliable for many days in a row?

wileyj commented 1 year ago

It's ubuntu 22 on a r5.8xlarge instance I'm back to the error: Sortition DB at /root/stacks-blockchain/data/mainnet/burnchain/sortition does not support epoch 2.0 as originally reported.

So I think what is happening is that the instance needs ~3+ days to re-sync from genesis after i do: ./manage.sh -n mainnet -a reset

Once I am in this state there is no way to get it to resume from where it left off? Seems that I will be unable to sync this on a spot instance or a server that isn't perfectly reliable for many days in a row?

that could absolutely be the reason (with full logs it would be more apparent). Spot instances would be a poor choice since they can be shut down at any time. The blockchain has to shutdown gracefully or you can have a corrupt chainstate - that's what is seeming likely here (the host os and resources of that VM are perfectly fine).

this is precisely why i've set that incredibly long shutdown timeout for the blockshain (20 minutes is default, i've seen it take 10-15 minutes in some cases to gracefully shut down).

i've never attempted this on a spot instance, but if the pricing changes and the node is shutdown -- that's likely what you're seeing here.

you can use a m4/m5 instance here, those are relatively cheap - there is also a digitalocean one-click you could use (i think that's around $20/month).

there are other options as well, like quicknode - but i wouldn't use a spot instance here unless you overpay which is probably the opposite of the reason you chose a spot instance.

wileyj commented 1 year ago

for reference, i've seen this exact error:

stacks-blockchain      | ERRO [1673114851.521323] [testnet/stacks-node/src/run_loop/neon.rs:364] [main] FATAL: chainstate database(s) are not compatible with the current system epoch

when i force kill a blockchain instance (usually when i'm testing a release candidate etc, and i made a typo in my config), this is the error i would see if the headers/marf tables haven't been created before the sigkill.

it's known, and i've mentioned it on blockchain calls as well - i think the error message should be more clear as to why this error is reached.

pleaseshutup commented 1 year ago

I use spot instances because they are very inexpensive and all of my systems are very tolerant of sudden failures. Chaos monkey type stuff. Unfortunate that the stacks system isn't better at handling sudden interruptions and/or recovering a long sync operation.

Thank you for your information on this. Helps confirm my suspicions. I will have to figure out an alternative machine/service to run this node on. I chose this node because it already runs bitcoind without issue even after spot instance sudden interruptions :-)

wileyj commented 1 year ago

yea, i understand the reasoning - from the blockchain side, it's processing a lot of sortitions and burnchain data before it settles on block commit. most of the logs you see are that process, and it's a lot of intermediate I/O to the DB. sending a SIGKILL to the blockchain interrupts that process, and on restart the binary is seeing data it isn't expecting and panics.

i'd encourage you to file an issue here: https://github.com/stacks-network/stacks-blockchain/issues/new , i wonder if there's a way to recover from these unclean shutdowns.

stacks-network / stacks-blockchain-docker

Sortition DB at /root/stacks-blockchain/data/mainnet/burnchain/sortition does not support epoch 2.0 #97