[Stacks 2.1] 500 response from `GET /v2/pox`

zone117x commented 1 year ago

Running the latest next branch in mocknet mode, and the GET /v2/pox endpoint often returns an HTTP 500 error with the message: Failed to query peer info. On my machine, there's around a 2 second window of time where the 500 errors are returned.

It's easier to reproduce this issue with a short mocknet block time. So perhaps some race condition that happens around block assembly / broadcast time?

Here's a snippet of logs during a 500 request, with STACKS_LOG_DEBUG=1: stacks-node-500.log.txt

It looks like these are the relevant lines:

INFO [1661342705.107448] [testnet/stacks-node/src/tenure.rs:70] [main] Node starting new tenure with VRF f1d24b909c98118e60f4396d65ab320dfa7c907c43d5de2031f8b170203441a3
DEBG [1661342705.199959] [src/net/p2p.rs:4918] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): load chain view for burn block 30
DEBG [1661342705.204901] [src/net/inv.rs:2530] [ThreadId(2)] Awaken inv sync to re-scan peer block inventories at height 28
DEBG [1661342705.205719] [src/net/inv.rs:2238] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): PoX bit vector is PoxId([true]) (reloaded=true)
INFO [1661342705.206280] [src/net/http.rs:1635] [ThreadId(2)] Handle HTTPRequest, verb: GET, peer_addr: 127.0.0.1:55041, path: /v2/pox, query: 
WARN [1661342705.233753] [src/net/rpc.rs:695] [ThreadId(2)] Failed to get PoX info GetPoxInfo(HttpRequestMetadata { version: Http11, peer: PeerHost::DNS(localhost,20443), keep_alive: true, canonical_stacks_tip_height: None }, UseLatestAnchoredTip): DBError(NotFoundError)
DEBG [1661342705.234666] [src/net/rpc.rs:3092] [ThreadId(2)] Processed HTTPRequest, path: /v2/pox, processing_time_ms: 28, conn_id: 8, peer_addr: 127.0.0.1:55041
DEBG [1661342705.234893] [src/net/p2p.rs:3611] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): network work state is GetPublicIP
WARN [1661342705.234911] [src/net/p2p.rs:2453] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): failed to learn public IP: NoSuchNeighbor
DEBG [1661342705.234920] [src/net/p2p.rs:2415] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): reset public IP query state
DEBG [1661342705.234928] [src/net/p2p.rs:3611] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): network work state is BlockInvSync
DEBG [1661342705.234937] [src/net/p2p.rs:2472] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): network inventory sync
DEBG [1661342705.234946] [src/net/inv.rs:2254] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): Inventory state has 0 block stats tracked
DEBG [1661342705.235067] [src/net/inv.rs:2518] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): Sampled 0 outbound sync peers
DEBG [1661342705.235402] [src/net/inv.rs:2384] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): inventory sync finished; sortition start is 1
DEBG [1661342705.235417] [src/net/inv.rs:2409] [ThreadId(2)] local.80000000://(bind=0.0.0.0:20444)(pub=None): inv sync finished, learned something new (have 0 inv states), ibd: false, num_inv_syncs: 0, num_sync_neighbors: 0

Using config.toml to bootstrap to epoch 2.1:

[burnchain]
chain = "bitcoin"
mode = "mocknet"
commit_anchor_block_within = 1500

[[burnchain.epochs]]
epoch_name = "1.0"
start_height = 0

[[burnchain.epochs]]
epoch_name = "2.0"
start_height = 0

[[burnchain.epochs]]
epoch_name = "2.05"
start_height = 1

[[burnchain.epochs]]
epoch_name = "2.1"
start_height = 2

zone117x commented 1 year ago

I found this error is triggered on this line: https://github.com/stacks-network/stacks-blockchain/blob/7d960260b525d270e97bbafc8f09bd77d883b37b/src/net/rpc.rs#L420

This get_ancestor_sort_id call returns db_error:NotFoundError https://github.com/stacks-network/stacks-blockchain/blob/7d960260b525d270e97bbafc8f09bd77d883b37b/src/chainstate/burn/db/sortdb.rs#L3488-L3490

I'm not sure what the fix is here. Perhaps a mismatch between the reward start height and sortition ID?

zone117x commented 1 year ago

Also of note: I think the is_pox_active: cur_cycle_pox_active value from the above line (and the parent object current_cycle) is the only part of this function that doesn't respect the tip parameter. Although, I'm not sure if a non-default tip param is explicitly supported in other parts of this /v2/pox fn.

jcnelson commented 1 year ago

This can be reproduced if you hit up /v2/pox aggressively during when a sortition is being processed.

jcnelson commented 1 year ago

Related: don't use the tip argument at all since the sortition DB won't honor it. Also, see if we can get a "read transaction" for querying all the state we need in one query.

zone117x commented 1 year ago

@jcnelson accidental close?

jcnelson commented 1 year ago

yup, fat-fingered

jcnelson commented 1 year ago

I think this is fixed in next if you wanna try it again.

zone117x commented 1 year ago

Unfortunately it doesn't appear to be fixed. Running into the same problem (occurs at the same frequency as originally reported).

Pulling in the temp work-around from https://github.com/stacks-network/stacks-blockchain/pull/3281 still works in preventing the 500 responses (it just ignores errors from that sortdb.is_pox_active(burnchain, &burnchain_tip) line.

Something is up with that is_pox_active function, I've gone down a few rabbit holes trying to debug why but no luck.

jcnelson commented 1 year ago

Does this happen if v1_unlock_height is the same as epoch 2.1's start_height?

zone117x commented 1 year ago

Does this happen if v1_unlock_height is the same as epoch 2.1's start_height?

I'm actually unable to get the node to mine the epoch 2.1 start block when setting [burnchain.pox_2_activation] to the same value as [[burnchain.epochs]] epoch 2.1 start_height.

I get the following error message repeating:

ERRO [1667850564.225276] [testnet/stacks-node/src/neon_node.rs:1774] [miner-block-http://0.0.0.0:20443] Relayer: Failure fetching recipient set: ChainstateError(ClarityError(Interpreter(Unchecked(NoSuchContract("ST000000000000000000002AMW42H.pox-2")))))

Here's the config I'm using:

[node]
name = "krypton-node"
rpc_bind = "0.0.0.0:20443"
p2p_bind = "0.0.0.0:20444"
working_dir = "/chainstate/stacks-blockchain-data"

miner = true
use_test_genesis_chainstate = true
pox_sync_sample_secs = 1
wait_time_for_blocks = 0
wait_time_for_microblocks = 50
microblock_frequency = 1000

[miner]
first_attempt_time_ms = 30000
subsequent_attempt_time_ms = 5000

[connection_options]
disable_block_download = true
disable_inbound_handshakes = true
disable_inbound_walks = true
public_ip_address = "1.1.1.1:1234"

[burnchain]
chain = "bitcoin"
mode = "krypton"
poll_time_secs = 1
peer_host = "localhost"
peer_port = 18444
rpc_port = 18443
rpc_ssl = false
username = "btc"
password = "btc"
timeout = 30

pox_2_activation = 105 # <---- mining error when set to the same value as epoch 2.1 start height

[[burnchain.epochs]]
epoch_name = "1.0"
start_height = 0

[[burnchain.epochs]]
epoch_name = "2.0"
start_height = 103

[[burnchain.epochs]]
epoch_name = "2.05"
start_height = 104

[[burnchain.epochs]]
epoch_name = "2.1"
start_height = 105

I also get the same error when bumping it to one block higher. In the above config, I need to set pox_2_activation to 107 (2 higher than the epoch21 start height) in order for the miner to progress.

It also works if I set both the epoch21 start height and pox_2_activation to 107.

Is pox_2_activation rounding down to the height of the previous pox cycle?

gregorycoppola commented 1 year ago

@zone117x i'm not sure what the discrepancy is, but I saw it work with

[node]
name = "krypton-node"
rpc_bind = "0.0.0.0:20443"
p2p_bind = "0.0.0.0:20444"
# working_dir = "$DATA_DIR"

seed = "9e446f6b0c6a96cf2190e54bcd5a8569c3e386f091605499464389b8d4e0bfc201"
local_peer_seed = "9e446f6b0c6a96cf2190e54bcd5a8569c3e386f091605499464389b8d4e0bfc201"

miner = true
use_test_genesis_chainstate = true
pox_sync_sample_secs = 5
wait_time_for_blocks = 0
wait_time_for_microblocks = 1000
microblock_frequency = 5000
# mine_microblocks = true
# max_microblocks = 10
#
[[events_observer]]
endpoint = "localhost:50303"
retry_count = 255
events_keys = ["*"]

[miner]
first_attempt_time_ms = 30000
subsequent_attempt_time_ms = 5000
block_reward_recipient = "ST9V4MGBMGHX0WR3MG6JA53HNKMJ2HF9VWRQNTKR"

[connection_options]
# inv_sync_interval = 10
# download_interval = 10
# walk_interval = 10
disable_block_download = true
disable_inbound_handshakes = true
disable_inbound_walks = true
public_ip_address = "1.1.1.1:1234"

[burnchain]
chain = "bitcoin"
mode = "krypton"
poll_time_secs = 1
pox_2_activation = 105

### bitcoind-regtest connection info
peer_host = "127.0.0.1"
peer_port = 18444
rpc_port = 18443
rpc_ssl = false
username = "krypton"
password = "krypton"
timeout = 30

[[burnchain.epochs]]
epoch_name = "1.0"
start_height = 0

[[burnchain.epochs]]
epoch_name = "2.0"
start_height = 103

[[burnchain.epochs]]
epoch_name = "2.05"
start_height = 104

[[burnchain.epochs]]
epoch_name = "2.1"
start_height = 105

zone117x commented 1 year ago

The problem is some block height configs (seemingly arbitrarily) don't work. My guess is something related to pox_2_activation needing to be on a pox cycle boundary relative to one of the start_height configs.

zone117x commented 1 year ago

Also this thread is getting a bit off topic (although the latest issue we are discussing might be related to the initial issue idk). The focus of this issue to fix the HTTP 500 responses returned from GET /v2/pox.

Should I open a new issue for Miner halts with error for certain block height configs?

cc @jcnelson ?

jcnelson commented 1 year ago

Should I open a new issue for Miner halts with error for certain block height configs?

Yes, let's do that.

zone117x commented 1 year ago

The /v2/pox HTTP 500 error was fixed by @jcnelson in https://github.com/stacks-network/stacks-blockchain/pull/3399

I'm closing this issue, and opening a new one for the Miner halts with error for certain block height configs issue discussed above.

stacks-network / stacks-core

[Stacks 2.1] 500 response from `GET /v2/pox` #3261