stacks-network / stacks-core

The Stacks blockchain implementation
https://docs.stacks.co
GNU General Public License v3.0

Argon node intermittently crashes on startup #1795

Closed: whoabuddy closed this issue 4 years ago

whoabuddy commented 4 years ago

Wondering if this is related to #1781 at all?

Describe the bug

I ran the argon miner via this script, which I now realize did not have debug enabled. I can run the process again with BLOCKSTACK_DEBUG=1 and RUST_BACKTRACE=full and report a second log if needed, but for now, I am preserving the directory at @jcnelson's request.

Steps To Reproduce

  1. rm -rf ./stacks-blockchain
  2. script 20200805-stacks-blockchain <-- handy way to record terminal output!
  3. curl -o- https://raw.githubusercontent.com/AbsorbingChaos/bks-setup-miner/master/config-miner-argon.sh | bash

Expected behavior

To still have a running testnet miner :slightly_smiling_face:

Environment (please complete the following information):

Additional context

Testnet started with cargo testnet start: https://github.com/AbsorbingChaos/bks-setup-miner/blob/38b29b713d073073278093fb57869f6f2bfda701/config-miner-argon.sh#L136

Attached is a text file of all miner output from start of script through panic.

20200805-stacks-blockchain-panic.txt

Here are the last lines from the text file; since it includes the build and the miner running for a while, the full file is quite long!

INFO [1596665228.545] [testnet/stacks-node/src/neon_node.rs:834] [ThreadId(6)] Stacks block assembled: 999cfa20d8942286c30e82b02298699283a4e31257949d612b42c50d3a2b21e9, with 1 txs
INFO [1596665229.155] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:378] [ThreadId(6)] Miner node: submitting leader_block_commit op - 02133c62c04986d5d9b63a77305b0d15e410e1ac82c034b6223903ecfac7637b77
INFO [1596665253.742] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 15727-15728...
INFO [1596665255.382] [testnet/stacks-node/src/neon_node.rs:891] [ThreadId(1)] Received burnchain block #15728 including block_commit_op - muKRMCFx6ozidGLnsw7PmahCTTf5WPAPED
INFO [1596665255.386] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #15728 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
INFO [1596665255.423] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15728 including key_register_op - ST3MMDYNCCSYKB9E77KD9QD8RG2QY72X6V444X0RX
INFO [1596665255.427] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15728 including key_register_op - ST2TJRHDHMYBQ417HFB0BDX430TQA5PXRX6495G1V
WARN [1596665255.428] [testnet/stacks-node/src/neon_node.rs:499] [ThreadId(6)] Did not win sortition, my blocks [burn_hash= 01a5b972122fabfab6a51ef01509530732982202e088f3c8a563a51f2ec469a2, block_hash= 999cfa20d8942286c30e82b02298699283a4e31257949d612b42c50d3a2b21e9], their blocks [par_burn_hash= 01a5b972122fabfab6a51ef01509530732982202e088f3c8a563a51f2ec469a2, burn_hash= 76af810faeaeddfe8a94126e6c08154fabfc1c697804204d3e99b7926c6a16fe, block_hash =e6a9339d5ec41e34bc0a92ef45b2c2f2e66d1f2e96c5f77ab73a1afbaa45461a]
WARN [1596665255.454] [testnet/stacks-node/src/neon_node.rs:684] [ThreadId(1)] Skipped tenure because no active VRF key. Trying to register one.
INFO [1596665256.144] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:328] [ThreadId(6)] Miner node: submitting leader_key_register op - 02133c62c04986d5d9b63a77305b0d15e410e1ac82c034b6223903ecfac7637b77
INFO [1596665260.849] [src/net/relay.rs:800] [ThreadId(6)] Processing newly received blocks: 1
INFO [1596665287.263] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 15728-15729...
INFO [1596665289.222] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #15729 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
INFO [1596665289.226] [testnet/stacks-node/src/neon_node.rs:891] [ThreadId(1)] Received burnchain block #15729 including block_commit_op - n2nQmUHj4FoFW6dv7LngnmANs92YwRSu8h
INFO [1596665289.257] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15729 including key_register_op - ST2TJRHDHMYBQ417HFB0BDX430TQA5PXRX6495G1V
INFO [1596665289.274] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15729 including key_register_op - ST2BP6PK4V5FKBZS5Q3T2QXF1DGWTF17EBBTKAT72
INFO [1596665289.284] [testnet/stacks-node/src/neon_node.rs:678] [ThreadId(1)] Sleeping 0 before issuing tenure
INFO [1596665289.604] [src/chainstate/stacks/miner.rs:499] [ThreadId(6)] Miner: mined anchored block 1df292be329e8b900b9ddcb695933533e7ab5316787a9976beffedf544f91527, parent block 425cd694ec164073a90da3c2d5e6a1beabd6b455c68e47f0fdafb529ea3dd982, state root = 515a5785c2d352a75a350f20191fd3640e08ef9c5e51532533784e59ce78acc6
INFO [1596665289.620] [testnet/stacks-node/src/neon_node.rs:834] [ThreadId(6)] Stacks block assembled: 1df292be329e8b900b9ddcb695933533e7ab5316787a9976beffedf544f91527, with 1 txs
INFO [1596665290.195] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:378] [ThreadId(6)] Miner node: submitting leader_block_commit op - 02133c62c04986d5d9b63a77305b0d15e410e1ac82c034b6223903ecfac7637b77
INFO [1596665314.796] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 15729-15730...
ERROR [1596665316.313] [src/chainstate/burn/db/sortdb.rs:1242] [ThreadId(7)] Failed to load snapshot for block 15729 from fork 7f1df7d4744c0d883bbf86a8b63b908d98db06b6bb9b59c65dfe05ce0cfc747b
ERROR [1596665316.329] [testnet/stacks-node/src/neon_node.rs:288] [ThreadId(7)] P2P: Failed to process network dispatch: DBError(Corruption)
Process abort due to thread panic
INFO [1596665316.393] [testnet/stacks-node/src/neon_node.rs:891] [ThreadId(1)] Received burnchain block #15730 including block_commit_op - muKRMCFx6ozidGLnsw7PmahCTTf5WPAPED
INFO [1596665316.397] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #15730 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
INFO [1596665316.418] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15730 including key_register_op - ST3MMDYNCCSYKB9E77KD9QD8RG2QY72X6V444X0RX
INFO [1596665316.423] [testnet/stacks-node/src/neon_node.rs:902] [ThreadId(1)] Received burnchain block #15730 including key_register_op - ST2TJRHDHMYBQ417HFB0BDX430TQA5PXRX6495G1V
WARN [1596665316.424] [testnet/stacks-node/src/neon_node.rs:499] [ThreadId(6)] Did not win sortition, my blocks [burn_hash= 292b022c34cd2c5220e8e5cb105d9b634b77098eaddd8361999c8c6db23f806b, block_hash= 1df292be329e8b900b9ddcb695933533e7ab5316787a9976beffedf544f91527], their blocks [par_burn_hash= 292b022c34cd2c5220e8e5cb105d9b634b77098eaddd8361999c8c6db23f806b, burn_hash= 7f1df7d4744c0d883bbf86a8b63b908d98db06b6bb9b59c65dfe05ce0cfc747b, block_hash =cca70d6343183f8ec0ebf34daa2f39b9a42b5265ad93fd957fc127e1eb2a9b14]
WARN [1596665316.433] [testnet/stacks-node/src/neon_node.rs:684] [ThreadId(1)] Skipped tenure because no active VRF key. Trying to register one.
INFO [1596665317.025] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:328] [ThreadId(6)] Miner node: submitting leader_key_register op - 02133c62c04986d5d9b63a77305b0d15e410e1ac82c034b6223903ecfac7637b77
   0: stacks_node::main::{{closure}}
             at testnet/stacks-node/src/main.rs:44
   1: std::panicking::rust_panic_with_hook
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panicking.rs:474
   2: std::panicking::begin_panic
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panicking.rs:397
   3: stacks_node::neon_node::spawn_peer::{{closure}}
             at testnet/stacks-node/src/neon_node.rs:289
   4: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/sys_common/backtrace.rs:130
   5: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/thread/mod.rs:475
   6: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panic.rs:318
   7: std::panicking::try::do_call
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panicking.rs:303
   8: __rust_maybe_catch_panic
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libpanic_unwind/lib.rs:86
   9: std::panicking::try
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panicking.rs:281
  10: std::panic::catch_unwind
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/panic.rs:394
  11: std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/thread/mod.rs:474
  12: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libcore/ops/function.rs:232
  13: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/liballoc/boxed.rs:1017
  14: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/liballoc/boxed.rs:1017
      std::sys_common::thread::start_thread
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/sys_common/thread.rs:13
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd/src/libstd/sys/unix/thread.rs:80
  15: start_thread
  16: clone
jcnelson commented 4 years ago

Thanks for all the information @whoabuddy! It does look like the same underlying issue. Do you happen to know what the last git commit on master was when you built your node? #1774 should have fixed this issue, so I'd be very curious to know whether this problem occurred after #1774 was merged.

whoabuddy commented 4 years ago

Happy to help! I can check when I'm at the office again, but it may not be until Monday. I haven't done anything else with the VM since uploading the info here in case we needed more information.

Given that PR was merged on 7/28, and I deleted the stacks-blockchain directory then cloned the repo fresh on 8/6, my guess is it should have included that commit from master.

psq commented 4 years ago

@jcnelson I got the same crash twice right after startup today (after the burnchain download completes). I definitely have d8d4043eb046b0e9f57afd337044646fa25dc231. The node can't load the snapshot for either the last known block or the one before it (the 2 crashes look slightly different in what happened beforehand).

Will see if I can reproduce with possibly more logs.

jcnelson commented 4 years ago

Thanks for getting back to me on this @psq! Were these nodes booted up from an empty chainstate, or an existing chainstate? The node currently doesn't make any guarantees about the consistency of the data of an existing chainstate directory -- there isn't yet a graceful shutdown feature, so it's possible to see a corruption error on restarting with an existing chainstate directory (especially one that could arise from the MARF recording that a snapshot exists, but the snapshot DB not returning it).
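
A simplified sketch of that failure mode (hypothetical names, not the actual sortdb code): the index claims an ancestor snapshot exists, the snapshot table cannot produce the row, and the mismatch is reported as corruption even when the real cause is a partially written chainstate directory.

#[derive(Debug)]
enum DbError {
    NotFound,
    Corruption,
}

fn get_snapshot(marf_has_entry: bool, db_row: Option<String>) -> Result<String, DbError> {
    match (marf_has_entry, db_row) {
        // The index and the snapshot table agree: return the snapshot.
        (true, Some(snapshot)) => Ok(snapshot),
        // The index says the snapshot exists but the table has no row for it --
        // this is the case that surfaces as DBError(Corruption).
        (true, None) => Err(DbError::Corruption),
        // The index never recorded the snapshot at all.
        (false, _) => Err(DbError::NotFound),
    }
}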

psq commented 4 years ago

Empty chain state. I think it is consistent with some crashes I saw last week. It seems like there is a point during boot-up where this is quite likely to happen; if you get past that point you are fine, although it can still happen later, as it did for @whoabuddy (the log looks very similar, so I'm assuming it's the same issue). I just got a successful startup, so I'll try to compare the two. I was running a few changes on top of master, but mostly extra logging and extra configuration code (with the same default values), so that should not matter here.

psq commented 4 years ago

@kantai when you wrote the comment on this line https://github.com/blockstack/stacks-blockchain/blob/afc748aa1615ff966675a828f02efd78e092c2d8/src/chainstate/burn/db/sortdb.rs#L1241 why did you say this shouldn't be possible?

It seems to be happening more often than you'd wish, and then the node panics at https://github.com/blockstack/stacks-blockchain/blob/master/testnet/stacks-node/src/neon_node.rs#L289 because of the error returned.

Any possible cause for the snapshot not being there when it should be?

jcnelson commented 4 years ago

Technically, I wrote that line, and @kantai later modified it.

That branch shouldn't be reachable if everything is working right (and clearly, it is not), because the stable block height is always less than or equal to the chain height, meaning there must always be a corresponding snapshot for it. @kantai, could this code be getting reached because of the off-by-one error you were describing earlier, in PR #1787?

kantai commented 4 years ago

I don't think that off-by-one would cause this -- I'm pretty sure that that bug isn't present in master in any case.

My initial suspicion here is that get_block_snapshot_by_height expects an "effective" block height (i.e., block-height - first-block-height), or that a similar mismatch is at play.
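
A small illustration of that suspicion, under the assumption that the lookup wants an offset from the node's first indexed burn block (the function and constants below are invented for the example, not the real sortdb API):

fn effective_height(block_height: u64, first_block_height: u64) -> Option<u64> {
    // "Effective" height is relative to the first burn block this node indexed.
    block_height.checked_sub(first_block_height)
}

fn main() {
    let first_block_height = 15_000; // hypothetical first indexed burn block
    let chain_tip = 15_729;          // absolute burnchain height, as in the logs above

    // Passing the offset form where it is expected works...
    assert_eq!(effective_height(chain_tip, first_block_height), Some(729));

    // ...but mixing the two conventions (handing an absolute height to code that
    // expects the offset, or vice versa) points past the snapshots the node actually
    // has and surfaces as "Failed to load snapshot".
}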

psq commented 4 years ago

Technically, I wrote that line, and @kantai later modified it.

ah, sorry! I went with what git blame was giving me...

I can't speak for what is happening with @whoabuddy (although without the timing changes, the node can get behind in processing snapshots), but in my 2 cases my node had only processed about 1/3 of the snapshots (at least up to a consistent height where it has all of the prior snapshots), so retrieving the snapshot for the most recent burn block is going to fail. I'm still a bit unclear on how the different storage areas relate to each other, and in what order things are supposed to happen when everything goes fine. Maybe it is related to starting to mine too early (https://github.com/blockstack/stacks-blockchain/issues/1745), before the state has reached a fully consistent shape?

I'll look into how the stable block height is calculated; maybe it is too eager when the node is starting up, or needs to catch up later on (although if that were really the reason, I would expect the panic to happen a lot more often).

Also, maybe using panic here is too drastic: the database is not complete yet, so it may appear to be corrupted, but I suspect this is only temporary.
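
A rough sketch of what a softer failure mode could look like -- hypothetical code, not the actual neon_node dispatch loop, and whether bounded retries are actually safe there would need review:

fn process_with_retry<T, E, F>(mut attempt: F, max_retries: u32) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Debug,
{
    let mut tries = 0;
    loop {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) if tries < max_retries => {
                tries += 1;
                // Back off briefly so a writer thread that is still committing can
                // finish, then retry instead of treating the miss as fatal corruption.
                eprintln!("transient error ({}/{}), retrying: {:?}", tries, max_retries, e);
                std::thread::sleep(std::time::Duration::from_millis(500));
            }
            // Still failing after the allowed retries: give up and report the error.
            Err(e) => return Err(e),
        }
    }
}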

whoabuddy commented 4 years ago

I have been following everything here but have not gone into the other office yet. I've been in sort of a "transition phase" between locations but definitely plan to get physical access to that computer and its virtual miners today. From there I can check the git log and do some further testing, as well as restore my remote access.

In the meantime, I modified my shell script to add a "debug mode".

I used that version of the shell script yesterday with a virtual miner on this laptop with the same configuration (Virtualbox, Ubuntu Server, 1gb RAM) and it's been running for ~14 hours with no issues now. :man_shrugging: I will leave it running and report back if it panics the same way, but so far, so good.

I wish I could read/interpret the logs or understand the operations here as well as everyone else in the thread, but I'm doing my best to keep up. If there is any additional information I can grab that might be helpful, let me know!

whoabuddy commented 4 years ago

Update: @jcnelson I finally got in front of that miner today; the git log shows this at the head:

commit afc748aa1615ff966675a828f02efd78e092c2d8
Merge: 2f7611c8e e86cc9571
Author: Aaron Blankstein <aaron@blockstack.com>
Date:   Sun Aug 2 11:10:55 2020 -0500

    Merge pull request #1782 from blockstack/bugtemplate-patch-1

    Update testnet-bug.md

and about 8 commits back:

commit d8d4043eb046b0e9f57afd337044646fa25dc231
Merge: e147ec3ed 72deb87f1
Author: Jude Nelson <judecn@gmail.com>
Date:   Mon Jul 27 20:28:40 2020 +0000

    Merge pull request #1774 from blockstack/fix/marf-1557

    Fix/marf 1557

I can upload the log if needed. I want to also note that the node running on my laptop is still running without issue, so I was not able to replicate the issue on a different VM.

I am going to start the other 3 miner VMs on this machine with the debug settings enabled, so if they do panic I'll have a more detailed log to share (and it will definitely be the most updated code/a fresh directory).

Is there any additional information I can pull from this VM that panicked that would be helpful? For now I will leave it be just in case, otherwise I will start it back up fresh with debug enabled as well.

psq commented 4 years ago

@whoabuddy the log and the content of the data folder would be helpful. I have 2 cases I have been looking at locally, but these happen during the initial sync, not later, so yours might have something different.

whoabuddy commented 4 years ago

@psq I'm back in front of this machine today, and excuse my ignorance, I am not sure what files/paths you need in particular.

Back in my first post I included a copy of the terminal output (without debug), and I did my best to find the folders on the VM but still wasn't sure. I do see a number of files in stacks-blockchain/target/debug - maybe that's the one you are looking for?

I am happy to preserve this machine as-is and provide the information once I better understand.

On a side note - happy to report my laptop and the other 3 nodes on this machine are running normally so far, so not sure what was going on with this one, but always happy to dig in and learn more!

psq commented 4 years ago

The data folder is in /tmp by default, so look there. If you were not running anything else, the folder with the most recent timestamp is most likely the one we need. If you look in your log, you should see the path it is using repeated quite often. For example, mine shows:

./.stack/burnchain/db/bitcoin/regtest/sortition.db/data.db

but that's because I've set

working_dir = "./.stack"

in my configuration file.

Yours should say something like

/tmp/?????/burnchain/db/bitcoin/regtest/sortition.db/data.db

The top level of that data directory should have

burnchain
chainstate
peer_db.sqlite

The other thing we need is the full log (either from the file if you redirected the output, or by copying the whole content of the console).

whoabuddy commented 4 years ago

@psq Thank you! Using the node on my laptop I can see the files in /tmp, so that's what I needed! I am not in front of that desktop to grab the data directory, but this is a copy of the log (linked in the first post here as well). I will work on getting the data directory from that VM next time I'm in front of it.

The node running on my laptop panicked this morning, and based on the script recording, it ran from 2020-08-11 17:35:32+00:00 through 2020-08-17 13:00:49+00:00.

Steps To Reproduce

curl -sS -O https://raw.githubusercontent.com/AbsorbingChaos/bks-setup-miner/master/config-miner-argon.sh
bash config-miner-argon.sh debug

Debug mode is described in the readme if more info is needed on the script.

Environment (please complete the following information):

OS: Ubuntu Server 20.04
Rust version: rustc 1.44.1

Additional context

Testnet started with BLOCKSTACK_DEBUG=1 RUST_BACKTRACE=full cargo testnet start: https://github.com/AbsorbingChaos/bks-setup-miner/blob/master/config-miner-argon.sh#L172

The last 50 lines of the log

DEBUG [1597669248.582] [src/burnchains/bitcoin/indexer.rs:458] [ThreadId(1)] Sync all headers starting at block 22244
DEBUG [1597669248.593] [src/burnchains/bitcoin/spv.rs:394] [ThreadId(1)] Initialized block headers at /tmp/stacks-testnet-c916a61c99f04694/burnchain/spv-headers.dat
DEBUG [1597669248.593] [src/burnchains/bitcoin/indexer.rs:373] [ThreadId(1)] (Re)establish peer connection
DEBUG [1597669248.669] [src/burnchains/bitcoin/network.rs:164] [ThreadId(1)] Begin peer handshake to 35.236.218.197:18444
DEBUG [1597669248.672] [src/burnchains/bitcoin/network.rs:251] [ThreadId(1)] Send version (nonce=12309017456111387569) to 35.236.218.197:18444
DEBUG [1597669248.834] [src/burnchains/bitcoin/network.rs:259] [ThreadId(1)] Handle version -- remote peer blockchain height is 22244
DEBUG [1597669248.836] [src/burnchains/bitcoin/network.rs:274] [ThreadId(1)] Send verack
DEBUG [1597669248.838] [src/burnchains/bitcoin/network.rs:283] [ThreadId(1)] Handle verack
DEBUG [1597669248.839] [src/burnchains/bitcoin/network.rs:172] [ThreadId(1)] Established connection to 35.236.218.197:18444, who has 22244 blocks
DEBUG [1597669248.839] [src/burnchains/bitcoin/spv.rs:656] [ThreadId(1)] Have all headers up to 22244
DEBUG [1597669248.839] [src/burnchains/bitcoin/indexer.rs:394] [ThreadId(1)] begin_session() terminates conversation
DEBUG [1597669248.839] [src/burnchains/burnchain.rs:698] [ThreadId(1)] Sync'ed headers from 22244 to 22244. DB at 22244
DEBUG [1597669248.841] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:202] [ThreadId(1)] Done receiving blocks
DEBUG [1597669249.382] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:792] [ThreadId(6)] BitcoinRPC builder: None:None@http://35.236.218.197:18443/
   0: stacks_node::main::{{closure}}
             at testnet/stacks-node/src/main.rs:44
   1: std::panicking::rust_panic_with_hook
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panicking.rs:515
   2: std::panicking::begin_panic
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panicking.rs:438
   3: stacks_node::neon_node::spawn_peer::{{closure}}
             at testnet/stacks-node/src/neon_node.rs:289
   4: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/sys_common/backtrace.rs:130
   5: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/thread/mod.rs:475
   6: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panic.rs:318
   7: std::panicking::try::do_call
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panicking.rs:331
   8: __rust_try
   9: std::panicking::try
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panicking.rs:274
  10: std::panic::catch_unwind
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/panic.rs:394
  11: std::thread::Builder::spawn_unchecked::{{closure}}
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/thread/mod.rs:474
  12: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libcore/ops/function.rs:232
  13: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/liballoc/boxed.rs:1008
      <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/liballoc/boxed.rs:1008
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/c7087fe00d2ba919df1d813c040a5d47e43b0fe7/src/libstd/sys/unix/thread.rs:87
  14: start_thread
  15: clone

The last 10 commits via git log

afc748aa1 Merge pull request #1782 from blockstack/bugtemplate-patch-1
e86cc9571 Update testnet-bug.md
2f7611c8e Update testnet-bug.md
68a5396e3 Merge pull request #1775 from blockstack/fix/net-1750
2e38bade1 remove obsolete debug output
411c7af98 add test for get_parent_block_header
ee799727b remove unnecessary argument
0bab320f0 fix #1750 -- the problem was that we weren't using the right anchor block hash to load a block's header
d8d4043eb Merge pull request #1774 from blockstack/fix/marf-1557
e147ec3ed Merge pull request #1769 from blockstack/feat/bitcoin-gh-tests

The full log file (with debug enabled)

Since this node ran for almost 6 days, the log file is 5gb uncompressed, 514mb compressed. :eyes:

https://envl.app/absorbingchaos/SsH3pE!g6TTPpMOgrI7JBEF

The data directory

This one was quite large as well, 557mb compressed.

https://envl.app/absorbingchaos/3C4bmv!2Psq2a5vErdpfjQa

timstackblock commented 4 years ago

Thank you for the submission @whoabuddy 👍

timstackblock commented 4 years ago

@jcnelson @psq @whoabuddy

Quick question just to clarify: this bug looks very similar to #1781, so can I label this bug as a duplicate?

psq commented 4 years ago

@timstackblock the crashes happen in different threads, so they are probably not directly related. The root cause may be the same, but it's not possible to tell yet.

timstackblock commented 4 years ago

Thanks for the info @psq, I will watch the issue and follow up once there has been more traction.

danielimkk commented 4 years ago

@timstackblock Actually I ran into this issue as well. It happens after my node syncs and has been working for a while.

I've been launching a node with the sidecar API attached. In order to run that, I've been using the following Docker images:

  1. For stacks node - blockstack/stacks-blockchain:latest
  2. For API - blockstack/stacks-blockchain-api:latest
  3. For DB - postgres:12.2

Below is my list of environment variables. Postgres environment variables:

  POSTGRES_DB: stacks_blockchain_api
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: password
  POSTGRES_LOGGING: true
  POSTGRES_PORT: 5490

Stacks Node environment variables :

  STACKS_EVENT_OBSERVER: 127.0.0.1:3700
  XBLOCKSTACK_DEBUG: 1
  RUST_BACKTRACE: 1

API environment variables:

  POSTGRES_HOST: 127.0.0.1
  POSTGRES_PORT: 5490
  PG_DATABASE: stacks_blockchain_api
  POSTGRES_USER: postgres
  POSTGRES_PASSWORD: password
  STACKS_CORE_EVENT_PORT: 3700
  STACKS_CORE_EVENT_HOST: http://0.0.0.0
  STACKS_BLOCKCHAIN_API_PORT: 3999
  STACKS_BLOCKCHAIN_API_HOST: 0.0.0.0
  STACKS_BLOCKCHAIN_API_DB: pg
  STACKS_CORE_RPC_HOST: 127.0.0.1
  STACKS_CORE_RPC_PORT: 20443
  NODE_ENV: development

Here are the last 50 lines of the logs, without DEBUG enabled:

INFO [1597906202.216] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 784-785...
INFO [1597906202.855] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #785 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
INFO [1597906204.193] [src/net/relay.rs:800] [ThreadId(5)] Processing newly received blocks: 1
INFO [1597906234.238] [src/net/p2p.rs:2098] [ThreadId(6)] Failed to query public IP (RecvTimeout; skipping
INFO [1597906323.790] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 785-786...
INFO [1597906324.430] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #786 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
INFO [1597906325.434] [src/net/relay.rs:800] [ThreadId(5)] Processing newly received blocks: 1
WARN [1597906364.306] [src/net/p2p.rs:943] [ThreadId(6)] Failed to get peer address of TcpStream { addr: V4(192.168.186.228:45644), fd: 37 }: Os { code: 107, kind: NotConnected, message: "Socket not connected" }
INFO [1597906364.306] [src/net/p2p.rs:1753] [ThreadId(6)] local.80000000://(bind=V4(0.0.0.0:20444))(pub=None): Failed to connect to myself for IP confirmation
INFO [1597906446.394] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 786-787...
INFO [1597906447.013] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #787 including block_commit_op (winning) - mo8CosqCssC3jgdQZEiHqbE5m1kHTGoGCX
WARN [1597906495.378] [src/net/p2p.rs:943] [ThreadId(6)] Failed to get peer address of TcpStream { addr: V4(192.168.186.228:46844), fd: 37 }: Os { code: 107, kind: NotConnected, message: "Socket not connected" }
INFO [1597906495.378] [src/net/p2p.rs:1753] [ThreadId(6)] local.80000000://(bind=V4(0.0.0.0:20444))(pub=None): Failed to connect to myself for IP confirmation
INFO [1597906495.795] [src/net/relay.rs:800] [ThreadId(5)] Processing newly received blocks: 1
INFO [1597906567.208] [src/burnchains/burnchain.rs:704] [ThreadId(1)] Node will fetch burnchain blocks 787-788...
ERROR [1597906567.901] [src/chainstate/burn/db/sortdb.rs:1242] [ThreadId(6)] Failed to load snapshot for block 787 from fork 148faf4306a778442ad9e51c94e2b72dec3d05afc24b6a94bcddc311f1d2fa4b
ERROR [1597906567.901] [testnet/stacks-node/src/neon_node.rs:288] [ThreadId(6)] P2P: Failed to process network dispatch: DBError(Corruption)
Process abort due to thread panic
INFO [1597906567.919] [testnet/stacks-node/src/neon_node.rs:882] [ThreadId(1)] Received burnchain block #788 including block_commit_op (winning) - mx2uds6sgnn9znABQ6iDSSmXY9K5D4SHF9
   0: stacks_node::main::{{closure}}
             at src/testnet/stacks-node/src/main.rs:44
   1: std::panicking::rust_panic_with_hook
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:490
   2: std::panicking::begin_panic
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:410
   3: stacks_node::neon_node::spawn_peer::{{closure}}
             at src/testnet/stacks-node/src/neon_node.rs:289
      std::sys_common::backtrace::__rust_begin_short_backtrace
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/sys_common/backtrace.rs:130
   4: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/thread/mod.rs:475
      <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panic.rs:318
      std::panicking::try::do_call
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:297
      std::panicking::try
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panicking.rs:274
      std::panic::catch_unwind
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/panic.rs:394
      std::thread::Builder::spawn_unchecked::{{closure}}
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/thread/mod.rs:474
      core::ops::function::FnOnce::call_once{{vtable.shim}}
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libcore/ops/function.rs:232
   5: <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/liballoc/boxed.rs:1076
      <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/liballoc/boxed.rs:1076
      std::sys::unix::thread::Thread::new::thread_start
             at rustc/d3fb005a39e62501b8b0b356166e515ae24e2e54/src/libstd/sys/unix/thread.rs:87

The full logs can be found at these links:

  1. Without DEBUG: https://drive.google.com/file/d/1aZLocBxhM_6zLzHNwYM3a0j1f-PTVf7Q/view?usp=sharing
  2. With DEBUG: https://drive.google.com/file/d/14QaMRJMN2x3r1ocgJTuqvtdLYQPOVCIL/view?usp=sharing

timstackblock commented 4 years ago

Thanks for the info @danielimkk

psq commented 4 years ago

Thank you for the logs, @danielimkk; nice that this one happened very quickly. Is this something you built yourself? If so, what's the result of git log -1 --format=format:"%H"?

kantai commented 4 years ago

I think I have a good idea of what's going on here, but I want to try to recreate it a few more times before concluding that this is actually the cause. Our "MARF-indexed database" transactions maintain two SQLite connections to two separate databases. This in turn means that they do not commit atomically -- this is a bug no matter how we slice it. The question is whether or not it is what's causing the issue here. When I started argon miners locally, I would occasionally see the reported behavior, caused by:

DEBUG [1598373620.643] [src/chainstate/stacks/index/trie.rs:745] [ThreadId(28)] Next root hash is f8260157ad4125061a843f265703390ab0e7fd4907358da65aba6c9b7f06b5ac (update_skiplist=false)
DEBUG [1598373620.643] [src/chainstate/stacks/index/marf.rs:470] [ThreadId(28)] MARF Insert in 3d3ec5c29dbea25072d11b2acb28392fddb7266a24a924a82363e3352414f884: '4aa445e2891853300e4901d67c6d3b850a6d4e89f9ade93e196aa35d7e888202' = '5a1883b440688c90e719f9b34c9b39a3ed60843edc2a1a1085c5dc6e7d8daddf0000000000000000' (...[])
DEBUG [1598373620.646] [src/chainstate/stacks/index/trie.rs:745] [ThreadId(28)] Next root hash is bf10d5c54abd0b9de46cc669057669a4f316d01795431b401f440cd96fdc69bf (update_skiplist=true)
DEBUG [1598373620.647] [src/chainstate/burn/db/sortdb.rs:1714] [ThreadId(28)] ACCEPTED(4679) leader key register d3c3bed111d6bb40e686d9db2ea0162ae08c8eadb3354c422d4cf6b7fce4df0d at 4679,1
DEBUG [1598373620.647] [src/chainstate/burn/db/sortdb.rs:1714] [ThreadId(28)] ACCEPTED(4679) leader key register 32dcd0afa9a57dabb9bf9eea355106afc2fa734d5623b7fc3e40b312db12bb68 at 4679,2
DEBUG [1598373620.674] [src/chainstate/burn/db/sortdb.rs:1718] [ThreadId(28)] ACCEPTED(4679) leader block commit ab1b32f738184443f6953e3c92a4cf1fa8c5c0e2fd374e306d9f1fd36467bb41 at 4679,3
DEBUG [1598373620.674] [src/chainstate/burn/db/sortdb.rs:1714] [ThreadId(28)] ACCEPTED(4679) leader key register 4b28296a759c08418eb7a4023bfa35d833ffce4c2ad573437510dc3f7b366a96 at 4679,4
DEBUG [1598373620.674] [src/chainstate/burn/db/sortdb.rs:1718] [ThreadId(28)] ACCEPTED(4679) leader block commit d30e45e0985be30922fd92be39ff2bf11234977b4e9991dc410717fdb890079a at 4679,5
DEBUG [1598373620.674] [src/chainstate/burn/db/processing.rs:133] [ThreadId(28)] OPS-HASH(4679): fc988223993ecc56d35368c5c427925eb6388e4511dd18b94dfcf97cfbad03fa
DEBUG [1598373620.674] [src/chainstate/burn/db/processing.rs:134] [ThreadId(28)] INDEX-ROOT(4679): bf10d5c54abd0b9de46cc669057669a4f316d01795431b401f440cd96fdc69bf
DEBUG [1598373620.674] [src/chainstate/burn/db/processing.rs:135] [ThreadId(28)] SORTITION-HASH(4679): 0b55822d9a908a8619c483a0e9b77e13bd591521c8f83a7bb044db6aa13476af
DEBUG [1598373620.674] [src/chainstate/burn/db/processing.rs:136] [ThreadId(28)] CONSENSUS(4679): fef2982ea1d04e4d18e780699c321deee8e8ac62
DEBUG [1598373620.706] [src/net/p2p.rs:2581] [ThreadId(6)] local.80000000://(bind=V4(0.0.0.0:20444))(pub=None): load chain view for burn block 4679
DEBUG [1598373620.707] [src/chainstate/burn/db/sortdb.rs:1533] [ThreadId(6)] No ancestor block 4678 from 3d3ec5c29dbea25072d11b2acb28392fddb7266a24a924a82363e3352414f884 in index
ERROR [1598373620.707] [src/chainstate/burn/db/sortdb.rs:1242] [ThreadId(6)] Failed to load snapshot for block 4678 from fork 3d3ec5c29dbea25072d11b2acb28392fddb7266a24a924a82363e3352414f884
ERROR [1598373620.707] [testnet/stacks-node/src/neon_node.rs:288] [ThreadId(6)] P2P: Failed to process network dispatch: DBError(Corruption)
Process abort due to thread panic
DEBUG [1598373620.712] [src/chainstate/stacks/index/storage.rs:1287] [ThreadId(28)] Flush: 3d3ec5c29dbea25072d11b2acb28392fddb7266a24a924a82363e3352414f884 to self
DEBUG [1598373620.717] [src/chainstate/stacks/index/storage.rs:1333] [ThreadId(28)] Flush: identifier of self is 4680
DEBUG [1598373620.718] [src/burnchains/burnchain.rs:768] [ThreadId(28)] Inserted block 4679 in 6245ms
DEBUG [1598373620.727] [testnet/stacks-node/src/burnchains/bitcoin_regtest_controller.rs:202] [ThreadId(1)] Done receiving blocks

Of particular note:

DEBUG [1598373620.706] [src/net/p2p.rs:2581] [ThreadId(6)] local.80000000://(bind=V4(0.0.0.0:20444))(pub=None): load chain view for burn block 4679
DEBUG [1598373620.707] [src/chainstate/burn/db/sortdb.rs:1533] [ThreadId(6)] No ancestor block 4678 from 3d3ec5c29dbea25072d11b2acb28392fddb7266a24a924a82363e3352414f884 in index

Occurs before this line:

DEBUG [1598373620.718] [src/burnchains/burnchain.rs:768] [ThreadId(28)] Inserted block 4679 in 6245ms

Which means that thread 6 is reading block 4679 as the burn chain tip before thread 28 finished committing it!

When I diagnosed the database after the crash, block 4679's MARF index was able to find its immediate ancestor (4678), which indicates that there isn't a corruption happening, but rather a synchrony issue.

To test that this is indeed the problem, I added a delay between the two commits to see if the bug can be readily reproduced.

To deal with this, I think the MARF index and the corresponding data store must be in the same SQLite database.
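
A rough sketch of that fix, using rusqlite directly (the table names and values are invented for the example; the node's real storage layer is more involved). With two connections to two separate databases there is a window where the index commit is visible while the data commit is not; with both tables in one database, a single transaction publishes them together:

use rusqlite::{Connection, Result};

fn main() -> Result<()> {
    let conn = Connection::open_in_memory()?;

    // Both the "index" and the "data" live in the same database...
    conn.execute_batch(
        "CREATE TABLE marf_index (block_height INTEGER PRIMARY KEY, index_root TEXT);
         CREATE TABLE snapshots  (block_height INTEGER PRIMARY KEY, snapshot   TEXT);",
    )?;

    // ...so one transaction makes the new tip and its snapshot visible together.
    conn.execute_batch(
        "BEGIN;
         INSERT INTO marf_index VALUES (4679, 'bf10d5c5...');
         INSERT INTO snapshots  VALUES (4679, 'snapshot-for-4679');
         COMMIT;",
    )?;

    // Any reader that sees height 4679 in marf_index is now guaranteed to find the
    // matching snapshots row: the window observed between ThreadId(6) and ThreadId(28)
    // in the logs above cannot open inside a single database transaction.
    Ok(())
}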

jazbogross commented 4 years ago

Not sure if this is the same issue, but here is my log file. The miner stops after a while; the last lines look the same as @whoabuddy's.

https://envl.app/jazbogross/8wXmew!tAQJTJhS7WUf9YQW

paradigma-cl commented 4 years ago

Here are some more logs: last.log

diwakergupta commented 4 years ago

Fixed in #1876