nanocurrency / nano-node

Nano is digital currency. Its ticker is: XNO and its currency symbol is: Ӿ
https://nano.org
BSD 3-Clause "New" or "Revised" License
3.47k stars 785 forks source link

Assertion failed - nano::block_store_partial unchecked_put, Docker #3075

Open regfaker opened 3 years ago

regfaker commented 3 years ago

Nano node V21.2, LMDB 0.9.23, running in docker 20.10.1

Steps to reproduce the issue:

  1. Run node according to doc
  2. After 5 minutes it brings failed assertion below and container exits with code 139
Assertion (success (status)) failed
void nano::block_store_partial<Val, Derived_Store>::unchecked_put(const nano::write_transaction&, const nano::unchecked_key&, const nano::unchecked_info&) [with Val = MDB_val; Derived_Store = nano::mdb_store]
/tmp/src/nano/secure/blockstore_partial.hpp:506

 0# 0x00000000008BED83 in nano_node
 1# 0x00000000008C0024 in nano_node
 2# 0x00000000007C8D55 in nano_node
 3# 0x0000000000512B9B in nano_node
 4# 0x0000000000514BD7 in nano_node
 5# 0x0000000000515677 in nano_node
 6# 0x0000000000CFC510 in nano_node
 7# 0x00007F29D1A6E6BA in /lib/x86_64-linux-gnu/libpthread.so.0
 8# clone in /lib/x86_64-linux-gnu/libc.so.6

Environment:

logs nano_node-block_store_partial_unchecked_put.log

regfaker commented 3 years ago

I will try to clear ledger data (data.ldb*) and start again. I will make backups and can possibly share it with you if that would help.

regfaker commented 3 years ago

Hmm, same error after starting with ledger again. Last block count which I've got is

{
    "count": "1",
    "unchecked": "3895048",
    "cemented": "1"
}

It is strange that after node restart it seems that it continues (block count is increasing again) but this is probably because of lazy bootstrap.

zhyatt commented 3 years ago

@wezrule Can you review the above crash when you get a chance?

wezrule commented 3 years ago

how much free disk space do you have?

regfaker commented 3 years ago

enough 115G

wezrule commented 3 years ago

Can you run docker exec ${NANO_NAME} nano_node --debug_generate_crash_report > crash_report.txt and post the contents of that file please (where ${NANO_NAME} is the name of the docker container)

regfaker commented 3 years ago

well that will create zero sized crash_report.txt but I'm attaching all the nanonode* files from nano directory nano_node_crash_files.zip

wezrule commented 3 years ago

Ah yeh that command outputs to a file nano_node_crash_report.txt not to standard output which you've attached thanks.

wezrule commented 3 years ago

So we've had this reported once or something similar at least on discord, but I don't think there was a resolution, it was suspected hardware failure. To add extra debugging information in the future when asserting (mainly the error code), I've made this PR https://github.com/nanocurrency/nano-node/pull/3077. I've backported just the unchecked change which is where the crash for you is to a v21.2 branch and @argakiig has created a docker tag called nanocurrency/nano:v21.2-release-assert-patch. If you want to try it, then the next time you hit this there should be an output "Error: <error message>". It may help diagnose the issue. Otherwise you may want to try using the RocksDB backend when v22 comes out if you still have issues.

regfaker commented 3 years ago

Ok, thanks, I'll try with that image and bring here error message. I'm looking forward to v22.

regfaker commented 3 years ago

error message enhanced log:

Assertion (false) failed
void nano::block_store_partial<Val, Derived_Store>::unchecked_put(const nano::write_transaction&, const nano::unchecked_key&, const nano::unchecked_info&) [with Val = MDB_val; Derived_Store = nano::mdb_store]
/tmp/src/nano/secure/blockstore_partial.hpp:512
Error: MDB_PAGE_FULL: Internal error - page has no more space

 0# 0x00000000008C5313 in nano_node
 1# 0x00000000008C65D8 in nano_node
 2# 0x00000000007CC82B in nano_node
 3# 0x0000000000512CDB in nano_node
 4# 0x0000000000514D47 in nano_node
 5# 0x00000000005157E7 in nano_node
 6# 0x0000000000D02B00 in nano_node
 7# 0x00007FCDB7D496BA in /lib/x86_64-linux-gnu/libpthread.so.0
 8# clone in /lib/x86_64-linux-gnu/libc.so.6

There are 4G of RAM on host machine if this is mem space related.

wezrule commented 3 years ago

Ok thanks for testing it! This error actually looks related to this: https://bugs.openldap.org/show_bug.cgi?id=8969 It was fixed in LMDB 0.9.24, but v21 is on 0.9.23. We are upgrading to 0.9.25 in v22, will get it backported to a v21 branch and would be good to see if it solves your problem.

regfaker commented 3 years ago

Ok, I will follow after release. Does this happen only in case of full sync? I'm wondering why other running nodes does not have this problem. Is possible workaround in downloading ledger snapshot and run node with it?

wezrule commented 3 years ago

You can try the latest snapshot sure, but there will still be some bootstrapping done after, maybe you get lucky but could still hit it in the future. It seems to be a bug specific to the LMDB library and your hardware seems to trigger it more often. We will get a docker tag for v21 for you soon (might be early next week as it's new years weekend now), hopefully it resolves it.

wezrule commented 3 years ago

@regfaker Could you try nanocurrency/nano:v21.2-release-assert-patch-lmdb-0.9.25 (thanks @argakiig for making this)

regfaker commented 3 years ago

I'll try tomorrow, hopefully. But with used ledger snapshot so far so good - 28 hours without this error. So to simulate it I'll have to delete ledger and start over.

regfaker commented 3 years ago

Unfortunately it is still happening in LMDB 0.9.25. I've tried to clear ledger and after cca hour it fails with almost same error (blockstore_partial.hpp:200 instead of previous blockstore_partial.hpp:512).

Assertion (false) failed
void nano::block_store_partial<Val, Derived_Store>::unchecked_put(const nano::write_transaction&, const nano::unchecked_key&, const nano::unchecked_info&) [with Val = MDB_val; Derived_Store = nano::mdb_store]
/tmp/src/nano/secure/blockstore_partial.hpp:200
Error: MDB_PAGE_FULL: Internal error - page has no more space

nano_node-block_store_partial_unchecked_put-lmdb_0_9_25.log nano_node_crash_report.txt

I will switch to ledger snapshot again, that is working so far.

wezrule commented 3 years ago

Thanks for testing! So I found this Monero Issue where the LMDB creator increased the number of keys for the fine-grained splitpoint search to solve the same error. Maybe increasing it further solves the issue for you, @argakiig has made the docker tag v21.2-release-assert-patch-lmdb-0.9.25-40nkeys where we increase it to 40, if you get a chance would be to know if it fixes it.

regfaker commented 3 years ago

Still happening. Maybe more than 40? I don't know. nano_node-block_store_partial_unchecked_put-lmdb_0_9_25-40nkeys.log nano_node_crash_report.txt

regfaker commented 3 years ago

I'm just attaching another case of failed assertion from today:

[2021-Jan-27 09:51:21.955665]: Processed 22494 votes in 27495 milliseconds (rate of 1227 votes per second)
Assertion (success (status) || not_found (status)) failed
boost::optional<nano::db_val<Val> > nano::block_store_partial<Val, Derived_Store>::block_raw_get_by_type(const nano::transaction&, const nano::block_hash&, nano::block_type&) const [with Val = MDB_val; Derived_Store = nano::mdb_store]
/tmp/src/nano/secure/blockstore_partial.hpp:967

[2021-Jan-27 09:51:26.136522]: Processed 24237 votes in 11592 milliseconds (rate of 3136 votes per second)
 0# 0x00000000008C5313 in nano_node
 1# 0x00000000008C65D8 in nano_node
 2# 0x00000000007C558C in nano_node
 3# 0x00000000007C5ED4 in nano_node
 4# 0x00000000008592F8 in nano_node
 5# 0x0000000000856FD2 in nano_node
 6# 0x0000000000512168 in nano_node
 7# 0x0000000000514D47 in nano_node
 8# 0x00000000005157E7 in nano_node
 9# 0x0000000000D02AF0 in nano_node
10# 0x00007F28EF4766BA in /lib/x86_64-linux-gnu/libpthread.so.0
11# clone in /lib/x86_64-linux-gnu/libc.so.6
LunNova commented 3 years ago

Ran into this this morning.

2021-03-13T08:13:56.154848169Z services_nanonode.1.m6itzkyzhyjl@docker-1    | Assertion (success (status)) failed
2021-03-13T08:13:56.155216877Z services_nanonode.1.m6itzkyzhyjl@docker-1    | void nano::block_store_partial<Val, Derived_Store>::unchecked_put(const nano::write_transaction&, const nano::unchecked_key&, const nano::unchecked_info&) [with Val = MDB_val; Derived_Store = nano::mdb_store]
2021-03-13T08:13:56.155252587Z services_nanonode.1.m6itzkyzhyjl@docker-1    | /tmp/src/nano/secure/blockstore_partial.hpp:506
2021-03-13T08:13:56.155257517Z services_nanonode.1.m6itzkyzhyjl@docker-1    |
2021-03-13T08:13:56.159043076Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  0# 0x00000000008BED83 in nano_node
2021-03-13T08:13:56.159077486Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  1# 0x00000000008C0024 in nano_node
2021-03-13T08:13:56.159082096Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  2# 0x00000000007C8D55 in nano_node
2021-03-13T08:13:56.159085226Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  3# 0x0000000000512916 in nano_node
2021-03-13T08:13:56.159088356Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  4# 0x0000000000514BD7 in nano_node
2021-03-13T08:13:56.159091326Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  5# 0x0000000000515677 in nano_node
2021-03-13T08:13:56.159094096Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  6# 0x0000000000CFC510 in nano_node
2021-03-13T08:13:56.159096596Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  7# 0x00007F93435536BA in /lib/x86_64-linux-gnu/libpthread.so.0
2021-03-13T08:13:56.159099966Z services_nanonode.1.m6itzkyzhyjl@docker-1    |  8# clone in /lib/x86_64-linux-gnu/libc.so.6
LunNova commented 3 years ago

Hit another which is also in this thread, although since different assertion failure not sure is related.

2021-03-14T02:38:12.406693434Z services_nanonode.1.yc28iyb450tn@docker-1    | Assertion (success (status)) failed
2021-03-14T02:38:12.406750233Z services_nanonode.1.yc28iyb450tn@docker-1    | void nano::block_store_partial<Val, Derived_Store>::block_raw_put(const nano::write_transaction&, const std::vector<unsigned char>&, nano::block_type, const nano::block_hash&) [with Val = MDB_val; Derived_Store = nano::mdb_store]
2021-03-14T02:38:12.406766413Z services_nanonode.1.yc28iyb450tn@docker-1    | /tmp/src/nano/secure/blockstore_partial.hpp:445
zhyatt commented 3 years ago

@cryptocode Any chance you are familiar with any of these issue areas?

cryptocode commented 3 years ago

From the source code/docs:

MDB_PAGE_FULL - there is insufficient room in the page. This error
should never happen since all callers already calculate the
page's free space before calling this function.

@zhyatt Given the above, and how it was worked around in https://github.com/monero-project/monero/pull/5112#issuecomment-505481597, I would say this is almost certainly an lmdb bug. The same workaround was tried, but it still hit. I'll try to conjure up a way to repro this, and then see if further increases of nkeys is helpful.

LunNova commented 3 years ago

Thanks for looking at it cryptocode. As I saw this existing issue, I figured it wasn't a problem unique to my setup and was worth reporting it's still happening. Given the move to rocksdb soon and since few people are seeing it, not so sure it needs looking into? Unless more nodes are having trouble and just not posting about it here.

Updated to nanocurrency/nano:v21.3RC1 today.

cryptocode commented 3 years ago

Given the move to rocksdb soon and since few people are seeing it, not so sure it needs looking into?

@TransLunarInjection if there's a bug in lmdb - or in the use of it - then it's worth figuring out for a few reasons: I believe lmdb still going to be a backend, and some people may chose to use it + the wallet database is always lmdb + if it's a node-side bug, then it may affect rocksdb as well; such as something clobbering memory (unlikely given the error, but it needs to be checked out) + we'd be helping other projects using lmdb if a bug is found and fixed.

regfaker commented 3 years ago

Thanks for returning to this. So could be docker image (similarly to v21.2-release-assert-patch-lmdb-0.9.25-40nkeys but for 21.3) with more than 40 nkeys created to let us try if increase helps?