Syncing gets stuck around block >~2450000

tryphe commented 2 years ago

Describe the problem

Behavior: At some point after starting monerod, ranging from a few minutes to many hours, syncing "gets stuck". Whether synced or unsynced, blocks stop being added to the chain until monerod is restarted. The monerod process continues running and is responsive, however, and exits normally if given HUP etc.

The behavior appears for me repeatedly, starting around block ~2450000: not at any specific block, but never before ~2450000. If monerod is already synced, it takes longer for the behavior to appear, sometimes several hours longer.

There are plenty of connections and new connections being made during the behavior. Monerod gives me two messages like this every few minutes, gives the correct number of total blocks, but is still stuck at the same block where the behavior appeared:

[random.ipv4.address:18080 OUT] Sync data returned a new top block candidate: 2459334 -> 2568414 [Your node is 109080 blocks (5.0 months) behind] 
SYNCHRONIZATION started
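
As a sanity check on that message, the gap it reports can be recomputed from the two heights. The following is a minimal Python sketch, not part of monerod. The regular expression is written against the single log line above, and the 2-minute block time and 30.44-day month are assumptions used only to reproduce the "months behind" estimate; monerod may compute that figure differently.

```python
import re

# Pattern written against the one log line quoted above; other monerod
# versions may word this message differently (assumption).
SYNC_LINE = re.compile(
    r"new top block candidate: (?P<current>\d+) -> (?P<target>\d+)"
)

BLOCK_TIME_MINUTES = 2    # assumed Monero target block time
DAYS_PER_MONTH = 30.44    # assumed average month length


def parse_sync_line(line):
    """Return (current_height, target_height), or None if the line does not match."""
    match = SYNC_LINE.search(line)
    if match is None:
        return None
    return int(match.group("current")), int(match.group("target"))


def months_behind(blocks):
    """Rough time estimate for a gap of `blocks` blocks."""
    days = blocks * BLOCK_TIME_MINUTES / (60 * 24)
    return days / DAYS_PER_MONTH


line = ("[random.ipv4.address:18080 OUT] Sync data returned a new top block "
        "candidate: 2459334 -> 2568414 [Your node is 109080 blocks "
        "(5.0 months) behind]")
current, target = parse_sync_line(line)
gap = target - current
print(gap, round(months_behind(gap), 1))
```

The recomputed gap matches the 109080 blocks in the message, so the node's view of the network height is consistent; only the local height fails to advance.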

Workaround: If I create a cron script to restart monerod every ~15 minutes, it eventually syncs.
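
A blind restart on a timer could be narrowed to restarting only when the height has actually stopped moving. The sketch below is a hypothetical watchdog, not something from the monerod codebase. It assumes the daemon RPC is reachable at http://127.0.0.1:18081/json_rpc and answers the get_info method, and that monerod runs as a systemd unit named monerod. The poll interval and the 30-minute stall window are arbitrary choices, picked to be well above normal gaps between blocks.

```python
import json
import subprocess
import time
import urllib.request

RPC_URL = "http://127.0.0.1:18081/json_rpc"         # assumed RPC address
RESTART_CMD = ["systemctl", "restart", "monerod"]   # assumed unit name
POLL_SECONDS = 60                                   # arbitrary choice
STALL_SECONDS = 30 * 60                             # arbitrary choice


def get_height(url=RPC_URL):
    """Ask monerod for its current chain height via get_info."""
    payload = json.dumps({"jsonrpc": "2.0", "id": "0", "method": "get_info"})
    request = urllib.request.Request(
        url, data=payload.encode(), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["result"]["height"]


def is_stalled(samples, stall_seconds):
    """samples: list of (timestamp, height) tuples, oldest first.

    Stalled means the samples span at least `stall_seconds` and the
    height is identical in every one of them.
    """
    if len(samples) < 2:
        return False
    first_time, first_height = samples[0]
    last_time, _ = samples[-1]
    unchanged = all(height == first_height for _, height in samples)
    return unchanged and last_time - first_time >= stall_seconds


def watch():
    """Poll forever; restart monerod whenever the height stalls."""
    samples = []
    while True:
        try:
            height = get_height()
        except (OSError, KeyError, ValueError):
            time.sleep(POLL_SECONDS)      # daemon down or busy; retry later
            continue
        if samples and height != samples[-1][1]:
            samples = []                  # progress was made; reset the window
        samples.append((time.monotonic(), height))
        if is_stalled(samples, STALL_SECONDS):
            subprocess.run(RESTART_CMD, check=False)
            samples = []
        time.sleep(POLL_SECONDS)
```

Calling watch() from a small script run as root (or with permission to restart the unit) would replace the cron job. Like the cron job, it only works around the stall and does not explain it.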

A few things about my node and what I've tried:

I've been running monero for over 5 years with no syncing issues, until about a month ago.
Minimal setup Debian 10 VM with ntpd and monerod (Update: tried with a new Debian 11 machine and the same behavior occurs).
Syncing works fine until about 95%, or near block 2450000.
Removed the entire .bitmonero directory and synced from scratch several times with the same result.
Removed my config and am running with no arguments to reduce complexity, but get the same result.
Running on clearnet.
Using the blocklist at https://gui.xmr.pm/files/block.txt, but I tried without the blocklist and get the same result.
Using outbound connections only to try and sync currently to avoid possibility of attack.
Tried running with things like --out-peers 25 or --out-peers 50 but get the same result.
Running the latest master branch 9aab19f349433687c7aaf2c1cbc5751e5912c0aa but also tried with signed releases:

0.17.3.0
0.17.2.0
0.17.1.9
0.17.1.7
0.17.1.5
0.17.1.3

which experience the same behavior but do NOT show errors in the log.

Update: I originally thought the syncing problem was related to these errors, but this is not the case. I'll leave this section here for reference.

Some errors/warnings that frequently occur:

ERROR   net     contrib/epee/include/net/abstract_tcp_server2.inl:777   Setting timer on a shut down object

ERROR   net.cn  src/cryptonote_protocol/cryptonote_protocol_handler.inl:1226    [3.230.147.95:18080 OUT] sent wrong NOTIFY_RESPONSE_GET_OBJECTS: block with id=cffef4cadcddc0a72007c8c07e1a5cce4bcd41b2e7fa1013aa70de02b96b1208 wasn't requested, dropping connection 

WARNING net.p2p src/p2p/net_node.inl:1173       [209.222.252.101:18180 OUT] COMMAND_HANDSHAKE invoke failed. (-4, LEVIN_ERROR_CONNECTION_TIMEDOUT)

There's also a loop that produces hundreds of these at once for the same peer:

2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: stopping adding blocks in state synchronizing
2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: resuming in state synchronizing
2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: will try to add blocks next in state synchronizing
2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: adding blocks in state synchronizing
2022-02-27 03:16:49.626 I [45.154.254.133:18080 OUT]  parent was requested, we'll get back to it
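
To see which peers trigger this loop, the repeated lines can be tallied per peer. The sketch below is a rough Python helper, not a monerod tool. Its regular expression is based only on the five lines above, so the log layout it expects is an assumption.

```python
import re
from collections import Counter

# Matches "[ip:port OUT]" as seen in the excerpt. The INC tag for inbound
# peers is an assumption; only OUT appears in the lines above.
PEER = re.compile(r"\[(?P<peer>\S+) (?:OUT|INC)\]")


def count_state_lines(lines):
    """Count 'state:' log lines per peer address."""
    counts = Counter()
    for line in lines:
        if "state:" not in line:
            continue
        match = PEER.search(line)
        if match:
            counts[match.group("peer")] += 1
    return counts


sample = [
    "2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: stopping adding blocks in state synchronizing",
    "2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: resuming in state synchronizing",
    "2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: will try to add blocks next in state synchronizing",
    "2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: adding blocks in state synchronizing",
    "2022-02-27 03:16:49.626 I [45.154.254.133:18080 OUT]  parent was requested, we'll get back to it",
]
counts = count_state_lines(sample)
for peer, total in counts.most_common():
    print(peer, total)
```

Run over a full log file (for example, by passing an open file object as `lines`), the peers at the top of the list are the ones caught in the loop.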

Here's a short run: https://gist.githubusercontent.com/tryphe/0a09526e0c9aa1e6154b4e2b0c6e2e13/raw/a56cca8895370d6e91a27b5f4a4b887d4e064d2d/gistfile1.txt (Note: I gave the hangup signal near 03:16:52.049 where it shuts down cleanly)

Any feedback is welcome and appreciated. Let me know if you need more information or if I can do anything to make debugging easier. Thank you.

Regloom commented 2 years ago

Just tried to perform full blockchain sync on a new SSD drive with no luck. Stuck on Height: 2501169/2585003 (96.8%) on mainnet.

selsta commented 2 years ago

I'll move the problematic VM to a newer Intel hypervisor to attempt to rule out any hardware issue. Will try to report back in a few days.

Thanks. While I can't imagine the CPU being an issue, it still helps us isolate the sync bug.

selsta commented 2 years ago

One more thing to test: Can you compile https://github.com/tevador/RandomX and then run the ./randomx-tests binary to see if all tests pass without issues?

Gingeropolous commented 2 years ago

for the record i just did a fresh sync on a 5900x on ubuntu 20 w/ 16gb ram.

Regloom commented 2 years ago

One more thing to test: Can you compile https://github.com/tevador/RandomX and then run the ./randomx-tests binary to see if all tests pass without issues?

Done, here is a clipped ./randomx-tests output:

[88] Hash test 2e (compiler) ... PASSED
[89] Cache initialization: SSSE3 ... SKIPPED
[90] Cache initialization: AVX2 ... SKIPPED
[91] Hash batch test ... PASSED
[92] Preserve rounding mode ... PASSED

All tests PASSED
2 tests were SKIPPED due to incompatible configuration (see above)

I've created a systemd timer to restart the monerod service every 30 minutes as a workaround. At least it syncs that way, albeit slowly.

moneromooo-monero commented 2 years ago

How long did that RandomX test last?

Also try running monerod with --prep-blocks-threads 1

Regloom commented 2 years ago

How long did that RandomX test last?

Also try running monerod with --prep-blocks-threads 1

It took less than a minute. Will do.

moneromooo-monero commented 2 years ago

Check for any options for long term tests. It's possible your CPU starts going wonky once hot.

Regloom commented 2 years ago

Check for any options for long term tests. It's possible your CPU starts going wonky once hot.

Checked everything related to CPU (temp, freq) and it's OK. My blockchain synced to 100% in 3 days having "monerod" service restarted every 30 mins. Looks like a workaround for an issue.

tryphe commented 2 years ago

Just a quick update

The VM on the AMD machine is synced but still exhibits the strange behavior usually every few hours, but sometimes takes longer, 20-30 hours. The VM was cloned and has been running fine for 10 days on an i7 machine. Same network for both machines, so it looks like it's not network related.

I didn't pop any blocks to see whether it could sync from farther back on the i7 machine, though. I'm assuming the cause is the same whether the node is synced or unsynced, because the behavior is identical.

I recently ran prime95 (random FFT) for a while on the AMD machine, so it's definitely stable. It never runs hotter than 30C. I ran the RandomX tests and everything does pass fine. It also runs other VMs and various software with no strange behavior. So I'm not really sure what's going on.

I think I have an old Athlon II X4 around here somewhere :D It's practically the same CPU on a slightly older socket and different chipset.

tryphe commented 2 years ago

I updated the OP so it should be a bit more clear what the behavior is now.

selsta commented 2 years ago

Can you run RandomX tests in a loop for 60 hours or so and see if it passes 100% of the time? That would be the most similar test to running monerod for a long time.
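
One way to run such a soak test is a small driver that reruns the test binary until a deadline and records every failing run. This is a sketch under assumptions: it treats a run as passing only if the process exits with status 0 and prints "All tests PASSED" (the summary line quoted earlier in this thread), and it expects to be given the path to the randomx-tests binary in the RandomX build directory.

```python
import subprocess
import time


def run_once(command):
    """Run the test binary once; return (passed, combined_output)."""
    proc = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    # Assumption: success means exit status 0 plus the summary line.
    passed = proc.returncode == 0 and "All tests PASSED" in proc.stdout
    return passed, proc.stdout


def soak(command, duration_seconds):
    """Rerun `command` until the deadline; return (runs, failures)."""
    deadline = time.monotonic() + duration_seconds
    runs = 0
    failures = []
    while time.monotonic() < deadline:
        passed, output = run_once(command)
        runs += 1
        if not passed:
            failures.append((runs, output))   # keep output for inspection
    return runs, failures
```

For the 60-hour run, something like `runs, failures = soak(["./randomx-tests"], 60 * 3600)` followed by printing `runs` and `len(failures)` would do. Any entry in `failures` points at unstable hardware rather than at monerod.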

Gingeropolous commented 2 years ago

I just completed a sync on a laptop made in 2007 with an Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz with 2GB RAM and a 7200 rpm spinny HDD

agowa commented 2 years ago

I ran into this issue too. I spent a few hours trying to debug it. Apparently, it crashes when trying to get the length of some string. I haven't found which one exactly, because strlen is called far too often for stepping through it in a debugger to be feasible. Also, I tried to compile from the master branch, but that failed with a bunch of C++ errors in multiple libraries (cross-compiling for Windows 64-bit from Ubuntu). But back to the topic.

The error shows up as a "FILE SYSTEM LIMITATION" in procmon.

selsta commented 2 years ago

@agowa338 when did this issue start showing up? could it be a corrupted database?

agowa commented 2 years ago

I doubt that it's a corrupt database. After all, it started syncing from nothing yesterday.

selsta commented 2 years ago

And it instantly crashes after startup?

agowa commented 2 years ago

Not instantly, but immediately after the "SYNCHRONIZATION started" message. I tried to run it with increased loglevel, but even with 4 I couldn't see anything related to this (or I missed it, it was a lot of log output).

selsta commented 2 years ago

I assume starting with --db-salvage doesn't help, but it's worth a try.

If it doesn't help I would move / delete the blockchain (data.mdb) and try to sync from scratch again. The blockchain can get corrupted when there is a power loss or when an external hard drive gets disconnected during sync.

You can also add --db-sync-mode safe which makes the sync process slower but it helps against corruption. This isn't necessary usually but Windows is a bit more error prone here.

agowa commented 2 years ago

If it doesn't help I would move / delete the blockchain (data.mdb) and try to sync from scratch again. The blockchain can get corrupted when there is a power loss or when an external hard drive gets disconnected during sync.

That wasn't the case. The computer was still running. All that happened was that it crashed at that position during the initial sync, and now it crashes there again every time I restart. The disk is the NVMe drive the OS is also installed on, so it is probably not that either (there are no disk errors in the event log). I can delete the database and restart the sync, but as this was basically the initial sync, I have my doubts that it'll work the second time...

selsta commented 2 years ago

What you posted shows that there is an error when adding something to the database. Now I'm not really sure what "FILE SYSTEM LIMITATION" means exactly, but if it starts showing up a second time we can be sure that it's not a corrupted database.

@hyc any idea about ^ ?

selsta commented 2 years ago

To clarify, with what you described (internal NVMe, no power loss) it's unlikely that your database is corrupted but it's the easiest first step to help us locate the issue.

agowa commented 2 years ago

"FILE SYSTEM LIMITATION" is the result of the API call to SetEndOfFile https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setendoffile which sets the physical size of a file (There is still about 156 GB free on that 1TB drive, so it's not an out of space issue either).

To clarify, with what you described (internal NVMe, no power loss) it's unlikely that your database is corrupted but it's the easiest first step to help us locate the issue.

Ok, going to delete the database now. But it'll take about a day to sync. So that's why I didn't necessarily want to do it now while debugging.

selsta commented 2 years ago

And do you know how to get an all threads backtrace on Windows? I'm unfortunately only familiar with gdb / lldb.

agowa commented 2 years ago

Only in Visual Studio. I haven't done it in x64dbg or WinDbg yet.

agowa commented 2 years ago

Here you go, the horizontal lines separate threads:

selsta commented 2 years ago

This isn't readable unfortunately, seems to be either a stripped binary or some Windows specific thing.

agowa commented 2 years ago

Those are the exact positions the instruction pointer was pointing to. Sadly it doesn't show the function names. For that I have to click on each individual one and look a few lines up in the assembly...

agowa commented 2 years ago

Apparently, that was my fault. I didn't download the symbols within x64dbg. This already looks a bit better (but I don't have a PDB file for monerod (it's not within the downloadable zip), so it cannot resolve these)

hyc commented 2 years ago

What you posted shows that there is an error when adding something to the database. Now I'm not really sure what "FILE SYSTEM LIMITATION" means exactly, but if it starts showing up a second time we can be sure that it's not a corrupted database.

@hyc any idea about ^ ?

No idea how strlen() has anything to do with filesystem limitations. What is the size of the data.mdb file, and what kind of filesystem is in use on that partition?

agowa commented 2 years ago

Filesystem: NTFS
data.mdb size: 113 GB (122.395.095.040 bytes)

the strlen() is called once I hit "step over", so it's probably the exception handler within monerod I suppose (hard to tell without the PDB file).

agowa commented 2 years ago

And when looking at the asm code of strlen we can see why it crashes. Somehow it ends up with a null pointer in RAX and tries to read from it. And reading from a null pointer apparently throws an access violation exception...

agowa commented 2 years ago

I deleted the database and let it resync again (this time with --db-sync-mode safe) and it worked. I don't know why it failed at the first initial sync. But the 2nd initial sync worked.

selsta commented 3 months ago

Closing as there were no other reports about this in a while and I suspect the issue was hardware related.

monero-project / monero

Syncing gets stuck around block >~2450000 #8194