Closed tryphe closed 3 months ago
Just tried a full blockchain sync on a new SSD, with no luck. It is stuck at Height: 2501169/2585003 (96.8%) on mainnet.
I'll move the problematic VM to a newer Intel hypervisor to attempt to rule out any hardware issue. Will try to report back in a few days.
Thanks. While I can't imagine the CPU being an issue, it still helps us isolate the sync bug.
One more thing to test: Can you compile https://github.com/tevador/RandomX and then run the ./randomx-tests binary to see if all tests pass without issues?
for the record i just did a fresh sync on a 5900x on ubuntu 20 w/ 16gb ram.
Done, here is a clipped ./randomx-tests output:
[88] Hash test 2e (compiler) ... PASSED
[89] Cache initialization: SSSE3 ... SKIPPED
[90] Cache initialization: AVX2 ... SKIPPED
[91] Hash batch test ... PASSED
[92] Preserve rounding mode ... PASSED
All tests PASSED
2 tests were SKIPPED due to incompatible configuration (see above)
I've created a systemd timer to restart the monerod service every 30 minutes as a workaround. At least it syncs somehow (slowly).
How long did that RandomX test run?
Also try running monerod with --prep-blocks-threads 1
It took less than a minute. Will do.
Check for any options for long term tests. It's possible your CPU starts going wonky once hot.
Checked everything related to CPU (temp, freq) and it's OK. My blockchain synced to 100% in 3 days having "monerod" service restarted every 30 mins. Looks like a workaround for an issue.
Just a quick update
The VM on the AMD machine is synced but still exhibits the strange behavior, usually every few hours, though sometimes it takes longer (20-30 hours). The VM was cloned and the clone has been running fine for 10 days on an i7 machine. Both machines are on the same network, so it does not look network related.
I didn't pop any blocks on the i7 machine to see whether it could sync from farther back, though. I'm assuming the cause is the same whether synced or unsynced, because the behavior is identical.
I recently ran prime95 (random FFT) for a while on the AMD machine, so it's definitely stable. It never runs hotter than 30C. I ran the RandomX tests and everything does pass fine. It also runs other VMs and various software with no strange behavior. So I'm not really sure what's going on.
I think I have an old Athlon II X4 around here somewhere :D It's practically the same CPU on a slightly older socket and different chipset.
I updated the OP so it should be a bit more clear what the behavior is now.
Can you run RandomX tests in a loop for 60 hours or so and see if it passes 100% of the time? That would be the most similar test to running monerod for a long time.
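A long-running loop like the one suggested above could be sketched as follows. This is only a sketch: `run_repeatedly` and `success_marker` are hypothetical names, and it assumes a failed run either exits with a non-zero status or omits the "All tests PASSED" line seen in the clipped output earlier in the thread.

```python
import subprocess
import sys
import time


def run_repeatedly(cmd, duration_s, success_marker="All tests PASSED"):
    """Run cmd back to back until duration_s has elapsed.

    Returns (runs, failures); each failure is a tuple of
    (run number, exit code, tail of stdout).
    """
    deadline = time.monotonic() + duration_s
    runs = 0
    failures = []
    # Always perform at least one run, even for a zero duration.
    while runs == 0 or time.monotonic() < deadline:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        runs += 1
        # Assumption: a bad run exits non-zero or lacks the success line.
        if proc.returncode != 0 or success_marker not in proc.stdout:
            failures.append((runs, proc.returncode, proc.stdout[-500:]))
    return runs, failures
```

For the 60-hour test, something like `run_repeatedly(["./randomx-tests"], 60 * 3600)` from the RandomX build directory would do; any entry in `failures` would point toward intermittent hardware or hypervisor trouble rather than monerod itself.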
I just completed a sync on a laptop made in 2007 with an Intel(R) Core(TM)2 Duo CPU T8100 @ 2.10GHz, 2GB RAM and a 7200 rpm spinning HDD.
I ran into this issue too. I spent a few hours trying to debug it. Apparently, it crashes when trying to get the length of some string. I haven't found exactly which one, because strlen is called far too often for stepping through it in a debugger to be feasible. I also tried to compile from the master branch, but that failed with a bunch of C++ errors in multiple libraries (cross-compiling for 64-bit Windows from Ubuntu). But back to the topic.
The error shows up as a "FILE SYSTEM LIMITATION" in procmon.
@agowa338 when did this issue start showing up? could it be a corrupted database?
I doubt that it's a corrupt database. After all, it started syncing from nothing yesterday.
And it instantly crashes after startup?
Not instantly, but immediately after the "SYNCHRONIZATION started" message. I tried to run it with increased loglevel, but even with 4 I couldn't see anything related to this (or I missed it, it was a lot of log output).
I assume starting with --db-salvage doesn't help, but it's worth a try.
If it doesn't help, I would move or delete the blockchain (data.mdb) and try to sync from scratch again. The blockchain can get corrupted when there is a power loss or when an external hard drive gets disconnected during sync.
You can also add --db-sync-mode safe, which makes the sync process slower but helps against corruption. This usually isn't necessary, but Windows is a bit more error prone here.
That wasn't the case. The computer was still running. All that happened was that it crashed at that position during the initial sync, and now it crashes there again every time I restart. The disk is an NVMe drive that also holds the OS, so that is probably not the cause either (there are no disk errors in the event log). I can delete the database and restart the sync, but since this was basically the initial sync, I have my doubts that it'll work the second time...
What you posted shows that there is an error when adding something to the database. Now I'm not really sure what "FILE SYSTEM LIMITATION" means exactly, but if it starts showing up a second time we can be sure that it's not a corrupted database.
@hyc any idea about ^ ?
To clarify, with what you described (internal NVMe, no power loss) it's unlikely that your database is corrupted but it's the easiest first step to help us locate the issue.
"FILE SYSTEM LIMITATION" is the result of the API call to SetEndOfFile https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-setendoffile which sets the physical size of a file (There is still about 156 GB free on that 1TB drive, so it's not an out of space issue either).
Ok, going to delete the database now. But it'll take about a day to sync. So that's why I didn't necessarily want to do it now while debugging.
And do you know how to get an all threads backtrace on Windows? I'm unfortunately only familiar with gdb / lldb.
Only in visual studio. Haven't done it in the x64dbg or windbg yet.
Here you go, the horizontal lines separate threads:
This isn't readable unfortunately, seems to be either a stripped binary or some Windows specific thing.
Those are the exact positions the instruction pointer was pointing to at that moment. Sadly it doesn't show the function name. For that I have to click on each individual one and look a few lines up in the assembly...
Apparently, that was my fault: I hadn't downloaded the symbols within x64dbg. This already looks a bit better. However, I don't have a PDB file for monerod (it's not in the downloadable zip), so it cannot resolve those frames.
No idea how strlen() has anything to do with filesystem limitations. What is the size of the data.mdb file, and what kind of filesystem is in use on that partition?
Filesystem: NTFS
data.mdb size: 113 GB (122.395.095.040 bytes)
The strlen() is called once I hit "step over", so I suppose it's the exception handler within monerod (hard to tell without the PDB file).
Looking at the asm code of strlen, we can see why it crashes. Somehow it ends up with a null pointer in RAX and tries to read from it, and reading from a null pointer throws an access violation exception...
I deleted the database and let it resync again (this time with --db-sync-mode safe) and it worked.
I don't know why it failed at the first initial sync. But the 2nd initial sync worked.
Closing as there were no other reports about this in a while and I suspect the issue was hardware related.
Describe the problem
Behavior: At some point after starting monerod, ranging from a few minutes to many hours, syncing "gets stuck". Whether synced or unsynced, blocks stop being added to the chain until monerod is restarted. The monerod process continues running and is responsive, however, and exits normally if given HUP etc.
The behavior appears for me repeatedly, but only from roughly block 2450000 onward. It does not happen at any specific block, just at random heights, and never before ~2450000. If monerod is already synced, it takes longer for the behavior to appear, sometimes several hours longer.
There are plenty of connections and new connections being made during the behavior. Monerod gives me two messages like this every few minutes, gives the correct number of total blocks, but is still stuck at the same block where the behavior appeared:
Workaround: If I create a cron script to restart monerod every ~15 minutes, it eventually syncs.
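As a possible refinement of the timed restart, a watchdog could restart monerod only when the height has actually stopped moving. The sketch below rests on several assumptions: monerod's RPC is reachable at 127.0.0.1:18081 and answers `/get_height` with a JSON object containing `height`, the daemon runs as a systemd unit named `monerod` (a hypothetical name), and 15 minutes without a new block counts as stuck. `StallDetector`, `get_height` and `watch` are illustrative names. Like the cron job, this only masks the problem and does not explain it.

```python
import json
import subprocess
import time
import urllib.request


class StallDetector:
    """Tracks the last time the reported height changed."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_height = None
        self.last_change = None

    def update(self, height, now):
        """Record an observation; return True once height has been
        unchanged for at least timeout_s seconds."""
        if height != self.last_height:
            self.last_height = height
            self.last_change = now
            return False
        return now - self.last_change >= self.timeout_s


def get_height(url="http://127.0.0.1:18081/get_height"):
    # Assumption: local monerod RPC on the default mainnet port.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["height"]


def watch(poll_s=60, timeout_s=900,
          restart_cmd=("systemctl", "restart", "monerod")):
    """Poll the height forever and restart the service when it stalls."""
    detector = StallDetector(timeout_s)
    while True:
        try:
            height = get_height()
        except (OSError, ValueError, KeyError):
            height = None  # RPC unreachable or malformed reply; skip this poll
        if height is not None and detector.update(height, time.monotonic()):
            subprocess.run(list(restart_cmd), check=False)
            detector = StallDetector(timeout_s)  # fresh observation window
        time.sleep(poll_s)
```

Calling `watch()` from a small service of its own would replace the fixed-interval restart. The stall logic is kept in a separate class so it can be checked without a running daemon.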
A few things about my node and what I've tried:
- I've been running monero for over 5 years with no syncing issues, until about a month ago.
- Minimal setup: Debian 10 VM with ntpd and monerod (Update: tried with a new Debian 11 machine and the same behavior occurs).
- Syncing works fine until about 95%, or near block 2450000.
- Removed the entire .bitmonero directory and synced from scratch several times with the same result.
- Removed my config and am running with no arguments to reduce complexity, but get the same result.
- Running on clearnet.
- Using the blocklist at https://gui.xmr.pm/files/block.txt, but I tried without the blocklist and get the same result.
- Currently using outbound connections only while syncing, to avoid the possibility of an attack.
- Tried running with things like --out-peers 25 or --out-peers 50 but get the same result.
- Running the latest master branch 9aab19f349433687c7aaf2c1cbc5751e5912c0aa, but also tried with signed releases, which experience the same behavior but do NOT show errors in the log.
Update: I originally thought the syncing problem was related to these errors, but this is not the case. I will leave this section here for reference.
Some errors/warnings that frequently occur:
There's also a loop that produces hundreds of these at once for the same peer:
Here's a short run: https://gist.githubusercontent.com/tryphe/0a09526e0c9aa1e6154b4e2b0c6e2e13/raw/a56cca8895370d6e91a27b5f4a4b887d4e064d2d/gistfile1.txt (Note: I gave the hangup signal near 03:16:52.049 where it shuts down cleanly)
Any feedback is welcome and appreciated. Let me know if you need more information or if I can do anything to make debugging easier. Thank you.