monero-project / monero

Monero: the secure, private, untraceable cryptocurrency
https://getmonero.org
Other
8.92k stars 3.1k forks source link

Syncing gets stuck around block >~2450000 #8194

Closed tryphe closed 2 months ago

tryphe commented 2 years ago

Describe the problem

Behavior: At some point after starting monerod, ranging from a few minutes to many hours, syncing "gets stuck". Whether synced or unsynced, blocks stop being added to the chain until monerod is restarted. The monerod process continues running and is responsive, however, and exits normally if given HUP etc.

The behavior repeatedly appears for me, in particular, starting around block >~2450000. Not any specific block, just random, but not at any point before ~2450000. If monero is already synced, it takes longer for the behavior to appear, sometimes several hours longer.

There are plenty of connections and new connections being made during the behavior. Monerod gives me two messages like this every few minutes, gives the correct number of total blocks, but is still stuck at the same block where the behavior appeared:

[random.ipv4.address:18080 OUT] Sync data returned a new top block candidate: 2459334 -> 2568414 [Your node is 109080 blocks (5.0 months) behind] 
SYNCHRONIZATION started 

Workaround: If I create a cron script to restart monerod every ~15 minutes, it eventually syncs.

A few things about my node and what I've tried:

I've been running monero for over 5 years with no syncing issues, until about a month ago. Minimal setup Debian 10 VM with ntpd and monerod (Update: tried with a new Debian 11 machine and the same behavior occurs). Syncing works fine until about 95%, or near block 2450000. Removed the entire .bitmonero directory and synced from scratch several times with the same result. Removed my config and am running with no arguments to reduce complexity, but get the same result. Running on clearnet. Using the blocklist at https://gui.xmr.pm/files/block.txt, but I tried without the blocklist and get the same result. Using outbound connections only to try and sync currently to avoid possibility of attack. Tried running with things like --out-peers 25 or --out-peers 50 but get the same result. Running the latest master branch 9aab19f349433687c7aaf2c1cbc5751e5912c0aa but also tried with signed releases:

0.17.3.0
0.17.2.0
0.17.1.9
0.17.1.7
0.17.1.5
0.17.1.3

which experience the same behavior but do NOT show errors in the log.

Update: I originally thought the syncing problem was related to these errors, but this is not the case. Will leave this section here

Some errors/warnings that frequently occur:

ERROR   net     contrib/epee/include/net/abstract_tcp_server2.inl:777   Setting timer on a shut down object

ERROR   net.cn  src/cryptonote_protocol/cryptonote_protocol_handler.inl:1226    [3.230.147.95:18080 OUT] sent wrong NOTIFY_RESPONSE_GET_OBJECTS: block with id=cffef4cadcddc0a72007c8c07e1a5cce4bcd41b2e7fa1013aa70de02b96b1208 wasn't requested, dropping connection 

WARNING net.p2p src/p2p/net_node.inl:1173       [209.222.252.101:18180 OUT] COMMAND_HANDSHAKE invoke failed. (-4, LEVIN_ERROR_CONNECTION_TIMEDOUT)

There's also a loop that produces hundreds of these at once for the same peer:

2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: stopping adding blocks in state synchronizing
2022-02-27 03:16:49.624 I [45.154.254.133:18080 OUT] [0] state: resuming in state synchronizing
2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: will try to add blocks next in state synchronizing
2022-02-27 03:16:49.625 I [45.154.254.133:18080 OUT] [0] state: adding blocks in state synchronizing
2022-02-27 03:16:49.626 I [45.154.254.133:18080 OUT]  parent was requested, we'll get back to it

Here's a short run: https://gist.githubusercontent.com/tryphe/0a09526e0c9aa1e6154b4e2b0c6e2e13/raw/a56cca8895370d6e91a27b5f4a4b887d4e064d2d/gistfile1.txt (Note: I gave the hangup signal near 03:16:52.049 where it shuts down cleanly)

Any feedback is welcome and appreciated. Let me know if you need more information or if I can do anything to make debugging easier. Thank you.

selsta commented 2 years ago

Syncing works fine until about 95%, or near block 2450000.

What exactly happens then? Does it stay stuck on the same block and stop syncing?

Does it start to ban other peers?

tryphe commented 2 years ago

Syncing works fine until about 95%, or near block 2450000.

What exactly happens then? Does it start to ban other peers?

The errors and warnings stated above happen constantly (see gist for details) until I give the hangup signal around 03:16:52.049. No blocks get added to the chain. It doesn't matter if I wait an hour, or 5 minutes, the log looks the same, so I gave a short log.

Indeed, peers are banned.

tryphe commented 2 years ago

One thing I didn't mention, if I continually restart the daemon, it seems to sync a few chunks of 20 blocks upon start, but it only works for a minute or two until getting stuck like in the log.

It's obvious to me that there are some malicious nodes out there just handing me junk blocks. But could it really be an eclipse attack? It seems to me like adding more peers would reduce the chance quite a bit, but I'm not sure the problem here is just that. It seems like maybe a combination of a bug and malicious peers, but that's just speculation.

selsta commented 2 years ago

Which log level is the log you posted?

It's obvious to me that there are some malicious nodes out there just handing me junk blocks. But could it really be an eclipse attack?

You appear to have this issue for a month and no one else reported something similar so I don't think so, otherwise there would be more reports about similar issues.

It's probably either an issue with your setup or a bug that gets triggered in your setup, but looking at the logs I don't see anything obvious yet.

tryphe commented 2 years ago

Which log level is the log you posted?

It's obvious to me that there are some malicious nodes out there just handing me junk blocks. But could it really be an eclipse attack?

You appear to have this issue for a month and no one else reported something similar so I don't think so, otherwise there would be more reports about similar issues.

It's probably either an issue with your setup or a bug that gets triggered in your setup, but looking at the logs I don't see anything obvious yet.

Log level 1. Let me know if you want a different log level.

selsta commented 2 years ago

Can you post log level 2?

tryphe commented 2 years ago

Thanks for your quick feedback. I use the same VM image for several coin daemons, and only Monero has syncing problems, but it's the only cryptonote coin I run. I can't really imagine a more generic setup, but after searching around, you're definitely right that no one seems to have this problem.

I'll make a clean VM and see how that does.

Log level 2: https://gist.githubusercontent.com/tryphe/2436c1e1a5ea21d7d3c678bd5567e396/raw/69d6a726a8fe45378bc0e6869cab1a5bbe11cceb/gistfile1.txt

tryphe commented 2 years ago

I did a quick dist-upgrade to Debian 11 without rebuilding Monero and the problem seems to have resolved itself. Really strange. No problems now. Why it happens at 95% synced is beyond me. It doesn't happen at earlier blocks.

It seems that running Debian 10 and other non-new distros should be advised against. Debian 10 is "oldstable" which is still supported, so that's mildly concerning that people are probably using it still.

I'll post some new logs so they can be compared against the ones that failed.

tryphe commented 2 years ago

Log level 2 after the problem was resolved by upgrading to Debian 11: https://gist.githubusercontent.com/tryphe/fac92713a0fab5aa6a75c6ae40905573/raw/c3262979e8e0f299b290bde46ac096c48cd6e144/gistfile1.txt

Thanks again for your feedback @selsta!

I can't spot anything in the log that's different. The warnings and errors look the same. It must be related to some buggy lower level library running on some Linux distributions.

Hopefully this might help someone that has a similar setup but this does not necessary qualify as a Monero issue. Maybe a check for some older library could be enforced, although frankly I have no idea which library it could be :stuck_out_tongue:

selsta commented 2 years ago

Does everything still work? It's unlikely that your OS version is related here.

tryphe commented 2 years ago

@selsta yes. Ran dist-upgrade, did nothing else, and it's been synced ever since. I upgraded from Buster (Debian 10) which was up to date via apt, so there's a problem there.

tryphe commented 2 years ago

It started doing it again today and now the node is behind. Will try and make a new machine. And again, none of my other coin daemons have this problem, so it's a monero problem. But it doesn't really make sense that i was able to sync ~100k blocks fine at a point where it was stuck, and then nothing.

selsta commented 2 years ago
19:02 <+moneromooo> The first few blocks that are needed are requested from a peer that doesn't send them.
19:02 <+moneromooo> The timeout seems to be 25 seconds here.
19:04 <+moneromooo> The blocks then get requested from another peer, but that peer gets dropped after it sends a reply to another message. Possible state machine bug here.
19:31 <+moneromooo> There are two instances of possible large delays in networking. So, who knows.

That's what moneromooo saw from the logs. Now it still doesn't explain why there aren't more reports about this. I guess testing a different machine would be helpful.

selsta commented 2 years ago
23:13 <hocuspocusoutoff> so... possibly not related at all since user's details dont match mine, but i had similar symptoms when i tried my first sync. i am hosting the blockchain on a NAS while daemon runs on my server. the NAS had some goofy-ass software that "indexed" things every time a change was made on disk. this basically caused the sync to grind to a halt somewhere around the same block height user is mentioning. once i was able to turn that
23:13 <hocuspocusoutoff> nonsense off, sync finished cleanly

Not sure if this is helpful for your setup but I thought I'd add it too.

snex commented 2 years ago

Hi @selsta , I had a similar issue when trying to host the blockchain on my NAS. Turned out the problem was the NAS had some hidden service that "reindexed" its files every time ANY change was made on disk. This would lead monerod and this index service to compete for access to the file, which would grind the whole thing to a halt. Once I removed this NAS service, I was able to full sync from 0 in about 1.5 days total, with zero issues.

Could you possibly also have a similar service interfering with your blockchain file?

tryphe commented 2 years ago

I was able to reproduce the bug with a new KVM instance with Debian 11 amd64 (vm software is up to date, host is up to date). Host is running other coin daemons with a similar setup. I tried to get a setup that is most similar to the one producing the bug.

Steps I took:

  1. Install fresh Debian 11 netinst amd64. Ext4 partition, auto format, default options. Unchecked all debian features except xfce (Edit: after trying without xfce below, the bug seems unrelated to X, so choose which you prefer).
  2. Configure static ip via /etc/network/interfaces
  3. Remove crap: apt remove network-manager network-manager-gnome light-locker xiccd pipewire colord system-config-printer gvfs gpg-agent tumbler at-spi2-core geoclue-2.0 policykit-1-gnome
  4. apt autoremove
  5. Reboot and make sure no extra crap is running besides basic system processes and lightdm child processes.
  6. Build Monero from source via instructions from README. In the OP I built via docker. This time I built without docker and still got the bug.
  7. Execute ./monerod and let it run. It will get stuck near block 2400000 to 2500000.

$ gcc -v: gcc version 10.2.1 20210110 (Debian 10.2.1-6)

$ uname -a: Linux debian11 5.10.0-11-amd64 #1 SMP Debian 5.10.92-1 (2022-01-18) x86_64 GNU/Linux

$ ps -aux (besides kernel processes) :

/sbin/init
/lib/systemd/systemd-journald
/lib/systemd/systemd-udevd
/usr/sbin/cron
/usr/bin/dbus-daemon
/usr/sbin/rsyslogd
/lib/systemd/systemd-logind
/usr/libexec/udisks2/udisksd
/usr/libexec/polkitd
/usr/sbin/lightdm
/usr/lib/xorg/Xorg
/sbin/agetty
/lib/systemd/systemd
(sd-pam)
xfce4-session
/usr/bin/dbus-daemon
/usr/bin/ssh-agent
/usr/lib/x86_64-linux-gnu/xfce4/xfconf/xfconfd
xfwm4
xfsettingsd
xfce4-panel
Thunar --daemon
xfdesktop
/usr/lib/x86_64-linux-gnu/xfce4/notifyd/xfce4-notifyd
/usr/lib/x86_64-linux-gnu/xfce4/panel/wrapper-2.0 /usr/lib/x86_64-linux-gnu/xfce4/panel/plugins/libsystray.so
/usr/lib/x86_64-linux-gnu/xfce4/panel/wrapper-2.0 /usr/lib/x86_64-linux-gnu/xfce4/panel/plugins/libnotification-plugin.so
/usr/lib/x86_64-linux-gnu/xfce4/panel/wrapper-2.0 /usr/lib/x86_64-linux-gnu/xfce4/panel/plugins/libactions.so
xterm
bash
./monerod
/usr/libexec/dconf-service

It got all the way up to and stuck at block ~2508000 for hours, with many peers and lots of network traffic. Restarted the daemon, and it's syncing okay again for a few minutes. Same issue as in OP. We'll see how long until it gets stuck again.

I'll try to see if I can reproduce without x running (sudo service lightdm stop and switch to another tty)

selsta commented 2 years ago

Is the blockchain stored inside the VM or on an external network drive?

tryphe commented 2 years ago

Is the blockchain stored inside the VM or on an external network drive?

Could you possibly also have a similar service interfering with your blockchain file?

The blockchain is always stored locally and no processes are running that seem like they could interfere. Most processes have a fraction of a second of total cpu time used.

tryphe commented 2 years ago

The steps taken above were an effort to try and reproduce the bug on a different VM but on an identical setup. The problem still occurred.

I've taken some additional steps to reduce the complexity and was able to reproduce the bug without X or related processes running ie. xfcond and I was still able to reproduce the bug.

ss -t gives me 13 outbound peers nload tells me there is around 0.5-1mbps burst and <0.5 average traffic, as opposed to >5mbps average while syncing blocks successfully.

ps aux gives me (after I closed monerod):

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.2  98204 10216 ?        Ss   17:30   0:04 /sbin/init
root           2  0.0  0.0      0     0 ?        S    17:30   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   17:30   0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   17:30   0:00 [rcu_par_gp]
root           6  0.0  0.0      0     0 ?        I<   17:30   0:00 [kworker/0:0H-events_highpri]
root           8  0.0  0.0      0     0 ?        I<   17:30   0:00 [mm_percpu_wq]
root           9  0.0  0.0      0     0 ?        S    17:30   0:00 [rcu_tasks_rude_]
root          10  0.0  0.0      0     0 ?        S    17:30   0:00 [rcu_tasks_trace]
root          11  0.0  0.0      0     0 ?        S    17:30   0:00 [ksoftirqd/0]
root          12  0.1  0.0      0     0 ?        I    17:30   0:07 [rcu_sched]
root          13  0.0  0.0      0     0 ?        S    17:30   0:00 [migration/0]
root          14  0.0  0.0      0     0 ?        I    17:30   0:00 [kworker/0:1-mm_percpu_wq]
root          15  0.0  0.0      0     0 ?        S    17:30   0:00 [cpuhp/0]
root          16  0.0  0.0      0     0 ?        S    17:30   0:00 [cpuhp/1]
root          17  0.0  0.0      0     0 ?        S    17:30   0:00 [migration/1]
root          18  0.0  0.0      0     0 ?        S    17:30   0:00 [ksoftirqd/1]
root          20  0.0  0.0      0     0 ?        I<   17:30   0:00 [kworker/1:0H-events_highpri]
root          21  0.0  0.0      0     0 ?        S    17:30   0:00 [cpuhp/2]
root          22  0.0  0.0      0     0 ?        S    17:30   0:00 [migration/2]
root          23  0.0  0.0      0     0 ?        S    17:30   0:07 [ksoftirqd/2]
root          25  0.0  0.0      0     0 ?        I<   17:30   0:00 [kworker/2:0H-events_highpri]
root          26  0.0  0.0      0     0 ?        S    17:30   0:00 [cpuhp/3]
root          27  0.0  0.0      0     0 ?        S    17:30   0:00 [migration/3]
root          28  0.0  0.0      0     0 ?        S    17:30   0:00 [ksoftirqd/3]
root          30  0.0  0.0      0     0 ?        I<   17:30   0:00 [kworker/3:0H-events_highpri]
root          34  0.0  0.0      0     0 ?        S    17:30   0:00 [kdevtmpfs]
root          35  0.0  0.0      0     0 ?        I<   17:30   0:00 [netns]
root          36  0.0  0.0      0     0 ?        S    17:30   0:00 [kauditd]
root          37  0.0  0.0      0     0 ?        S    17:30   0:00 [khungtaskd]
root          38  0.0  0.0      0     0 ?        S    17:30   0:00 [oom_reaper]
root          39  0.0  0.0      0     0 ?        I<   17:30   0:00 [writeback]
root          40  0.0  0.0      0     0 ?        S    17:30   0:01 [kcompactd0]
root          41  0.0  0.0      0     0 ?        SN   17:30   0:00 [ksmd]
root          42  0.0  0.0      0     0 ?        SN   17:30   0:02 [khugepaged]
root          63  0.0  0.0      0     0 ?        I<   17:30   0:00 [kintegrityd]
root          64  0.0  0.0      0     0 ?        I<   17:30   0:00 [kblockd]
root          65  0.0  0.0      0     0 ?        I<   17:30   0:00 [blkcg_punt_bio]
root          66  0.0  0.0      0     0 ?        I<   17:30   0:00 [edac-poller]
root          67  0.0  0.0      0     0 ?        I<   17:30   0:00 [devfreq_wq]
root          68  0.1  0.0      0     0 ?        I<   17:30   0:15 [kworker/1:1H-kblockd]
root          69  0.0  0.0      0     0 ?        S    17:30   0:01 [kswapd0]
root          70  0.0  0.0      0     0 ?        I<   17:30   0:00 [kthrotld]
root          71  0.0  0.0      0     0 ?        I<   17:30   0:00 [acpi_thermal_pm]
root          72  0.5  0.0      0     0 ?        I<   17:30   0:40 [kworker/3:1H-kblockd]
root          73  0.0  0.0      0     0 ?        I<   17:30   0:00 [ipv6_addrconf]
root          82  0.0  0.0      0     0 ?        I<   17:30   0:00 [kstrp]
root          85  0.0  0.0      0     0 ?        I<   17:30   0:00 [zswap-shrink]
root          86  0.0  0.0      0     0 ?        I<   17:30   0:00 [kworker/u9:0]
root         135  0.0  0.0      0     0 ?        I<   17:30   0:00 [ata_sff]
root         136  0.0  0.0      0     0 ?        S    17:30   0:00 [scsi_eh_0]
root         137  0.0  0.0      0     0 ?        I<   17:30   0:00 [scsi_tmf_0]
root         138  0.0  0.0      0     0 ?        S    17:30   0:00 [scsi_eh_1]
root         139  0.0  0.0      0     0 ?        I<   17:30   0:00 [scsi_tmf_1]
root         144  0.0  0.0      0     0 ?        S    17:30   0:00 [scsi_eh_2]
root         145  0.0  0.0      0     0 ?        I<   17:30   0:00 [scsi_tmf_2]
root         147  0.0  0.0      0     0 ?        S    17:30   0:02 [irq/18-vmwgfx]
root         148  0.0  0.0      0     0 ?        I<   17:30   0:00 [ttm_swap]
root         149  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc0]
root         150  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc1]
root         151  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc2]
root         152  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc3]
root         153  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc4]
root         154  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc5]
root         155  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc6]
root         156  0.0  0.0      0     0 ?        S    17:30   0:00 [card0-crtc7]
root         157  0.1  0.0      0     0 ?        I<   17:30   0:11 [kworker/2:1H-kblockd]
root         160  0.0  0.0      0     0 ?        I    17:30   0:00 [kworker/2:2-cgroup_destroy]
root         161  0.1  0.0      0     0 ?        I<   17:30   0:11 [kworker/0:1H-kblockd]
root         190  0.0  0.0      0     0 ?        S    17:30   0:00 [jbd2/sda1-8]
root         191  0.0  0.0      0     0 ?        I<   17:30   0:00 [ext4-rsv-conver]
root         229  0.0  0.4  48332 17084 ?        Ss   17:30   0:00 /lib/systemd/systemd-journald
root         251  0.0  0.1  21916  5040 ?        Ss   17:30   0:00 /lib/systemd/systemd-udevd
root         294  0.0  0.0      0     0 ?        I<   17:30   0:00 [iprt-VBoxWQueue]
root         310  0.0  0.0      0     0 ?        I<   17:30   0:00 [cryptd]
root         414  0.0  0.0   6684  2648 ?        Ss   17:30   0:00 /usr/sbin/cron -f
message+     415  0.0  0.1   8208  4680 ?        Ss   17:30   0:01 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root         422  0.0  0.0 220740  3956 ?        Ssl  17:30   0:00 /usr/sbin/rsyslogd -n -iNONE
root         424  0.0  0.1  21908  7144 ?        Ss   17:30   0:00 /lib/systemd/systemd-logind
root         425  0.0  0.2 390204 10536 ?        Ssl  17:30   0:00 /usr/libexec/udisks2/udisksd
root         438  0.1  0.0      0     0 ?        I    17:30   0:13 [kworker/0:3-events]
root         515  0.0  0.1 234236  7672 ?        Ssl  17:30   0:00 /usr/libexec/polkitd --no-debug
root         615  0.0  0.0   5784  1736 tty1     Ss+  17:30   0:00 /sbin/agetty -o -p -- \u --noclear tty1 linux
root         930  0.0  0.0   6876  2856 tty5     Ss   17:33   0:00 /bin/login -p --
joe          936  0.0  0.2  15388  8792 ?        Ss   17:33   0:00 /lib/systemd/systemd --user
joe          937  0.0  0.0 100920  2652 ?        S    17:33   0:00 (sd-pam)
joe          943  0.0  0.1   8240  4848 tty5     S    17:33   0:00 -bash
root        1007  0.0  0.0   6864  3212 tty6     Ss   17:45   0:00 /bin/login -p --
root        1030  0.0  0.0      0     0 ?        I    18:21   0:03 [kworker/u8:0-ext4-rsv-conversion]
joe         1038  0.0  0.1   8112  4664 tty6     S+   18:28   0:00 -bash
root        1047  0.0  0.0      0     0 ?        I    18:33   0:00 [kworker/2:0-cgroup_destroy]
root        1048  0.0  0.0      0     0 ?        I    18:33   0:00 [kworker/3:2-events]
root        1078  0.0  0.0      0     0 ?        I    19:21   0:00 [kworker/1:1-mm_percpu_wq]
root        1080  0.0  0.0      0     0 ?        I    19:26   0:00 [kworker/1:0-mm_percpu_wq]
root        1081  0.0  0.0      0     0 ?        I    19:29   0:00 [kworker/1:3-ata_sff]
root        1084  0.0  0.0      0     0 ?        I    19:30   0:00 [kworker/u8:2-flush-8:0]
root        1087  0.0  0.0      0     0 ?        I    19:30   0:00 [kworker/3:0-events]
root        1089  0.0  0.0      0     0 ?        I    19:36   0:00 [kworker/1:2-mm_percpu_wq]
root        1095  0.0  0.0      0     0 ?        I    19:37   0:00 [kworker/0:0-cgroup_destroy]
joe         1189  0.3  0.1   7896  4352 ?        Ss   19:39   0:00 /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
root        1363  0.0  0.0      0     0 ?        I    19:39   0:00 [kworker/2:1-mm_percpu_wq]
root        1364  0.0  0.0      0     0 ?        I    19:40   0:00 [kworker/u8:1-events_unbound]
root        1373  0.0  0.0      0     0 ?        I    19:40   0:00 [kworker/u8:3]
joe         1379  0.0  0.0   9700  3336 tty5     R+   19:40   0:00 ps -aux

Logs during this run: https://gist.githubusercontent.com/tryphe/cbfe7fd727180c5a3b93cfeb5dd298a4/raw/66b16c07a34d7470d7b69aea79b9d982a5690e01/gistfile1.txt

There is a reoccurring problem with boost::wrapexcept<boost::bad_weak_ptr> around the point of failure, but only after some other call in a thread has failed.

tryphe commented 2 years ago

I'll build with make debug and report back with hopefully readable traces :)

selsta commented 2 years ago

I'll try to sync up on a VPS and see if I can reproduce. Just to confirm, you used the master branch?

tryphe commented 2 years ago

I'll try to sync up on a VPS and see if I can reproduce. Just to confirm, you used the master branch?

Thanks. Indeed, head is at d562deaaa950979b7a31a441a8f02a00013e26d6

tryphe commented 2 years ago

After 10 minutes of running I was able to reproduce the bug in debug mode. 7 blocks sync and then it gets stuck, lots of network chatter but no blocks being added. I gave the hangup signal after 10 minutes to keep the log smaller.

$ cat ~/.bitmonero/bitmonero.log | grep "Your node"
2022-03-04 01:58:25.247 [shortened for readability] Sync data returned a new top block candidate: 2512046 -> 2571962 [Your node is 59916 blocks (2.7 months) behind] 
2022-03-04 02:01:49.392 [shortened for readability] Sync data returned a new top block candidate: 2512053 -> 2571966 [Your node is 59913 blocks (2.7 months) behind] 
2022-03-04 02:03:42.697 [shortened for readability] Sync data returned a new top block candidate: 2512053 -> 2571968 [Your node is 59915 blocks (2.7 months) behind] 
2022-03-04 02:08:20.795 [shortened for readability] Sync data returned a new top block candidate: 2512053 -> 2571970 [Your node is 59917 blocks (2.7 months) behind] 

Log with --log-level 1: https://gist.githubusercontent.com/tryphe/bc38a41e752fd127d3092901e1178051/raw/c17ca0b278450c5d29d89014604eb9fe87861525/gistfile1.txt

Please let me know if there is anything else I can provide.

tryphe commented 2 years ago

It looks like the errors are possibly unrelated to the syncing problem. I tried a handful of old release versions back as far as 0.17.1.3. I still experience the syncing problem but no runtime errors.

selsta commented 2 years ago

I started the sync process on a fresh VPS now. Will update you if it syncs fine or if I gets stuck too.

selsta commented 2 years ago

8 hours later, so far no issues

Synced 2505056/2575077 (97%, 70021 left)
selsta commented 2 years ago

I unfortunately couldn't reproduce this on a fresh VPS with v0.17.3.0. Sync was without any issues.

tryphe commented 2 years ago

I'm not really sure what the differences are between our machines.

While syncing past 95%, it takes me somewhere between 5 to 40 minutes to experience the stuck syncing problem. While fully synced, it takes somewhere between 0 to 5 hours to experience the stuck syncing problem.

I'm on a residential connection so I have some packet loss. About 0.05% to 0.1%. Maybe that's the problem? I'll try and induce more packet loss to see if it makes the problem worse.

selsta commented 2 years ago

Residential connection with minor packet loss should be fine.

The logs you posted did hint at some networking related issues, but it's not exactly clear yet.

Regloom commented 2 years ago

I have exactly the same problem on my home pc running latest Arch linux kernel (5.16.14-arch1-1), monerod is launched inside docker container (as per https://sethforprivacy.com/guides/run-a-p2pool-node/). Absolutely the same behavior as described by @tryphe. Will try to sync natively in arch with monero CLI. Overall network download speed according to "sync_info" seems to drop significantly to 3-60kB/s after 3-30mins of monerod container launch after 97% sync (block 2500000).

selsta commented 2 years ago

@Regloom just to confirm, it gets completely stuck? or just slow sync?

selsta commented 2 years ago

After block 2478000 there are no more checkpoints so sync will significantly slow down.

Regloom commented 2 years ago

it gets completely stuck? or just slow sync?

It slows down first to very low speed and then it stucks (no new blocks within hours). Restarting monerod helps for a while and then again is slows down and stops syncing. Now trying to sync without docker in Arch itself.

Regloom commented 2 years ago

Now trying to sync without docker in Arch itself.

Same behaviour. Here is a log 0 of monerod.

selsta commented 2 years ago

Can you also post sync_info output?

Gingeropolous commented 2 years ago

have you tried adding specific peers?

you can add a priority one with

--add-priority-node 176.9.0.187

or

--add-exclusive-node 176.9.0.187

that IP is that of xmrchain.net , so it should work.

selsta commented 2 years ago

Yea my idea was too to add a couple exclusive nodes to make sure it's not a malicious peer.

Regloom commented 2 years ago

have you tried adding specific peers?

Thanks for advice! I'll give it a try.

selsta commented 2 years ago

Make sure to add both --add-exclusive-node 88.198.199.23 and --add-exclusive-node 176.9.0.187, syncing from just one peer can be spotty. The other IP is mine.

Regloom commented 2 years ago

Can you also post sync_info output?

Here you are.

Regloom commented 2 years ago

Make sure to add both --add-exclusive-node 88.198.199.23 and --add-exclusive-node 176.9.0.187, syncing from just one peer can be spotty. The other IP is mine.

Executed monerod with --add-exclusive-node=88.198.199.23:18081 --add-exclusive-node=176.9.0.187:18081 Now it shows me that I'm fully synced: ~/monero $ ./monero-x86_64-linux-gnu-v0.17.3.0/monerod status 2022-03-17 20:11:59.127 I Monero 'Oxygen Orion' (v0.17.3.0-release) Height: 2507075/2507075 (100.0%) on mainnet, not mining, net hash 2.65 GH/s, v14, 0(out)+0(in) connections, uptime 0d 0h 8m 4s is it correct?

selsta commented 2 years ago

That doesn't sound right.

Is it a node with restricted RPC?

Regloom commented 2 years ago

That doesn't sound right.

Is it a node with restricted RPC?

Well, I guess so. Here is my current config. I've tried several other options (commented) but no luck.

selsta commented 2 years ago

The first step would be to try to sync without any config at all, just a custom data dir if necessary. Also adding --prune-blockchain is fine if you don't have enough space for the full blockchain, but apart from that no config.

Regloom commented 2 years ago

OK, here is sync_info logs after night running: monerod --prune-blockchain --add-priority-node=88.198.199.23 --add-priority-node=176.9.0.187 It stuck on Height: 2507351/2581844 (97.1%).

I've also tried --add-exclusive-node and it blocked both nodes after 20 mins:

2022-03-18 06:04:56.330 I SYNCHRONIZATION started 2022-03-18 06:05:09.986 I Host 88.198.199.23 blocked. 2022-03-18 06:08:12.024 W monerod is now disconnected from the network 2022-03-18 06:08:12.978 I [176.9.0.187:18080 OUT] Sync data returned a new top block candidate: 2507361 -> 2582080 [Your node is 74719 blocks (3.4 months) behind] 2022-03-18 06:08:12.978 I SYNCHRONIZATION started 2022-03-18 06:08:24.067 W monerod is now disconnected from the network 2022-03-18 06:08:24.747 I [176.9.0.187:18080 OUT] Sync data returned a new top block candidate: 2507371 -> 2582080 [Your node is 74709 blocks (3.4 months) behind] 2022-03-18 06:08:24.747 I SYNCHRONIZATION started 2022-03-18 06:08:35.676 I Host 176.9.0.187 blocked.

selsta commented 2 years ago

@Regloom can you also upload log level 2 logs? You don't have to add any priority nodes.

Regloom commented 2 years ago

can you also upload log level 2 logs?

Here you are, level 2 logs without priority nodes. bitmonero.log.zip

p.s. Also tried running from latest Debian Live image and got the same result. At least distro is not the cause. Can it be due to my hardware (e.g. AMD Athlon II X4 6400e CPU with no L3 cache and AES support)?

tryphe commented 2 years ago

Fwiw, I'm also running on an AMD FX 8300 series, but it's a different socket (AM3+ vs AM3 for Athlon II) and different die size (32nm vs 45nm for Athlon II).

Regloom commented 2 years ago

AMD FX 8300 series,

That might be somehow related. What's your motherboard?

tryphe commented 2 years ago

AMD FX 8300 series,

That might be somehow related. What's your motherboard?

I think it's this thing: ASUS TUF SABERTOOTH 990FX

AMD 990FX chipset.

I'll move the problematic VM to a newer Intel hypervisor to attempt to rule out any hardware issue. Will try to report back in a few days.