[bug]: LND with Neutrino never fully starts up

gkrizek commented 8 months ago

Background

We have had several nodes that fail to start up when using Neutrino. This seems to have started with 0.17.0 I believe because we haven't experienced this previously. Essentially after a restart the node sits at RPC_ACTIVE indefinitely and never reaches SERVER_ACTIVE. When running a getinfo call, it reports as synced_to_chain: false. I inspected the Neutrino instances on all affected nodes and neutrino is reporting as synced and has at least 8 peers. In the logs I can see that the node seems to be syncing the chain because it continuously spits out:

2023-11-07 13:56:50.367 [INF] BTCN: Processed 1 block in the last 26m56.33s (height 815717, 2023-11-07 13:56:31 +0000 UTC)
2023-11-07 13:30:04.036 [INF] BTCN: Verified 1 filter header in the last 10m37.3s (height 815716, 2023-11-07 13:29:46 +0000 UTC)
2023-11-07 13:29:54.033 [INF] BTCN: Attempting to fetch set of un-checkpointed filters at height=815716, hash=000000000000000000043a2be219746d6751b79cc75139aaabc402f7ca7a3b6f

However, looking at closer to the restart time, it seems like it's stuck rescanning the chain. Filtering only on the LNWL logs I can see this:

2023-11-05 07:34:47.189 [WRN] LNWL: Query(0) from peer 91.134.145.202:8333 failed, rescheduling: did not get response before timeout
2023-11-05 07:34:39.188 [WRN] LNWL: Query(0) from peer 70.68.70.171:8333 failed, rescheduling: did not get response before timeout
2023-11-05 07:34:35.188 [WRN] LNWL: Query(0) failed with error: did not get response before timeout. Timing out.
2023-11-05 07:34:35.188 [WRN] LNWL: Query(0) from peer 5rljfc3fdgnhpsvuju3jsblsdktyjnpzedr264plemli2f6hjv4wutyd.onion:8333 failed, rescheduling: did not get response before timeout
2023-11-05 07:32:07.051 [INF] LNWL: Started rescan from block 00000000000000000000a6d3415f09ab5439ad2dd262c7840a4dded9be41271c (height 814836) for 14 addresses
2023-11-05 07:32:05.335 [INF] LNWL: The wallet has been unlocked without a time limit
2023-11-05 07:32:01.650 [INF] LNWL: Starting web API fee estimator
2023-11-05 07:32:01.138 [INF] LNWL: Opened wallet

And nothing meaningful happens after that. So it seems like with Neutrino LND gets stuck syncing the chain and never finishes. I don't know of a way to get more info on the status of re-syncing.

Your environment

version of lnd: 0.17.0
which operating system (uname -a on *Nix): amazon linux 2
version of btcd, bitcoind, or other backend: neutrino
any other relevant environment details

Steps to reproduce

I don't know of a way to reliably reproduce this. It seems fairly random when it happens.

Expected behaviour

LND starts up like normal and is able to continue operations

Actual behaviour

LND gets stuck starting and sits for days without being able to do anything.

Output

$ lncli neutrino status
{
    "active": true,
    "synced": true,
    "block_height": 815717,
    "block_hash": "0000000000000000000219e6da7c862f6c10bf236fbde945640789ee7657e3d6",
    "peers": [
        "lvsf4gflp5drwy3sfvqk4lklltb4ul4uehyepvvpfvqncn6gnue5abid.onion:8333",
        "4h62mftc5bo3lz5dluty3s42todzd4jjmagbbeqbbccmunmaa7h6ocyd.onion:8333",
        "vpgtduxbksexfp5twhzlzmtzayxci6ui4axgp6xaoayg5o6l242v2xqd.onion:8333",
        "78.203.56.219:8333",
        "62.117.216.176:8333",
        "svbdvvsvpvjiotkspmwas6vghpb3jwa2piuos7kixoqy2ar7xbwnz4yd.onion:8333",
        "35.163.36.246:8333",
        "4q6xkttr2vvycn6e4uw55feflsq3oyrmazn7tfz45yxni73yhjj42jid.onion:8333"
    ]
}

$ lncli getinfo
{
    "version": "0.17.0-beta commit=lightning-terminal-v0.11.0-alpha",
    "commit_hash": "8d7c53b2d1ccab6abc41a01b7c61e874cf194542",
    "identity_pubkey": "<redacted>",
    "alias": "<redacted>",
    "color": "#ff5000",
    "num_pending_channels": 0,
    "num_active_channels": 0,
    "num_inactive_channels": 6,
    "num_peers": 0,
    "block_height": 815717,
    "block_hash": "0000000000000000000219e6da7c862f6c10bf236fbde945640789ee7657e3d6",
    "best_header_timestamp": "1699365391",
    "synced_to_chain": false,
    "synced_to_graph": false,
    "testnet": false,
    "chains": [
        {
            "chain": "bitcoin",
            "network": "mainnet"
        }
    ],
    "uris": [
    ],
...
}

gkrizek commented 8 months ago

Also, switching the backend from Neutrino to btcd resolves the issue. It can perform a rescan and start up.

Roasbeef commented 8 months ago

Do you also have tor logs? It looks like all the connections are timing out. Assuming this is a pure (non-hybird) tor config as well?

Roasbeef commented 8 months ago

In this state, does it still get in new block headers as well? Or is it totally stuck trying to fetch block headers?

gkrizek commented 8 months ago

@Roasbeef This node has the --tor.skip-proxy-for-clearnet-targets flag enabled. So I'm assuming it wouldn't use Tor for clearnet bitcoin nodes?

I can see from neutrino status and the logs that it is indeed getting new block headers. Neutrino still says it's synced to the chain but getinfo says no.

For Tor logs, nothing too interesting. A few HSDir errors.

Nov 08 17:45:01.000 [notice] Closed 1 streams for service [scrubbed].onion for reason resolve failed. Fetch status: No more HSDir available to query.
Nov 08 19:30:52.000 [notice] Heartbeat: Tor's uptime is 3 days 12:00 hours, with 3 circuits open. I've sent 47.45 MB and received 79.27 MB. I've received 1306 connections on IPv4 and 0 on IPv6. I've made 8 connections with IPv4 and 0 with IPv6.
Nov 08 19:30:52.000 [notice] While bootstrapping, fetched this many bytes: 663887 (consensus network-status fetch); 11908 (authority cert fetch); 4582319 (microdescriptor fetch)
Nov 08 19:30:52.000 [notice] While not bootstrapping, fetched this many bytes: 1357366 (consensus network-status fetch); 1078735 (microdescriptor fetch)
Nov 08 19:30:52.000 [notice] Average packaged cell fullness: 11.867%. TLS write overhead: 4%

Roasbeef commented 8 months ago

When it's in that state, are you able to obtain a goroutine dump? That would let us see what the wallet is up to. AFAICT, rescan failed (?) as it wasn't able to get some blocks in time from peers.

Also do you have logs from start up onwards? That might show if something happened w/ the wallet on the backend.

gkrizek commented 8 months ago

Here's a goroutine dump of it in the idle state. goroutine.zip

I also turned on debug logging to see if that helped at all. I filtered on the same LNWL logs and it still looks like it's stuck rescanning but we are connecting and disconnecting from peers

2023-11-09 02:40:50.890 [DBG] LNWL: Peer 54.213.127.21:8333 for worker disconnected
2023-11-09 02:40:50.890 [DBG] LNWL: Starting worker for peer 54.213.127.21:8333
2023-11-09 02:39:14.690 [DBG] LNWL: Starting worker for peer 34samw5duaktk7z5vsps2qvscv4vphqfrpaoqtbrsvmti6aupqsnzaqd.onion:8333
2023-11-09 02:38:52.595 [DBG] LNWL: Peer 186.218.58.150:8333 for worker disconnected
2023-11-09 02:38:52.452 [DBG] LNWL: Starting worker for peer 186.218.58.150:8333Show context
2023-11-09 02:38:51.992 [DBG] LNWL: Peer 4ikbxzfsuymqulk7xrh65w2ifcdg6yeyrlbb7iiqhakfoxds3gp3ihid.onion:8333 for worker disconnected
2023-11-09 02:38:51.982 [DBG] LNWL: Peer 4engo7p5nea6hbojw3bi5ykjqavbzmdgv3gr3umvgdcxb4t3ewjijiid.onion:8333 for worker disconnected
2023-11-09 02:38:50.353 [DBG] LNWL: Peer 51.158.54.115:8333 for worker disconnected
2023-11-09 02:38:50.226 [DBG] LNWL: Peer 98.127.230.215:8333 for worker disconnected
2023-11-09 02:33:00.910 [DBG] LNWL: Starting worker for peer 4ikbxzfsuymqulk7xrh65w2ifcdg6yeyrlbb7iiqhakfoxds3gp3ihid.onion:8333
2023-11-09 02:32:55.876 [DBG] LNWL: Peer 173.212.220.216:8333 for worker disconnected
2023-11-09 02:32:55.346 [DBG] LNWL: Starting worker for peer 173.212.220.216:8333
2023-11-09 02:32:55.262 [DBG] LNWL: Starting worker for peer 98.127.230.215:8333
2023-11-09 02:32:55.112 [DBG] LNWL: Peer m665l2x4zmaphdm4gd2qa72eerw2du2msfzxs6ia2znkyndy33elktyd.onion:8333 for worker disconnected
2023-11-09 02:32:54.886 [DBG] LNWL: Peer olhpaqojesa5did2flrztexuiuk4e2nbpid4f7vtgjueojylgvpjgqid.onion:8333 for worker disconnected
2023-11-09 02:28:27.300 [DBG] LNWL: Starting worker for peer 4engo7p5nea6hbojw3bi5ykjqavbzmdgv3gr3umvgdcxb4t3ewjijiid.onion:8333
2023-11-09 02:24:22.930 [DBG] LNWL: Starting worker for peer m665l2x4zmaphdm4gd2qa72eerw2du2msfzxs6ia2znkyndy33elktyd.onion:8333
2023-11-09 02:22:17.333 [DBG] LNWL: Starting worker for peer olhpaqojesa5did2flrztexuiuk4e2nbpid4f7vtgjueojylgvpjgqid.onion:8333
2023-11-09 02:22:13.058 [DBG] LNWL: Peer 4ikbxzfsuymqulk7xrh65w2ifcdg6yeyrlbb7iiqhakfoxds3gp3ihid.onion:8333 for worker disconnected
2023-11-09 02:22:11.606 [DBG] LNWL: Peer arube62wpn2dl3kyfnm3yqtzmqisxwujj2e3hdic4nazero6f6lkr7id.onion:8333 for worker disconnected
2023-11-09 02:22:10.248 [DBG] LNWL: Peer 70.68.70.171:8333 for worker disconnected
2023-11-09 02:09:21.955 [DBG] LNWL: Starting worker for peer 70.68.70.171:8333
2023-11-09 02:09:21.889 [DBG] LNWL: Peer 2x3qnjjxxwkix7vgfbdkqvswnpfk3pc7qbhpy24u7skdaao3koqepjid.onion:8333 for worker disconnected
2023-11-09 01:44:51.049 [DBG] LNWL: Starting worker for peer n35aod4iraixvkzd5pwvkkyld4zwkog7cyaxs6cqwovstg6a26idkgid.onion:8333
2023-11-09 01:42:44.329 [DBG] LNWL: Peer 80.253.94.252:8333 for worker disconnected
2023-11-09 01:42:43.164 [DBG] LNWL: Starting worker for peer 80.253.94.252:8333
2023-11-09 01:40:50.690 [DBG] LNWL: Starting worker for peer 51.158.54.115:8333
2023-11-09 01:40:42.406 [DBG] LNWL: Starting worker for peer 2x3qnjjxxwkix7vgfbdkqvswnpfk3pc7qbhpy24u7skdaao3koqepjid.onion:8333
2023-11-09 01:40:33.989 [DBG] LNWL: Peer pfbwr6jvtyge3itqguf5bfhy5si4olumedf3ot2dlfxu27oxoamsngyd.onion:8333 for worker disconnected
2023-11-09 01:40:33.989 [DBG] LNWL: Peer s2d52bbttuwcl3pdrwzhxpmhtxn3jg23havjqg5eygwhtiw6lgyelpqd.onion:8333 for worker disconnected
2023-11-09 01:40:33.498 [DBG] LNWL: Peer r4nmrs2ewv2bfs6z4z7a6ypcey6ndmyweuw5puktnlpcesl3e25pauqd.onion:8333 for worker disconnected
2023-11-09 01:37:30.608 [DBG] LNWL: Starting worker for peer r4nmrs2ewv2bfs6z4z7a6ypcey6ndmyweuw5puktnlpcesl3e25pauqd.onion:8333
2023-11-09 01:37:25.847 [DBG] LNWL: Peer 204.16.244.120:8333 for worker disconnected
2023-11-09 01:37:25.055 [DBG] LNWL: Starting worker for peer 204.16.244.120:8333
2023-11-09 01:33:31.954 [DBG] LNWL: Starting worker for peer 4ikbxzfsuymqulk7xrh65w2ifcdg6yeyrlbb7iiqhakfoxds3gp3ihid.onion:8333
2023-11-09 01:33:28.114 [DBG] LNWL: Peer 134.122.124.10:8333 for worker disconnected
2023-11-09 01:33:26.239 [DBG] LNWL: Peer 217.79.181.38:8333 for worker disconnected
2023-11-09 01:31:37.330 [DBG] LNWL: Starting worker for peer 134.122.124.10:8333
2023-11-09 01:31:34.709 [DBG] LNWL: Starting worker for peer 217.79.181.38:8333
2023-11-09 01:31:29.564 [DBG] LNWL: Peer y74nborr62sic5b4tmpbvllhcshjynorx2e4noju64wszldpm6yjw7ad.onion:8333 for worker disconnected
2023-11-09 01:31:29.292 [DBG] LNWL: Peer v7a457n6habghy6hprebu25z4ewp7mg327qayeglc5ylmh4wdhwjqhid.onion:8333 for worker disconnected
2023-11-09 01:31:27.314 [DBG] LNWL: Starting worker for peer xw64elpjz65dbzx2siom7vuk5bdcgcpgxywkc4zxvajals4fcciuhxid.onion:8333
2023-11-09 01:27:48.141 [DBG] LNWL: Starting worker for peer arube62wpn2dl3kyfnm3yqtzmqisxwujj2e3hdic4nazero6f6lkr7id.onion:8333
2023-11-09 01:27:35.497 [DBG] LNWL: Starting worker for peer s2d52bbttuwcl3pdrwzhxpmhtxn3jg23havjqg5eygwhtiw6lgyelpqd.onion:8333
2023-11-09 01:27:25.325 [DBG] LNWL: Starting worker for peer v7a457n6habghy6hprebu25z4ewp7mg327qayeglc5ylmh4wdhwjqhid.onion:8333
2023-11-09 01:27:13.180 [DBG] LNWL: Peer olhpaqojesa5did2flrztexuiuk4e2nbpid4f7vtgjueojylgvpjgqid.onion:8333 for worker disconnected
2023-11-09 01:27:11.994 [DBG] LNWL: Peer ryzng757sohnpwruiistrxkhpvtkasfz74rtbr3of72zzgiijyi65iid.onion:8333 for worker disconnected
2023-11-09 01:27:11.547 [DBG] LNWL: Peer v4sxmvmr5a7tqzbeshldnn4q4izpnvvdssv4pyx5wj7uuv7tty53lwid.onion:8333 for worker disconnected
2023-11-09 01:27:10.917 [DBG] LNWL: Peer 156.34.170.87:8333 for worker disconnected
2023-11-09 01:23:29.191 [DBG] LNWL: Starting worker for peer y74nborr62sic5b4tmpbvllhcshjynorx2e4noju64wszldpm6yjw7ad.onion:8333
2023-11-09 01:23:29.179 [DBG] LNWL: Starting worker for peer v4sxmvmr5a7tqzbeshldnn4q4izpnvvdssv4pyx5wj7uuv7tty53lwid.onion:8333
2023-11-09 01:23:29.070 [DBG] LNWL: Starting worker for peer pfbwr6jvtyge3itqguf5bfhy5si4olumedf3ot2dlfxu27oxoamsngyd.onion:8333
2023-11-09 01:23:23.561 [DBG] LNWL: Peer ziuig4bwgrdgblle4q76kovuwby76hc3633jz3ecfdfdsknnm3y5sfad.onion:8333 for worker disconnected
2023-11-09 01:23:22.873 [DBG] LNWL: Peer dculvkhptkmr655fgsypq6t7j36nyrdamnmc7febjgscq2iqqsyl2fad.onion:8333 for worker disconnected
2023-11-09 01:23:22.231 [DBG] LNWL: Peer 50.117.132.178:8333 for worker disconnected
2023-11-09 01:21:30.904 [DBG] LNWL: Starting worker for peer olhpaqojesa5did2flrztexuiuk4e2nbpid4f7vtgjueojylgvpjgqid.onion:8333
2023-11-09 01:21:20.570 [DBG] LNWL: Starting worker for peer dculvkhptkmr655fgsypq6t7j36nyrdamnmc7febjgscq2iqqsyl2fad.onion:8333
2023-11-09 01:21:16.014 [DBG] LNWL: Peer 81.28.248.132:8333 for worker disconnected
2023-11-09 01:21:15.999 [DBG] LNWL: Starting worker for peer 81.28.248.132:8333
2023-11-09 01:19:15.743 [DBG] LNWL: Peer 88.13.3.193:8333 for worker disconnected
2023-11-09 01:19:14.446 [DBG] LNWL: Starting worker for peer 88.13.3.193:8333
2023-11-09 01:18:24.457 [WRN] LNWL: Query(0) from peer 50.117.132.178:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:17:52.456 [WRN] LNWL: Query(0) from peer sp72v3vixgs632tqsyawt3jbfsz4wum2p35wy6ihadlzpl5fes5svdqd.onion:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:17:20.455 [WRN] LNWL: Query(0) from peer ryzng757sohnpwruiistrxkhpvtkasfz74rtbr3of72zzgiijyi65iid.onion:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:16:48.455 [WRN] LNWL: Query(0) from peer 50.117.132.178:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:16:16.454 [WRN] LNWL: Query(0) from peer ziuig4bwgrdgblle4q76kovuwby76hc3633jz3ecfdfdsknnm3y5sfad.onion:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:16:00.453 [WRN] LNWL: Query(0) from peer sp72v3vixgs632tqsyawt3jbfsz4wum2p35wy6ihadlzpl5fes5svdqd.onion:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:15:52.452 [WRN] LNWL: Query(0) from peer 156.34.170.87:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:15:48.452 [DBG] LNWL: Batch 0 timed out
2023-11-09 01:15:48.452 [WRN] LNWL: Query(0) failed with error: did not get response before timeout. Timing out.
2023-11-09 01:15:48.452 [WRN] LNWL: Query(0) from peer 135.181.215.237:8333 failed, rescheduling: did not get response before timeout
2023-11-09 01:13:30.380 [DBG] LNWL: Starting worker for peer ziuig4bwgrdgblle4q76kovuwby76hc3633jz3ecfdfdsknnm3y5sfad.onion:8333
2023-11-09 01:13:21.708 [DBG] LNWL: Peer 149.102.139.50:8333 for worker disconnected
2023-11-09 01:13:21.474 [DBG] LNWL: Starting worker for peer 149.102.139.50:8333
2023-11-09 01:13:21.047 [DBG] LNWL: Peer 35.208.62.130:8333 for worker disconnected
2023-11-09 01:13:20.863 [DBG] LNWL: Starting worker for peer 35.208.62.130:8333
2023-11-09 01:13:13.852 [DBG] LNWL: Peer pfbwr6jvtyge3itqguf5bfhy5si4olumedf3ot2dlfxu27oxoamsngyd.onion:8333 for worker disconnected
2023-11-09 01:13:13.443 [DBG] LNWL: Starting worker for peer 50.117.132.178:8333
2023-11-09 01:13:13.326 [DBG] LNWL: Peer 6uwtmmk7okzjibyas456nmgkpsfug6qqalvaktnyux4icgnq7cz452id.onion:8333 for worker disconnected
2023-11-09 01:13:13.131 [DBG] LNWL: Peer f4az4lpww5qvl7yydkwxsxfvacpmkdxcjfsjg3xqeomrdd3yplaig7ad.onion:8333 for worker disconnected
2023-11-09 01:11:57.566 [DBG] LNWL: Adding new batch(0) of 1 queries to work queue
2023-11-09 01:11:57.562 [INF] LNWL: Started rescan from block 00000000000000000000a6d3415f09ab5439ad2dd262c7840a4dded9be41271c (height 814836) for 14 addresses
2023-11-09 01:11:56.848 [DBG] LNWL: Chain backend synced to tip!
2023-11-09 01:11:56.473 [DBG] LNWL: Starting worker for peer 156.34.170.87:8333
2023-11-09 01:11:55.847 [DBG] LNWL: Waiting for chain backend to sync to tip
2023-11-09 01:11:55.847 [DBG] LNWL: Birthday block has already been verified: height=753971, hash=0000000000000000000130a1fbd4c75691bf953c255bf15af8f508d3c8e887b5
2023-11-09 01:11:55.846 [INF] LNWL: The wallet has been unlocked without a time limit
2023-11-09 01:11:52.155 [INF] LNWL: Starting web API fee estimator
2023-11-09 01:11:51.646 [INF] LNWL: Opened wallet

Additionally, filterd on BTCN

2023-11-09 01:11:58.450 [DBG] BTCN: Received cfilter from 135.181.215.237:8333 (outbound)
2023-11-09 01:11:58.408 [DBG] BTCN: Received cfilter from 135.181.215.237:8333 (outbound)
2023-11-09 01:11:57.998 [DBG] BTCN: Received cfilter from 135.181.215.237:8333 (outbound)
2023-11-09 01:11:57.998 [DBG] BTCN: Received cfilter from 135.181.215.237:8333 (outbound)
2023-11-09 01:11:57.566 [DBG] BTCN: Sending getcfilters to 135.181.215.237:8333 (outbound)
2023-11-09 01:11:57.566 [DBG] BTCN: Fetching filters for heights=[814837, 815836], stophash=00000000000000000002c0528e1a208d3e9533e30428b652e0d7f07d9868531b
2023-11-09 01:11:57.563 [DBG] BTCN: Starting rescan from known block 814836 (00000000000000000000a6d3415f09ab5439ad2dd262c7840a4dded9be41271c)
2023-11-09 01:11:57.563 [DBG] BTCN: Waiting for the chain source to be current or for the rescan end height to be reached.
2023-11-09 01:11:57.563 [INF] BTCN: Canceling block subscription: id=2
2023-11-09 01:11:56.850 [DBG] BTCN: Rescan became current at 815922 (00000000000000000004627a6807c0cf862322d586e1c7ed1c1d83e993086f25), subscribing to block notifications
2023-11-09 01:11:56.850 [DBG] BTCN: Delivering backlog of block notifications: id=2, start_height=815922, end_height=815922
2023-11-09 01:11:56.850 [INF] BTCN: Registering block subscription: id=2
2023-11-09 01:11:56.850 [DBG] BTCN: Starting rescan from known block 815922 (00000000000000000004627a6807c0cf862322d586e1c7ed1c1d83e993086f25)
2023-11-09 01:11:56.849 [DBG] BTCN: Waiting for the chain source to be current or for the rescan end height to be reached.
2023-11-09 01:11:56.562 [DBG] BTCN: Received feefilter from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.562 [DBG] BTCN: Sending pong to 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.561 [DBG] BTCN: Received ping from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.561 [DBG] BTCN: Received unknown message from 156.34.170.87:8333 (outbound): received unknown message
2023-11-09 01:11:56.561 [DBG] BTCN: Received unknown message from 156.34.170.87:8333 (outbound): received unknown message
2023-11-09 01:11:56.561 [DBG] BTCN: Received sendheaders from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.473 [INF] BTCN: New valid peer 156.34.170.87:8333 (outbound) (/Satoshi:23.0.0/)
2023-11-09 01:11:56.473 [DBG] BTCN: New peer 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.472 [DBG] BTCN: Connected to 156.34.170.87:8333
2023-11-09 01:11:56.472 [DBG] BTCN: Received verack from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.472 [DBG] BTCN: Received sendaddrv2 from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.472 [DBG] BTCN: Sending verack to 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.472 [DBG] BTCN: Sending sendaddrv2 to 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.471 [DBG] BTCN: New time offset: 0s
2023-11-09 01:11:56.471 [DBG] BTCN: Added time sample of 0s (total: 17)
2023-11-09 01:11:56.471 [DBG] BTCN: Negotiated protocol version 70016 for peer 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.471 [DBG] BTCN: Received version (agent /Satoshi:23.0.0/, pver 70016, block 815922) from 156.34.170.87:8333 (outbound)
2023-11-09 01:11:56.315 [DBG] BTCN: Sending version (agent /btcwire:0.5.0/neutrino:0.12.0-beta/, pver 70016, block 815922) to 156.34.170.87:8333 (outbound)

Roasbeef commented 7 months ago

2023-11-09 01:11:57.563 [DBG] BTCN: Starting rescan from known block 814836 (00000000000000000000a6d3415f09ab5439ad2dd262c7840a4dded9be41271c)

Looks like it's pretty far back on the rescan, a bit over 20k blocks. So just fetching all that data might take some time. We optimized things a lot fetching wise in 0.17, but perhaps the nodes its connected to are starting to rate limit it?

gkrizek commented 7 months ago

@Roasbeef - Yes, I think it happens more often on nodes that have fallen behind significantly (from shutdowns or whatever).

When this happens does the rescan keep trying or stop? I can see in neutrino is connected to new peers, so is it still trying to rescan from those?

I think it would take some time but not 5+ days... A brand new neutrino node will go from 0 to fully synced in a hour. It seems like once it hits the errors in rescan like: [DBG] LNWL: Batch 0 timed out - it just stops completely.

Roasbeef commented 7 months ago

I think the issue here is that there isn't a ton of visibility to track the diff between: the current height neutrino is synced at vs the current wallet rescan progress. Synced to chain will return false until both those heights match exactly.

A brand new neutrino node will go from 0 to fully synced in a hour.

Yeah I can sync up a new neutrino node fully on my laptop in like < 5 mins, will be faster in 0.18 as well via the header side load and parallel download. Rescan takes long as it needs to:

fetch the filters
scan the filters w/ known addresses
maybe fetch blocks
repeat until at chain tip

AFAICT, it seems like batch timeout retry isn't working like it should. More debug logs for either BTCN or LNWL should help give more information. Also I'm not sure what p2p message rate limiting on bitcoind looks like these days, I know in the past they added upload limits, which when hit peers stop responding to requests.

gkrizek commented 7 months ago

@Roasbeef What else can I provide to help debug? The logs above are debug logs for BTCN and LNWL. Do I need to do trace? I'm happy to provide whatever else I can do get to the bottom of this.

gkrizek commented 7 months ago

Hey @Roasbeef pinging on this again as it's still an issue. Anything additional I can provide?

Roasbeef commented 7 months ago

@gkrizek the goroutine dump didn't tell much, can you try to obtain one directly from the profiling port? So the text based one.

Also as requested above, can you give us logs over a longer period of time? The logs above cover immediate start up, but with just that alone we can't tell if the rescan is making any more progress. trace would be helpful for BTCN and LNWL. AFAICT, rn it's just taking a while to rescan, but info so far hasn't been able to determine: if the rescan is going, if it errored out, etc.

Do these nodes also have access to a local node with lower network latency and dedicated capacity?

Roasbeef commented 7 months ago

Looking at this log trace we can see that each time a tor worker is started, it gets disconnected pretty much right after. In addition AFAICT, most of the peers are tor. Does each node have it's own tor daemon, or are they shared across a set of nodes? All the queries timeout, which seems to point to a networking issue.

gkrizek commented 7 months ago

Well... this is fairly odd but all of the nodes that were experiencing this are resolved after another restart. I was grabbing more logs and profiles and it started working. So perhaps it was networking related in Tor or something. I'll keep this open for just a bit as we continue to investigate. But if it turns out to just be networking I'll close.

I'll report back. Thanks.

lightningnetwork / lnd