mynodebtc / mynode

The easiest way to run Bitcoin and Lightning!
Other
649 stars 148 forks source link

LND failing to sync graph and peers [new] #619

Closed btcschellingpt closed 2 years ago

btcschellingpt commented 2 years ago

Describe the bug When MyNode starts, after Bitcoin has successfully started (always), the lnd service starts but main page states "Waiting" despite lnd.log showing sync complete, and at chaintip as new blocks issued. lncli getinfo shows all channels as inactive, no peers, and chain and graph sync as false .. despite lnd.log showing ongoing sync to chaintip as new blocks validated.

Expected behavior Expect lnd service to successfully complete startup, graph to be sync'd and all channels and peers to be active - as has been the case quite happily for the last several years. Successful completion of lnd service start then enables other lnd-dependent services to start (currently not occurring).

Screenshots Glances grabbed showing lnd service running - typically v low CPU - in shot higher so probs validating new block image

Desktop (please complete the following information):

MyNode hardware (please complete the following information):

Additional context Node became somewhat intermittent during January - but eventually lnd failed to start as above. Upgraded to v2.50 to determine if the upgrade would resolve the issue - it didn't. Began trouble shooting. Acquired new MicroSD card and flashed the currently downloadable image (v2.48) onto it; the prior MicroSD card in use since inception of this MyNode instance in Nov-19. Same behaviour exhibited under reflashed card.

Only the following applications are installed:

Have worked through a number of enquiry lines with g.petrov in tg group and attach additional log files:

I note that line 1083 of mynode.log indicates an error related to comms - which is kinda what this issue indicates - lnd is failing to communicate with the lightning network - not sure what it means/to do there though.

btcschellingpt commented 2 years ago

After thinking about this, shutdown the node, unplugged, re-plugged and then restarted and then collected the attached logs for the syslog and lnd

lnd.log syslog.log

tukac commented 2 years ago

This issue has some similarities with yours:

https://giters.com/rootzoll/raspiblitz/issues/2532

Run this and compare to issue 2532 to see if it is the same: sudo journalctl -n100 -fu lnd

tukac commented 2 years ago

If not the same try googling your syslog.log error to find similar issue:

"[lncli] rpc error: code = Unknown desc = waiting to start, RPC services not available"

btcschellingpt commented 2 years ago

@tukac Thanks - some similarities indeed although appears to be a different cause. In the blitz issue, the root cause seems to have been problems in the db. Here, it looks like there's a possible problem between lnd and it's ability to communicate with electrs via the secure electrs proxy which is in an endless loop of starting-failing to start .. thanks for the pointer - still digging into it

tukac commented 2 years ago

Yes, this: Jan 24 03:54:37 myNode mynode_gen_cert_electrs.sh[4727]: cp: cannot open '/home/bitcoin/.mynode/electrs/myNode.local.key' for reading: Permission denied Jan 24 03:54:37 myNode systemd[1]: tls_proxy.service: Control process exited, code=exited, status=1/FAILURE Jan 24 03:54:37 myNode systemd[1]: tls_proxy.service: Failed with result 'exit-code'.

Did you try to regenerate Electrum server certificate in settings?

btcschellingpt commented 2 years ago

Thanks for the suggestion - tried regenerating the server certs and restarting - that also to no avail.

tukac commented 2 years ago

Maybe try uninstalling Electrum server.

tehelsper commented 2 years ago

Jan 24 03:54:37 myNode mynode_gen_cert_electrs.sh[4727]: cp: cannot open '/home/bitcoin/.mynode/electrs/myNode.local.key' for reading: Permission denied This is unusual, but probably has nothing to do with LND.

I don't see anything obvious in the lnd log file. It looks like it's running pretty normally, except that all channels are inactive and myNode isn't detecting a good status for LND.

This issue may also be similar - https://github.com/mynodebtc/mynode/issues/624

When you re-flashed, did you change to 64-bit? Were you on 32 or 64 before?

How large is your channel db file and when did the issues start? I noticed the log says the last auto-compact time was on Jan 15th.

Edit: Dumb question - you're on 64-bit as seen in the log. Also, I removed the mynode log since it contained your product key.

btcschellingpt commented 2 years ago

@tukac - thanks but I flashed current 2.48 release onto new SD card so ALL apps are effectively re-installed

@tehelsper

  1. When I flashed onto new card I selected current (non-legacy) - so 64bit
  2. Current db size is 794Mb (see screenshot) and I recall seeing it at 980Mb
  3. Lightning node has been running happily since Nov-19 but has become erratic Dec-21/Jan-22

Next? I could try uninstalling, then re-installing electrs to see if that resolves the issue - but if I do so, following the uninstall, should I then remove the electrs folder and electrs files in the .bitcoin/mynode folder, before re-installing? Or does the uninstall do that? As it will take so long to rebuild the index, I would want to be sure there's no old config elements lying about :-)

I could try renaming the macaroon per #624 but see no reference to macaroons in the logs - although symptoms of issue are the same; only discernible difference is my instance is not running a watchtower. Thoughts?

Pursuing the comms-is-the-failure line (per electrs), would regenerating the TLS certification for LND help?

Selection_800

Thanks both!

tukac commented 2 years ago

Maybe it would be worth to try 32-bit 0.2.48 image sd card, in case transition from 32 to 64 bit was the cause...

tehelsper commented 2 years ago

@btcschellingpt Could you send a longer log? I noticed in the other issue that LND was actually restarting after a period of time.

If you could also watch the CPU and RAM usage via glances or htop, that may help as well in case it's getting killed due to memory issues or somthing.

btcschellingpt commented 2 years ago

@tehelsper

(1) Have extracted all logs per functionality and sent via tg. Looking at syslog.1, there appears to be a failure in the corsproxy for the btcrpc connection - in this log that appears to be retried over 1,000 times. A btc rpc failure would be consistent with lnd not having any frens to talk to and therefore no sync of graph etc.

(2) Resource metrics look pretty stable - I did kick off the upgrade to 2.50 and the shot below shows a day and a half of running since then - this is the point that the logs were extracted.

Selection_801

tehelsper commented 2 years ago

I've looked into the logs and corsproxy does have an issue, but it doesn't seem to be the cause of the issue. It's only for communication from Caravan to BTC. I'll keep looking.

tehelsper commented 2 years ago

Is it acting any differently now? Looking at the LND log, it seemed to be syncing blocks OK and wasn't logging any obvious errors.

btcschellingpt commented 2 years ago

Darn - was hopeful that the corsproxy was the cause - useful to know that it's only used for Caravan.

So the key observable aspect is that some lncli actions fail with RPC errors:

The LND message indicates that the service doesn't think it has finished startup - which is why no LND dependent services are starting - and indicates that there's an RPC issue. What I'm unclear on is if it's an RPC issue between lncli and lnd, or an RPC issue between lnd and bitcoind. More likely the former than the latter since the logs show that lnd is happily syncing to the chaintip as bitcoind recognises and validates each new block.

Which would indicate that this is more of a pure LND issue rather than a package issue - perhaps pursue this with lnd team?

Other option is to build new stack with identical bitcoind and lnd on new platform, close down lnd, and migrate the lnd instance to new platform... I have a new intel platform available but as this is also a change of cpu arch, it's not recommended by lnd team.

Any other ideas?

tehelsper commented 2 years ago

Unfortunately, now I'm less sure it's a communication issue with Bitcoin. I see a number of logs that show LND is seeing new blocks, but for some reason cannot indicate it is synced to the chain.

Can you run "lncli getinfo" again to see if "synced_to_chain" is true or false? Most mynode services rely on LND indicating that is set to true before starting up.

I think sharing the logs with the LND team would be the most helpful at this point. The issue seems internal to LND. I was also missing some of the lnd log so I've added a change to make sure more of the lnd log is included in the debug tarball.

btcschellingpt commented 2 years ago

Yep - "lncli getinfo" continues to show both syncd_to_chain and sycnd_to_graph as false - stubbornly consistent.

Ok - I'll head over to lnd team to see if they can assist - thanks for the steer.

btcschellingpt commented 2 years ago

@tehelsper I may have found something!

The two attached flask file logs show an error attempting to access the admin.macaroon - returns exit code 2 - and then subsequently fails to establish a connection - presumably b/c the authority provisioned by the admin.macaroon aren't supplied.

Could this be the cause, and if so, how to regenerate the admin macaroon?

flask.1.log flask.2.log

tehelsper commented 2 years ago

@btcschellingpt I doubt that is the root cause, but if that continued to cause errors, there may be a problem with the macaroons. Can you try running this command manually to see if an error is returned?

xxd -ps -u -c 1000 /mnt/hdd/mynode/lnd/data/chain/bitcoin/mainnet/admin.macaroon

If it's a macaroon issue, you should be able to delete all *.macaroon files from the /mnt/hdd/mynode/lnd/data/chain/bitcoin/mainnet/ folder along with the macaroons.db file and resdtart lnd or reboot. It should regen them on restart.

btcschellingpt commented 2 years ago

@tehelsper - the xxd command appeared to work successfully - at least no error msg - output attached. You were correct that removal of the macaroons made absolutely no difference.

Unfortunately at this point I think I have to start working through the disaster recovery steps in the builders guide. :-(

REMOVED

tehelsper commented 2 years ago

Yeah, if there is anything in there that helps, the lnd guys may have better ideas since the issue really seems to be with the channel state. Can you see the channel info on the LND page in myNode?

Also, I removed the link to the macaroon since that data is private. Probably not a big deal if you had reset your macarooons though.

btcschellingpt commented 2 years ago

Thanks for removing the xxd output.

Zero channel info on the LND page on myNode. I can see the channels listed from lncli - all inactive - so yes that's the same conclusion I've reached to. There was probably a hardware failure on the Pi at some point causing lnd to crash and with it, some data loss. Going to have to bite the bullet and ask everyone to force close :-( and then rebuild the lnd instance .. this time on HA hardware which it deserves given channel volume.

Thanks for all your help - let's close this issue.