lightningnetwork / lnd

Lightning Network Daemon ⚡️

Rebalancing will bloat latency times on a pi4 #6708

Open GordianLN opened 2 years ago

GordianLN commented 2 years ago

Background

Raspberry Pi 4 8GB, bare-metal setup with no services other than the minimum (no TH, RTL, mempool or whatever; bitcoind + lnd + electrum personal server only). I had synchronous postgres replication active, and I disabled even that (to no/negligible avail).

One of my services runs bos probe for every peer of mine in turn, each about 13 minutes after the other, cyclically; it keeps 2 days of logs and shows the average of the last (number of peers) runs. Changing nothing other than enabling or disabling my rebalancer (based on rebalance-lnd), the average latency to my peers ranges from 5s with no rebalancer to above 20s with the rebalancer at 5 threads.

During 5-thread rebalance runs, the 5-minute system load reaches 4.5 as per htop; during single-thread rebalance runs the 5-minute average load drops below 3. Regardless, latency times increase over time whenever even just one rebalance is happening. With postgresql, 8 GB of RAM, and a 1.8 GHz arm64 CPU I would expect no noticeable, or only minimal, effect on probe results.

Your environment

Steps to reproduce

Have from 1 to 5 rebalances running on a Pi4, keep track of bos probe latency times to all of your peers

Expected behaviour

Latency times do increase, but on the order of 10-15%, not 300% ...

Actual behaviour

Latency to channel peers goes from a little more than 5s on average to well over 20s on average.

Roasbeef commented 2 years ago

Hard to say what's actually going on without a profile: https://github.com/lightningnetwork/lnd/blob/a329c8061266418bccd797aa5d7d277c736ee930/docs/debugging_lnd.md#capturing-pprof-data-with-lnd. Another helpful link on obtaining a profile: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/. What you want here is to generate a CPU profile right when you trigger a rebalance, or when you poll to see the response time of peers.
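For reference (not something from this thread), pulling a CPU profile from lnd's standard Go pprof endpoint can be as simple as the following, assuming the profile option is enabled on port 9736 (any free port works):

```shell
# Capture 60 seconds of CPU samples while the rebalance is running.
# Assumes lnd was started with profile=9736 in lnd.conf (or --profile=9736).
curl -o rebalance.prof "http://localhost:9736/debug/pprof/profile?seconds=60"
```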

It's also worth noting that the underlying database supports a single writer at a time, a rebalance can generate a lot of failures (which means more writes), and a remote DB can add network latency in the process. A profile will show you exactly where time is being spent, which should help you debug this issue.

GordianLN commented 2 years ago

Do I need to use the /debug/pprof/profile endpoint for CPU-specific profiling, since the docs link you provided only shows the goroutine endpoint? Also, can I harmlessly leave the profile= line in lnd.conf even when I don't need it anymore? From my understanding it will not generate any load on the system unless I actually connect to that port and ask for a profile, right?
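For illustration, this is the kind of lnd.conf line I'm asking about; the port number is just an example:

```
[Application Options]
; pprof HTTP server; it should only do work when something queries it
profile=9736
```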

GordianLN commented 2 years ago

Wait, am I going to share sensitive data by attaching the CPU profile output here? I see a binary file and I don't have the tools to read it installed (nor would I probably be able to understand it if I had them).

guggero commented 2 years ago

No, there is no sensitive data in the CPU profile. You can use `go tool pprof <file.prof>` and then execute `web` to show the contents in a web browser.
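For example, a couple of non-interactive variants (the file name is a placeholder):

```shell
go tool pprof -top cpu.prof   # text list of the functions using the most CPU time
go tool pprof -web cpu.prof   # call graph in the browser (requires graphviz)
```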

GordianLN commented 2 years ago

No, there is no sensitive data in the CPU profile. You can use go tool pprof <file.prof> and then execute web to show the contents in a web browser.

OK, I trust you, Oliver. The files are self-explanatory: profile was taken while just being online, profilerebalance a short while after I started my 5-thread rebalance-lnd routine. profiles.zip

Roasbeef commented 2 years ago

Looking at the profile, it's spending most of the time in DB queries, or waiting for postgres to respond. Most of the time is actually spent in GetInfo, where it needs to fetch all the channels for a given node.

[Screenshot: Screen Shot 2022-07-07 at 12 19 03 PM]

Is the postgres server running on the same machine, or a remote one with possibly high latency?

One curious thing there is the call to withLock. AFAICT, that's in the stdlib package and looks to be grabbing an exclusive lock to generate the query.

Roasbeef commented 2 years ago

Another shot showing more context w.r.t. the path down:

[Screenshot: Screen Shot 2022-07-07 at 12 20 31 PM]

GordianLN commented 2 years ago

Oh wow... well, I do use lncli getinfo at least every minute in my own scripts, plus I have lndg set to fetch updates from the lnd DB every few seconds, but that is the baseline that happens no matter what. Before each call to rebalance-lnd, the rebalancer will run `lncli listchannels | jq -r '.[][] | .remote_pubkey, .pending_htlcs[] ' | grep -B 1 { | grep <pubkey>` to know beforehand if there are any outgoing HTLCs on that channel and skip launching the rebalance in that case. But really, everything that consistently happens for 99% of the time when rebalancing is active, over the baseline, is rebalance-lnd doing its job, and obviously postgresql doing its additional work.

I do run the DB on the same system, in a raid1 array on btrfs partitions, and those partitions are in fact encrypted with luks across two Kingston A400 SSDs, which are cacheless... but I doubt this is generating an overload for the processor; CPU usage by postgres as reported by atop is around 35%. Is withLock an issue that could prevent the DB from running smoothly?

I had set up everything for replication, but due to this incident I disabled replication, removed the replication slot from the main postgres instance, and dropped the second postgres instance running off a third SSD, so nothing related to replication should be happening anymore.

GordianLN commented 2 years ago

So, anything specific I can do or try on my side to get at this?

GordianLN commented 2 years ago

Could a watchtower over tor cause the high latency I see during rebalances?

Roasbeef commented 2 years ago

Rather than poll ListChannels, you could use the SubscribeHtlcEvents RPC call. The pipeline for that streaming RPC call never hits disk, and it would make sure you always have a consistent view of the set of HTLCs. You'd first subscribe to this call, then do ListChannels once to build up your initial view of the HTLCs. After the snapshot is obtained, you'll be notified of any other changes.
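Not part of this thread, but a rough Go sketch of that subscribe-then-snapshot pattern (gRPC connection and macaroon setup omitted; the event handling is simplified and the names are illustrative):

```go
// Package htlctracker keeps a live count of pending HTLCs per channel
// without repeatedly polling ListChannels. Sketch only.
package htlctracker

import (
	"context"
	"log"

	"github.com/lightningnetwork/lnd/lnrpc"
	"github.com/lightningnetwork/lnd/lnrpc/routerrpc"
	"google.golang.org/grpc"
)

func trackHTLCs(ctx context.Context, conn *grpc.ClientConn) error {
	router := routerrpc.NewRouterClient(conn)
	ln := lnrpc.NewLightningClient(conn)

	// 1. Subscribe first so no events are missed while the snapshot loads.
	stream, err := router.SubscribeHtlcEvents(
		ctx, &routerrpc.SubscribeHtlcEventsRequest{},
	)
	if err != nil {
		return err
	}

	// 2. Build the initial view from a single ListChannels call.
	chans, err := ln.ListChannels(ctx, &lnrpc.ListChannelsRequest{})
	if err != nil {
		return err
	}
	pending := make(map[uint64]int)
	for _, c := range chans.Channels {
		pending[c.ChanId] = len(c.PendingHtlcs)
	}

	// 3. Apply streamed events to keep the view current; this stream does
	// not cause extra DB reads the way repeated listchannels calls do.
	for {
		event, err := stream.Recv()
		if err != nil {
			return err
		}
		switch event.Event.(type) {
		case *routerrpc.HtlcEvent_ForwardEvent:
			pending[event.OutgoingChannelId]++
		case *routerrpc.HtlcEvent_SettleEvent, *routerrpc.HtlcEvent_ForwardFailEvent:
			pending[event.OutgoingChannelId]--
		}
		log.Printf("pending HTLCs on channel %d: %d",
			event.OutgoingChannelId, pending[event.OutgoingChannelId])
	}
}
```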

I do run the DB on the same system, in a raid1 array on btrfs partitions, and those partitions are in fact encrypted with luks across two Kingston A400 SSDs, which are cacheless... but I doubt this is generating an overload for the processor; CPU usage by postgres as reported by atop is around 35%.

If possible, you might want to consider investing in even just slightly beefier hardware. It could be a decade-old laptop or something; you wouldn't have to invest much to get hardware more powerful and reliable than a Raspberry Pi.

On the postgres side, it might be fruitful to look at any instrumentation available with the server, so you can see things like which queries are taking the longest. Alternatively, you might try reducing the number of active rebalance threads.
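For example, if the pg_stat_statements extension is enabled on that server, something along these lines lists the slowest queries (the database name is an assumption; the column is mean_exec_time on PostgreSQL 13+, mean_time on older versions):

```shell
# Top 10 queries by average execution time.
psql -d lnd -c "SELECT calls, round(mean_exec_time::numeric, 1) AS avg_ms,
                       left(query, 60) AS query
                FROM pg_stat_statements
                ORDER BY mean_exec_time DESC
                LIMIT 10;"
```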

The withLock thing was just a hunch; I haven't dug deeper yet to see if it was actually an issue or not.

Zooming out: with 0.16 we plan to start migrating heavily loaded sections of the DB over to pure SQL. This issue shows that we should also migrate the core channel state as well, since things like GetInfo and other scripts might end up rapidly polling that state. Another shorter-term solution here would be more client-side caching, which would directly cut down on latency, but would add more custom DB logic that might be error prone.

Do you know what your config options are for the max number of postgres connection slots?

GordianLN commented 2 years ago

you could use the SubscribeHtlcEvents RPC call. The pipeline for that streaming RPC call never hits disk, and it would make sure you always have a consistent view of the set of HTLCs.

My relationship with gRPC in PHP is not ideal. Rather, I don't have such a relationship at all, and learning to do gRPC from PHP is something I'd like to do, just not in the next few days. But I'd have other means to "fix" that without issuing an lncli listchannels: I can just run the lndg updater every 3 seconds. In any case, there is just one listchannels call before each rebalance-lnd cycle, which in itself runs for a few minutes, so I wouldn't think listchannels is a party pooper by itself.

you might want to consider investing in even just slightly beefier hardware

On my radar, I'm just not enthusiastic about upgrading hardware (and energy costs) while the Pi4 doesn't really sweat that much. In any case, I observed that even a single rebalance thread will hit the latency times.

On the postgres side, it might be fruitful to look at any instrumentation available with the server, so you can see things like which queries are taking the longest.

Duly noted

Do you know what your config options are for the max number of postgres connection slots?

By suggestion of @joostjager I have that limit set at 0 in lnd.conf

Roasbeef commented 2 years ago

By suggestion of @joostjager I have that limit set at 0 in lnd.conf

IIUC, setting a value of zero here actually means you'll have an unbounded number of connections. Depending on your postgres settings, this might be leading to some errors and more retried queries. Based on this material, I think setting an explicit value is more predictable from a memory/CPU consumption perspective.

GordianLN commented 2 years ago

I think setting an explicit value is more predictable from a memory/CPU consumption perspective.

I went to set a limit and actually discovered I had nothing set for db.postgres.maxconnections; since postgres.conf defaults to a maximum of 100, I have set it to 21 in lnd.conf for good measure.
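For anyone landing here later, that setting looks roughly like this in lnd.conf (the value 21 is just what I picked):

```
[postgres]
; keep lnd's connection pool well under postgres.conf's max_connections (default 100)
db.postgres.maxconnections=21
```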