Poor route selection

rkfg commented 2 years ago

Just had a situation yesterday when Blixt couldn't send a payment because of no liquidity on the route (and I repeated it 3 times after timeouts) while SBW did it in just two attempts (≈2 seconds). Background:

- I have Blixt connected to my own node via a private channel
- the payment was to the WalletOfSatoshi node using a static LNURL QR
- I monitored the attempts on my node with lntop
- the amount was quite small, 10k sats

The attempts went fine through my node (no link failure), they failed somewhere later on the route. After about 10 attempts the payment fails due to timeout, I tried again and then again without success. The total number of attempts should be close to 30. Then I started SBW (also connected to my own node via a private channel), scanned the same code, entered the same amount and from the lntop log I saw this:

- SBW tried to pay using my direct channel to WoS, which was drained down to the reserve limit; it failed
- SBW then tried to pay through ACINQ with a higher fee and succeeded

I know that the LN implementation in SBW is different and doesn't use lnd unlike Blixt. Probably it has different priorities for fee selection so it can sacrifice a few sats to make the payment faster and Blixt tries to save on fees but it results in very poor UX. There's a time_pref parameter in lnd 0.15.0, maybe it can help with this after your version is updated. Or something else can be tweaked to prefer more reliable paths instead of the cheapest ones. In my case the difference was just 5 sats when paid by SBW and Blixt tried to get 0 or 2 at most. I don't think 5 sats are worth the trouble when the wallet doesn't do its primary job.
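A rough sketch of the fee-versus-reliability trade-off described above. This is not lnd's actual scoring: the `Route` type, the success probabilities and the linear weighting are all made up for illustration, and `time_pref` here is just a hypothetical knob from -1 (cheapest) to 1 (most reliable).

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    fee_sat: int         # total routing fee in sats
    success_prob: float  # assumed chance that the payment succeeds (0..1)


def pick_route(routes, amount_sat, time_pref):
    """Pick a route by blending relative fee and failure risk.

    time_pref is a hypothetical knob: -1 means "fees only",
    1 means "reliability only". The blend is illustrative only.
    """
    if not -1.0 <= time_pref <= 1.0:
        raise ValueError("time_pref must be in [-1, 1]")
    weight = (time_pref + 1) / 2  # 0 = fees only, 1 = reliability only

    def cost(route):
        fee_part = route.fee_sat / amount_sat
        failure_part = 1.0 - route.success_prob
        return (1 - weight) * fee_part + weight * failure_part

    return min(routes, key=cost)


# Made-up numbers loosely modelled on the 10k sat payment above.
routes = [
    Route("direct WoS channel (drained)", 0, 0.05),
    Route("cheap route", 2, 0.10),
    Route("via ACINQ", 5, 0.90),
]
print(pick_route(routes, 10_000, -1.0).name)  # direct WoS channel (drained)
print(pick_route(routes, 10_000, 0.0).name)   # via ACINQ
```

With fees as the only criterion the drained zero-fee channel keeps winning; as soon as reliability gets any real weight, the 5 sat route is picked.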

Darth-Coin commented 2 years ago

If you have only ONE channel with your node, and your node does not find enough paths to forward that tx, where is the issue with Blixt? Open more channels on your Blixt, as we suggest: one with your node, one with the Blixt node, one with ZFR or any other well-positioned node. Then use MPP.

Using just ONE channel with your node, you are ALWAYS dependent on your own node's liquidity and paths. You are concentrating all the flow into just one pipe. Blixt has nothing to do with this; it is your node that couldn't forward the tx correctly. First learn how LN works and how to manage your node correctly. Yes, there are some issues with LN pathfinding in general, but that has nothing to do with Blixt, and especially not with your particular case.

rkfg commented 2 years ago

> If you have only ONE channel with your node, and your node does not find enough paths to forward that tx, where is the issue with Blixt?

That's how most mobile wallets operate, no? Why would I need more than one channel to my well connected node that itself has 50+ channels? I'm interested in using my own node, not 3rd parties. And as I said it works perfectly with SBW but they use their own LN implementation. Also just one channel to my node.

> Blixt has nothing to do with this; it is your node that couldn't forward the tx correctly.

Incorrect. Blixt does the route search, not my node. It's one of its value propositions compared to trampoline routing in Electrum, for example.

> First learn how LN works and how to manage your node correctly.

Thanks, I guess I have enough knowledge of it, been learning since 2021 down to the commitment structure, HTLC stages and onion routing. I think I can tell what component is responsible for the failures. In my case Blixt was trying to find the cheapest route using my channels which is suboptimal imo. The failures were not on my node, it's not a liquidity problem. The payments failed further down the route and I have nothing to do with it. What Blixt can do, however, is to raise the fee limit and time preference (but it's a newer option as I said so might require lnd update). Then it would choose a more expensive but working route instead of getting stuck in cheap but failing ones.

Darth-Coin commented 2 years ago

You are wrong in your assumptions. Yes, on a mobile node you don't need 50 channels, but at least 3, at most 5-6, are recommended for reliability. Never depend on one single pipe. Also, with more channels you can take advantage of MPP. Yes, Blixt is doing the route search, but if it ALWAYS has only your node as the 1st hop, all routes will depend on YOUR node, not on the Blixt search. It is your node that is giving the rest of the route, not Blixt. It is exactly like a network of pipes.

You need to learn more about how to manage your routing node. That is the problem. Choose your peers wisely if you want to use it as an LSP for your Blixt mobile.

So again, this is NOT a Blixt issue.

rkfg commented 2 years ago

Have you read my message at all? I told you I don't have these issues with another wallet, SBW, which is connected exactly the same way to my own node. I did several tests: Blixt struggles to find a route, SBW does it instantly even though that route might be slightly more expensive. But it DOES work. So yes, this IS a Blixt issue. Why disregard it if it clearly can be improved? My node is only responsible for those 50+ channels but the rest is out of my reach. And the failure was not on those channels, they have plenty of liquidity as I already said several times. Also, MPP should work even with one channel: I did payments with lncli and MPP and it sent multiple shards along the same route in parallel (all hops were identical).

Please don't derail the problem discussion and put the blame on me, it's unprofessional.

rkfg commented 1 year ago

Any ideas, @hsjoberg? It's still a big issue for those who want to use their own nodes. I suppose the core issue is lnd doing a poor job. What I see is that Blixt picks 1-2 peers next to my node and tries to pay via them, completely ignoring the rest and failing many times despite there being a lot of liquidity in some more expensive channels. It looks like this:

Blixt => my full node => peer A => some other nodes
Blixt => my full node => peer B => some other nodes

I set the max fee to 50% and it still didn't help. It appears to me that lnd uses the depth-first Dijkstra search algorithm instead of a more appropriate breadth-first, imo. It's more likely that a peer has low outbound liquidity further down the path if it failed a couple of times so it's probably better to switch to another peer instead of bashing this one for a minute and timing out.
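The "switch to another peer instead of bashing this one" idea can be sketched like this. It is only an illustration of the retry policy suggested above, not how lnd's pathfinding or mission control works; the peer names, fees and the `max_failures` cut-off are hypothetical.

```python
from collections import Counter


def next_peer(peers, failures, max_failures=2):
    """Choose the next hop after my node.

    peers: list of (name, fee_sat) tuples.
    Peers with the fewest recorded failures go first, ties are broken
    by fee, and a peer is dropped after max_failures failures.
    """
    candidates = [p for p in peers if failures[p[0]] < max_failures]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (failures[p[0]], p[1]))


def pay(peers, working, max_attempts=10):
    """Simulate payment attempts; returns the peers tried, in order.

    working: set of peer names whose downstream route has liquidity.
    """
    failures = Counter()
    attempts = []
    for _ in range(max_attempts):
        peer = next_peer(peers, failures)
        if peer is None:
            break
        attempts.append(peer[0])
        if peer[0] in working:
            return attempts
        failures[peer[0]] += 1
    return attempts


peers = [("peer A", 1), ("peer B", 2), ("peer C", 5)]
print(pay(peers, working={"peer C"}))  # ['peer A', 'peer B', 'peer C']
```

Here the more expensive but working peer is reached on the third attempt instead of after a minute of retries through the two cheapest ones.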

Tried bimodal pathfinding, same issue. I'm just trying a small amount like 19-250 sats at https://satoshis.place and Blixt never succeeds. SBW (that uses their own lightning implementation and as such, different pathfinding) pays very quickly after a couple of failed attempts. Both connected to my own full node that, as before, has plenty of liquidity in both directions, reasonable fees and even provides rough liquidity hints with max_htlc.

Also tried resetting the graph and syncing it again from scratch, it finally went through after a couple of timeouts. After the reset Blixt keeps harassing the Boltz route (which is known to have very poor outbound liquidity as they don't give any hints), doesn't even look at the other nodes/channels. Before the reset it was trying with a different node on every try.

djkazic commented 1 year ago

@rkfg try opening a channel to my node from your Blixt Wallet: https://amboss.space/node/028c589131fae8c7e2103326542d568373019b50a9eb376a139a330c8545efb79a

rkfg commented 1 year ago

Thanks, but the point is to use my own node and it should work regardless of other channels, not trying just one channel to a node that fails often. Otherwise I'd rather open to a big well connected node so that I always have both inbound and outbound through it.

djkazic commented 1 year ago

Have you tried using the rapid LN sync feature?

djkazic commented 1 year ago

It sounds like you maybe have an incomplete graph. Show a screenshot of your get network info output.

rkfg commented 1 year ago

[Screenshot: Screenshot_20231005-010322]

> rapid LN sync feature

Couldn't find such a setting. Where is it?

djkazic commented 1 year ago

Yeah that's your problem. You got tons of zombies. Update to the latest version. The feature is called scheduled LN channel sync.

hsjoberg commented 1 year ago

See the changelog for https://github.com/hsjoberg/blixt-wallet/releases/tag/v0.6.8 for more info.

rkfg commented 1 year ago

I have version 0.6.8 from Google Play, looks like it's the latest one. In "Show node data" both synced to chain and graph are true. Is that a bug then that it accumulates these zombie channels? I remember resetting the graph before and it loaded and worked fine for some time. I enabled scheduled LN channel sync but honestly this shouldn't be such a hard requirement that disabling this setting renders the wallet unusable and this state seems to be non-recoverable automatically, unless I reset the graph again.

rkfg commented 1 year ago

Ah, I remember why I turned this setting off: the startup time became even bigger than usual with a spinner "Syncing lightning network" which I see now that I restarted the app. I'm not sure what exactly it syncs because after opening the main activity there's still a sync icon and the channel isn't active until it disappears. Honestly, the startup time is enormously long.

Even after this sync I still have zombies, even more now: 74592.

djkazic commented 1 year ago

That means it's failing to do the gossip sync.

It's a critical part of Blixt that is supposed to stop LND from flagging a majority of the graph as unusable (zombies). I recommend fully stopping the app, clearing the cache, and then starting Blixt. After the sync you should have ~2k zombies.

rkfg commented 1 year ago

The cache was very small in my case, just 1.5 MB. The storage is 2.1 GB though, but clearing it would nuke everything. I cleared the cache and restarted the app, but I don't see any sync progress as the network info is the same.

djkazic commented 1 year ago

Sounds like something didn't work correctly. You are correct to not clear the data, please do not do that.

Please do the following procedure:

1. Force stop the Blixt app.
2. Clear the cache.
3. Launch Blixt.
4. The scheduled sync page should load.

Also, make sure you are on the latest version of the wallet.

rkfg commented 1 year ago

I did exactly that, I see the "Syncing Lightning Network" page, after that the app starts, the network info is still the same. The version is 0.6.8 as I stated above.

djkazic commented 1 year ago

OK, hold on. I'm doing a server-side refresh.

djkazic commented 1 year ago

Please try the procedure again.

rkfg commented 1 year ago

Now the number of zombies is zero and it looks like the graph is re-syncing. What changed exactly? Is Blixt dependent on some server I don't know about? I set my own node in the settings, I'd rather have zero dependencies on other servers if possible.

hsjoberg commented 1 year ago

> What changed exactly? Is Blixt dependent on some server I don't know about?

@rkfg Yes, there is a server involved in speedloader. See more info here (and also the 0.6.8 changelog as aforementioned): https://twitter.com/BlixtWallet/status/1674029478115266560

There is no hard dependency, no. However, unless your Blixt is always online* in order to receive gossip data, your channel graph database will degrade over time. This is a consequence of gossip messages being missed, which causes channels to be marked as unusable ("zombies") by lnd.
Lnd and Lightning were designed with the assumption that nodes would always be online.

*) We will introduce a persistent app mode in the next version 0.6.9 on Android. This is an alternative to speedloader as Blixt/lnd will always be online in order to receive gossip data. Some Android phone vendors also include an app pinning functionality that lets you achieve something akin to this.
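As a simplified illustration of how missed gossip turns into zombies: a channel whose last seen update is older than some prune window gets flagged. The 14-day window, the channel IDs and the single-timestamp model are assumptions for this sketch; lnd's real rules are more involved.

```python
from datetime import datetime, timedelta

# Assumed prune window for this sketch; not taken from this thread.
PRUNE_AFTER = timedelta(days=14)


def split_zombies(last_update_by_channel, now, prune_after=PRUNE_AFTER):
    """Split channels into (live, zombies) by age of their last update."""
    live, zombies = [], []
    for chan_id, last_update in sorted(last_update_by_channel.items()):
        if now - last_update > prune_after:
            zombies.append(chan_id)
        else:
            live.append(chan_id)
    return live, zombies


now = datetime(2023, 10, 5)
channels = {
    "chan-a": now - timedelta(days=1),   # update seen recently
    "chan-b": now - timedelta(days=20),  # updates missed while offline
}
print(split_zombies(channels, now))  # (['chan-a'], ['chan-b'])
```

A wallet that is only online for a few minutes a day misses most updates, so more and more channels end up on the zombie side even though they are perfectly usable.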

> I set my own node in the settings, I'd rather have zero dependencies on other servers if possible.

Understandable. We launched this initial version of speedloader server hardcoded in Blixt to use https://maps.eldamar.icu/mainnet/graph/graph-001d.db, https://maps.eldamar.icu/mainnet/graph/ and https://maps.eldamar.icu/mainnet/graph/MD5SUMS.

In the next version 0.6.9, we will have a setting for changing the speedloader server and also instructions on how to set it up yourself.

rkfg commented 1 year ago

Thank you for the clarification, there's no information about how this speedloader thing actually works except in your message. I get that it's a technical quirk but I think it should be explained explicitly because it's a centralizing feature. I hope there will be a way to run such a server on my own infra and specify it in the settings. It's also weird that SBW (the version that still supported LN) doesn't suffer from this.

djkazic commented 1 year ago

Anyone can run this, it's open source software:

You can even run your own in the future, we're just running the default one because you need a powerful CPU to calculate differential transfer deltas.

It's also an optional feature, using persistent mode also helps prevent graph degradation on Blixt.

Also, SBW does not use LND. So it isn't valid to compare the two directly on this. SBW very likely has a different approach to graph management / pruning.

djkazic commented 1 year ago

Just as an example of it being extensible, Zeus with embedded lnd runs their own primer server for speedloader capabilities. So rest assured it's not a black box

rkfg commented 1 year ago

Thank you! I think this method is better than the persistent mode because it would probably drain the battery if the phone can't sleep. I previously assumed that lnd already downloads these channel announcement deltas because it's in the log (applying gossipFilter start=...) so it looked similar to how bitcoind operates, querying the announcements it missed when it was offline. Turns out it's more fragile. Is it possible in theory to fix it in lnd itself so no external daemon is needed? Or does the LN spec in its current form prevent it?

djkazic commented 1 year ago

The problem isn't lnd per se, it's that lnd was designed for servers and we're running it inconsistently on a phone.

Gossip sync is eventually consistent, so while you could technically restore your zombies to zero organically by having it on all the time that's impractical for most users.

Therefore we compress the set of changes to the channel graph using speedloader: instead of your lnd needing to download all the deltas, you can now download one big delta and apply it.
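The general idea of "one big delta" can be sketched as keeping only the newest update per channel and applying that in a single pass. This is not speedloader's actual format or code; the tuple layout and the policy strings are hypothetical.

```python
def squash_updates(updates):
    """Collapse many updates into one delta: newest update per channel.

    updates: iterable of (chan_id, timestamp, policy) tuples.
    Returns {chan_id: (timestamp, policy)}.
    """
    latest = {}
    for chan_id, timestamp, policy in updates:
        if chan_id not in latest or timestamp > latest[chan_id][0]:
            latest[chan_id] = (timestamp, policy)
    return latest


def apply_delta(graph, delta):
    """Return a copy of graph with every newer entry from delta applied."""
    merged = dict(graph)
    for chan_id, (timestamp, policy) in delta.items():
        if chan_id not in merged or timestamp > merged[chan_id][0]:
            merged[chan_id] = (timestamp, policy)
    return merged


updates = [
    ("chan-1", 100, "fee=1"),
    ("chan-2", 105, "fee=3"),
    ("chan-1", 110, "fee=2"),  # supersedes the first chan-1 update
]
delta = squash_updates(updates)
print(delta)  # {'chan-1': (110, 'fee=2'), 'chan-2': (105, 'fee=3')}
print(apply_delta({"chan-3": (50, "fee=4")}, delta))
```

The phone then needs one download and one merge instead of replaying every intermediate update it missed while offline.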

djkazic commented 1 year ago

IIRC CLN has a similar piece of software that does this set compression, and Breez wallet bootstrapping is the basis for speedloader tech. It's become a common primitive for mobile lightning environments

rkfg commented 1 year ago

Ah, got it. Yeah, lnd usually sends updates at least once a day, or more often if the channel policy changes, so I'd need to run it at least 24 hours straight for all active channels to be updated, and even longer to catch up on the inactive ones that become active in the future. Interesting, I've been tinkering with LN for about 2 years now and didn't know about this problem. I expected the node to catch up right after connecting to some peers and querying for what it missed.

djkazic commented 1 year ago

That is exactly what it does. When you see that sync has stopped, that doesn't actually mean you have the full graph

It just means you don't have any additional nodes or edges that your peers know about (but your node doesn't). As a result you can still have an incomplete graph.

The way that zombie marking works is that you need to see an announcement for it to come back, and many times that can be missed with a delta of up to a day IME.

djkazic commented 1 year ago

So yes while in theory leaving it open would fix it, in reality it can often require even longer periods to organically self repair.

Unsurprising since most deployments of lnd run 24/7.

rkfg commented 1 year ago

Same problem again. I updated to 0.6.9 using the APK from the releases. Then checked out network info and it's just as bad:

[Screenshot: Screenshot_20231112-200952]

I force stopped Blixt, cleared cache, restarted it, no changes. I had speedloader enabled at all times, even though launching Blixt took 5+ minutes sometimes (hopefully now it should be faster thanks to the diffs). I'm not even sure if it works now because:

[Screenshot: Screenshot_20231112-201902]

I find this issue pretty critical if it accumulates like that over time and just kills pathfinding, especially if it's not even possible for the user to recover from this state. And even fewer people would notice it at all.

niteshbalusu11 commented 1 year ago

@rkfg can you clear the cache and restart Blixt once again and send the debug log for the session too?

djkazic commented 1 year ago

Looking into this now

djkazic commented 1 year ago

OK, interesting. This looks like an edge case around speedloader logging. Let me see if I can repro it.

rkfg commented 1 year ago

Stopped/cleared/restarted, no visible change:

[Screenshot: Screenshot_20231112-203736]

Log:


[Info]   Store: initialize done
  {"data":""}
[Info]   Store: Starting lnd, gossipStatus
  null
[Info]   Google: Initializing
  "android"
[Info]   Google: hasPlayServices
[Info]   Google: after
[Info]   GoogleDriveBackup: Starting channel update subscription for Google Drive channel backup
[Info]   Store: Current lnd state
  {"state":"WAITING_TO_START"}
[Info]   Store: Current lnd state
  {"state":"LOCKED"}
[Info]   Store: Current lnd state
  {"state":"UNLOCKED"}
[Info]   Store: Current lnd state
  {"state":"RPC_ACTIVE"}
[Info]   Lightning: Starting
[Info]   Channel: Starting channel update subscription
[Info]   OnChain: Initializing
[Info]   LightningBox: Initializing Lightning Box subsystem
[Info]   DeeplinkManager: New app state found
[Info]   DeeplinkManager: Checking deeplink
[Info]   ClipboardManager: checkInvoice
[Info]   ClipboardManager: try
  "804889"
[Info]   DeeplinkManager: New app state found
[Info]   Lightning: Recovery info
  {}
  false
  false
  0
[Info]   Lightning: Recovery either finished or not activated
[Info]   ClipboardManager: checkInvoice
[Info]   DeeplinkManager: New app state found
[Info]   DeeplinkManager: Checking deeplink
[Warning] NotificationManager: Post notification permission was denied
  "never_ask_again"
[Info]   Lightning: Setting up Autopilot
[Info]   Store: Current lnd state
  {"state":"SERVER_ACTIVE"}
[Info]   ClipboardManager: checkInvoice
[Info]   Lightning: Autopilot status:
  {}
[Info]   Receive: Transaction subscription started
[Info]   DeeplinkManager: New app state found
[Info]   DeeplinkManager: New app state found
[Info]   DeeplinkManager: Checking deeplink
[Info]   ClipboardManager: checkInvoice

djkazic commented 1 year ago

Are you running a hardened OS?

rkfg commented 1 year ago

No, just a stock one. Moto G7 Power.

djkazic commented 1 year ago

OK when you got the logs here:

[Info]   Store: Starting lnd, gossipStatus
  null

Was this after a cache clear and restart?

rkfg commented 1 year ago

Yes.

djkazic commented 1 year ago

OK it looks to me like speedloader didn't run at all then, as the gossipStatus value is set to null by default and then updated by the call to gossipSync()

djkazic commented 1 year ago

I'll do a server-side refresh now. Please do a cache clear and restart, then grab app logs like last time. Specifically interested in what gossipStatus is.

djkazic commented 1 year ago

OK just did a refresh.

djkazic commented 1 year ago

I reviewed some logs and the logging here is consistent with LN sync being disabled. Could you try disabling it, clearing cache, loading the app, then enabling it, clearing cache, and loading the app?

rkfg commented 1 year ago

I did this, now I get "Syncing Lightning Network" screen as before. After logging in, there are no zombie channels as well. Note, that after I disabled the setting, cleared the cache and launched the app, this checkbox somehow enabled itself back. So I disabled and reenabled it once again, cleared cache and restarted the app.

niteshbalusu11 commented 1 year ago

> I did this, now I get "Syncing Lightning Network" screen as before. After logging in, there are no zombie channels as well. Note, that after I disabled the setting, cleared the cache and launched the app, this checkbox somehow enabled itself back. So I disabled and reenabled it once again, cleared cache and restarted the app.

Thank you for helping us debug. The pathfinding issues should now be fixed. We'll look into why it was failing in the first place.

rkfg commented 1 year ago

I get this once a day in the log after I don't use Blixt for a while. Does it mean that the whole state is redownloaded fully every time? Loading the app takes about a minute or longer while it stays at the "Syncing Lightning Network" activity.

[Screenshot: Screenshot_20231116-121159]

niteshbalusu11 commented 1 year ago

It runs if the previous sync was more than 24 hours ago. In theory the startup time should be much shorter because it should only download the difference in graph state and not the whole graph; however, it doesn't seem to be working as expected. @djkazic is working on it.

djkazic commented 1 year ago

The patching limit is 2 days. If your graph is older than that, it will fall back to a full download.
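Putting the two numbers from this thread together (sync runs after more than 24 hours, patching works up to 2 days), the decision could look roughly like the sketch below. The function name, the return values and the assumption that "time since last sync" and "graph age" are the same thing are mine, not taken from the Blixt code.

```python
from datetime import timedelta

SYNC_INTERVAL = timedelta(hours=24)  # sync runs if the last one is older
PATCH_LIMIT = timedelta(days=2)      # beyond this, patching is not possible


def sync_action(graph_age):
    """Decide what to do at startup, given the age of the local graph."""
    if graph_age <= SYNC_INTERVAL:
        return "skip"           # recent enough, start right away
    if graph_age <= PATCH_LIMIT:
        return "patch"          # download only the diff
    return "full_download"      # too old, fetch the whole graph


print(sync_action(timedelta(hours=3)))   # skip
print(sync_action(timedelta(hours=30)))  # patch
print(sync_action(timedelta(days=5)))    # full_download
```

Under these assumptions, a wallet opened every few days always lands in the full-download branch, which would match the long "Syncing Lightning Network" screen reported above.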

rkfg commented 1 year ago

Thanks! Maybe increase it to a week at least? Or tell explicitly in the log that the gap is too big. I really wish it started as fast as possible so that I can at least accept payments if not send, and this sync continued in background. But if it patches the lnd database I suppose it shouldn't be running during this...

hsjoberg / blixt-wallet

Poor route selection #988