lightningnetwork / lnd

Lightning Network Daemon ⚡️
MIT License
7.65k stars 2.07k forks source link

lncli sendpayment/payinvoice hangs with no response when capacity is insufficient #536

Closed sangaman closed 6 years ago

sangaman commented 6 years ago

I've observed this behavior on multiple nodes so it isn't isolated. Calling lncli payinvoice or lncli sendpayment --pay_req when channel capacity is insufficient causes the commands to hang.

From the receiving node:

$ lncli addinvoice --value 100000
{
        "r_hash": "2a5630d4f1fdd01b45ade6194930a43d9ae0c4658993e998d970b8dc43066468",
        "pay_req": "lntb1m1pdydtu8pp59ftrp483lhgpk3dducv5jv9y8kdwp3r93xf7nxxewzudcscxv35qdqqcqzys9q062arzhw759uv6twpndw0dscrc3u7m8vxhdzf62gcacv3kdq2kqyha9tcleghxkt20yszlvp4efw44fv4fr6r6yj98d79pg7sjfqgqwjefzx"
}

Then trying to pay the invoice on the sending node using one of the commands above then causes the console to hang. I grabbed the following from the lnd output:

2017-12-29 21:12:47.722 [ERR] CRTR: Attempt to send payment 2a5630d4f1fdd01b45ade6194930a43d9ae0c4658993e998d970b8dc43066468 failed: TemporaryChannelFailure: insufficient capacity in available outgoing links: need 100002200 mSAT, max available is 52796000 mSAT
2017-12-29 21:12:49.450 [ERR] CRTR: Attempt to send payment 2a5630d4f1fdd01b45ade6194930a43d9ae0c4658993e998d970b8dc43066468 failed: TemporaryChannelFailure
Roasbeef commented 6 years ago

If you look at the logs for Mission Control, do you see any lingering activity? Atm, we don't have an upper bound on our number of attempts, which itself is a bug. If your machine is slow, or is doing a ton of other activities, then with the way our timeouts are considered, it may prune an edge, attempt another payment, then by the time the payment fails, that edge is unpruned again.

sangaman commented 6 years ago

I don't see Mission Control in the logs but I do see multiple lines of Attempt to send payment over ~2 minutes or so, I think that's about how long I waited before killing the command . One machine I observed this on was slow, the other was using modern hardware. If I see this again I'll let it run for a while and keep a close eye on the logs.

MrHash commented 6 years ago

I have been getting sending/routing hangs as well. I hadn't correlated it to a funding issue, but for example i sent small payments to htlc.me which worked then shortly after a larger payment which hangs. Subsequent attempts with lower amounts are also hanging until a service restart. queryroutes often hangs as well even with small amounts. I'm running this on a relatively slow machine also.

MrHash commented 6 years ago

Update: queryroutes did work with small amounts but took half an hour and returned 100s of results!

halseth commented 6 years ago

queryroutes and sendpayment are doing path finding a bit differently, #379 should fix a lot of the slowness seen in queryroutes.

sendpayment is using MissionControl, which should show up in the logs with prefix CRTR: if you are using debug logs :)

sangaman commented 6 years ago

@halseth good to know. I do see all the Attempt to send payment messages are prefaced with CRTR:. I haven't encountered this issue again yet but I'll update if I do.

vegardengen commented 6 years ago

Is this the same as this one? Totally reproducible here right now, will try not to change anything.

2018-01-07 12:43:01.441 [DBG] PEER: Sending UpdateAddHTLC(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, id=1274, amt=1910190 mSAT, expiry=1257881, hash=9e9405921e783ce87bf20f48d368d2bde55d61bf9e36248e5a87a72d171a3594) to 34.250.234.192:9735 2018-01-07 12:43:01.496 [DBG] PEER: Sending CommitSig(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, num_htlcs=0) to 34.250.234.192:9735 2018-01-07 12:43:01.517 [DBG] PEER: Received RevokeAndAck(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, rev=86e4e4f8b4375f8ca98079d1ad0837a2415c8b33dd247abc76c41d38de12861a, next_point=0373849f08f1aadf301eb419bd8fe624a8e1b1abedacead748154ea1f828f19242) from 34.250.234.192:9735 2018-01-07 12:43:01.532 [DBG] PEER: Received CommitSig(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, num_htlcs=0) from 34.250.234.192:9735 2018-01-07 12:43:01.542 [DBG] HSWC: ChannelPoint(30a1cb92dd366b90b78e32e6f9ac5ea5ac4134e3a74e613c9db3a8bc406a088c:0) forwarding 0 HTLC's 2018-01-07 12:43:01.551 [DBG] PEER: Sending RevokeAndAck(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, rev=b531d2b4195df9166a57afac02589ec0de7ac6f41f2cb0a8bb2477b2329a77ef, next_point=030ef265375207ee0869603c361607148d1e56d5a758632ff1af951326d478b7c6) to 34.250.234.192:9735 2018-01-07 12:43:01.571 [DBG] PEER: Received UpdateFailHTLC(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, id=1274, reason=20fc8dc09e23eff44ae3db4c74c02faca99c6bb8bb7321c5c476adecfae94c7e35cb60f132ecfb80c689f92845825c8192688b1c2bbf4e8ea5b4a617e27f871ea332a2a3e865ac11624a5ff04f6f479e7b65464ca6ab06100943e07ef0240dd49ab47a838f15aa1dc892c24f1e8134c39434a8ef039c2148670dabd1fcbde233fa7720cf4b850b81c2e27825511bac75c3531870733c31f837c82df4d352a8c2412b27fcdaaf328ed3a3d9a704c27b56ac7bb873aa6d579b6a0cdf8d5013f3737bb35ce3c53f53f19e3ca3b72ec11aa5ed9d9f02c113a23db91c4f456109c6967a363915a8b58eeacc790645fd8ed97c39ecbc50b33fd9cfa07b666d0ac64d92f5c25c819ba8aeff63ddee7ea1d9209c4a79aa64eef9f92efbf42da591404e91827fb26c) from 34.250.234.192:9735 2018-01-07 12:43:01.584 [DBG] PEER: Received CommitSig(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, num_htlcs=0) from 34.250.234.192:9735 2018-01-07 12:43:01.608 [DBG] PEER: Sending RevokeAndAck(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, rev=244f42d56d0436348840ad47931e6b0d4d2742694c4c22a68807db7c27ad6dc8, next_point=039699c265f40e3f32e6e503b87dd475dafc3a5a49097c05a9f28d915246b979ec) to 34.250.234.192:9735 2018-01-07 12:43:01.616 [DBG] PEER: Sending CommitSig(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, num_htlcs=0) to 34.250.234.192:9735 2018-01-07 12:43:01.640 [DBG] PEER: Received RevokeAndAck(chan_id=8c086a40bca8b39d3c614ea7e33441aca55eacf9e6328eb7906b36dd92cba130, rev=90c5e281bda7036d10e1dc6aaced454eac2632b098f5bfa0898d23e508bfd1e1, next_point=02eda842819df81919ca6651ad9c6397051790d5c589d6c6009bd6c623de9d1f72) from 34.250.234.192:9735 2018-01-07 12:43:01.650 [DBG] HSWC: ChannelPoint(30a1cb92dd366b90b78e32e6f9ac5ea5ac4134e3a74e613c9db3a8bc406a088c:0) forwarding 1 HTLC's 2018-01-07 12:43:01.650 [DBG] HSWC: Closed completed onion circuit for 9e9405921e783ce87bf20f48d368d2bde55d61bf9e36248e5a87a72d171a3594: (0:0:0, 3294) <-> (1256433:7:0, 1274) 2018-01-07 12:43:01.652 [ERR] HSWC: unable to de-obfuscate onion failure, htlc with hash(9e9405921e783ce87bf20f48d368d2bde55d61bf9e36248e5a87a72d171a3594): unable to decode error update (errors.errorString): TemporaryChannelFailure(update=(lnwire.ChannelUpdate)(0xc42079d3b0)({ Signature: (btcec.Signature)(0xc420bbf6e0)({ R: (big.Int)(0xc420b33320)(48832811375496015207959661304449105467071572793175724307506166905292457253441), S: (big.Int)(0xc420b33340)(109351511127532896365573795571033297586969600504322976730782758616365818856265) }), ChainHash: (chainhash.Hash) (len=32 cap=32) 2913000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f, ShortChannelID: (lnwire.ShortChannelID) 13238272:6422528:23121, Timestamp: (uint32) 467140609, Flags: (lnwire.ChanUpdateFlag) 144, TimeLockDelta: (uint16) 0, HtlcMinimumMsat: (lnwire.MilliSatoshi) 655360000 mSAT, BaseFee: (uint32) 655360000, FeeRate: (uint32) 0 }) ) 2018-01-07 12:43:01.652 [ERR] CRTR: Attempt to send payment 9e9405921e783ce87bf20f48d368d2bde55d61bf9e36248e5a87a72d171a3594 failed: TemporaryChannelFailure: unable to de-obfuscate onion failure, htlc with hash(9e9405921e783ce87bf20f48d368d2bde55d61bf9e36248e5a87a72d171a3594): unable to decode error update (errors.errorString): TemporaryChannelFailure(update=(lnwire.ChannelUpdate)(0xc42079d3b0)({ Signature: (btcec.Signature)(0xc420bbf6e0)({ R: (big.Int)(0xc420b33320)(48832811375496015207959661304449105467071572793175724307506166905292457253441), S: (big.Int)(0xc420b33340)(109351511127532896365573795571033297586969600504322976730782758616365818856265) }), ChainHash: (chainhash.Hash) (len=32 cap=32) 2913000000000933ea01ad0ee984209779baaec3ced90fa3f408719526f8d77f, ShortChannelID: (lnwire.ShortChannelID) 13238272:6422528:23121, Timestamp: (uint32) 467140609, Flags: (lnwire.ChanUpdateFlag) 144, TimeLockDelta: (uint16) 0, HtlcMinimumMsat: (lnwire.MilliSatoshi) 655360000 mSAT, BaseFee: (uint32) 655360000, FeeRate: (uint32) 0 }) ) 2018-01-07 12:43:01.652 [DBG] CRTR: Reporting edge 1381462693021941760 failure to Mission Control 2018-01-07 12:43:01.652 [DBG] CRTR: Mission Control returning prune view of 3 edges, 7 vertexes

halseth commented 6 years ago

@vegardengen Could you paste more of the logs following this? (you can use pastebin or something) Thanks!

vegardengen commented 6 years ago

I think this is a complete cycle. But ok, I'll put more. From start of payment until now. lnd.log

vegardengen commented 6 years ago

btw, this one I catched only because I have a cron script rsyncing the log directory...3 old log files is really not enough. Should be possible to tune or turn off old log pruning! When testing, you really like to be able to go back in time - days at least!

halseth commented 6 years ago

Was this log the result of trying to send the payment only once, or did you try several times?

From the logs it looks like it starts looking for a path for the payment, then keeps on doing this for several minutes. Would be very useful if you could try paying to a new invoice, and only trying it once while capturing the logs! Let it try for a while to see if it finishes :)

vegardengen commented 6 years ago

I tried doing it twice after a while, but only after minutes.

Can I easily cancel this loop? Will it disappear if I restart?

halseth commented 6 years ago

It is still looping from that same payment?

Yeah, try restarting, then see if you can reproduce with a new invoice.

vegardengen commented 6 years ago

Restarted, new invoice in loop...how long would you like me to wait?

halseth commented 6 years ago

give it ~5 minutes, then paste the logs 👍

What hardware are you running this on?

vegardengen commented 6 years ago

Running it on a digitalocean VPS. 3 GB memory, no CPU shortage. Even with this loop, there is still some idle CPU left....(even if lnd eats most of CPU here)

vegardengen commented 6 years ago

lnd.log

vegardengen commented 6 years ago

The VPS is dedicated to the lightning node.

halseth commented 6 years ago

It looks like route finding take so much time that pruned edges (that we try send a payment through, but fails) gets "unpruned" between attempts. Looks like we insert a pruned edge back into the graph after 5 seconds. Hence it will never converge on a graph where no routes are left to try. (roasbeef mentioned this scenario earlier)

If disk access is slow, I think that would be the main bottleneck.

@Roasbeef would know this better than I do.

vegardengen commented 6 years ago

Maybe have a counter for the edges, that says second time an edge is pruned, it'll have to wait 10 seconds?

vegardengen commented 6 years ago

...and so on?

vegardengen commented 6 years ago

and then decrease that again over time.