Withdraw not working after crashing the task a few times during withdraw

christianbrb commented 3 years ago

Thanks for filing a bug report :-)

Steps to Reproduce

Navigate to https://lightclient.raiden.network/staging
Make a withdraw
Kill the browser task during withdraw
Make this a few times
Make a withdraw without killing the browser task

Expected Result

Withdraw works

Actual Result

Withdraw does not work (Spinner keeps spinning and no MetaMask pop-up)
No console error
Logs: Raiden_dApp.log

[wait one night] -> Result from #2184 for completness

Transfer failed (Lock expired)
raiden.logs.txt

Additional Infos (Optional)

- Firefox 80.0.1

christianbrb commented 3 years ago

Maybe an issue with the Raiden hub client? @weilbith Could you please upload the logs of the Raiden hub of today? Thanks :)

weilbith commented 3 years ago

Here we go. Don't get confused by the date in the file name. Paul enlightened me how the log rotating works for the Python client. ^^ raiden-debug_2020-09-27T12:00:44.641239.log

christianbrb commented 3 years ago

@weilbith Thanks :) I don't see an error within the logs.

@andrevmatos Could you please check the logs? I have retried and it is still not working.

andrevmatos commented 3 years ago

Ok, I did read the logs from our side, but couldn't get anything on hub's side. Looks like, while the events on the LC end at ~7h11am, the logs on the hub start at 9h05am. @weilbith can you check to get the logs from the proper time range from the hub, please? From our side, what I could see that happened, is expected:

after some transfers and successful withdraws
LC was requested a new withdraw; it did persist it, receive a confirmation but was killed before signing/broadcasting the transaction
- This is expected, we do persist the confirmation, but don't retry the tx; it shouldn't be fatal, it's not an error to NOT send a withdraw tx, and just let it expire
A transfer was successfully made after that, indicating the clients were still in sync
Then a new withdraw was requested, while the previous one was still pending; it shouldn't be a big deal either, we do support parallel withdraw requests (total_withdraw is monotonic, so it should be safe to sign confirmations to partner's requests as long as their balance [not including pending withdraws] allows the requested withdraw), but looks like python didn't like receiving a new WithdrawRequest without the previous one either getting confirmed or expiring.
- @palango can you confirm raiden-py doesn't like the second withdraw request while the first one was still pending? Any reason why it shouldn't?
- An alternative explanation for the request not being accepted by raiden-py is that it took into account the balance minus the pending withdraw, while it should not subtract the pending value since withdraw is a monotonic value, and later, larger withdraws can supersede smaller ones, as long as partner has enough balance. We took some time to get it to work properly

Again, this doesn't look like an issue with persistence, and instead a protocol issue because we're more relaxed than raiden-py on constraints on handling and supporting parallel withdraws. Possible solutions include retrying pending withdraws for which we've received confirmations for, erroring on new withdraw request before succeeding or expiring previous one, or fixing raiden-py.

weilbith commented 3 years ago

@andrevmatos to make this more specific: you need the logs from the 27th of September 2020 including the messages before and after 7am? If so we have a problem: there are no log entries for that time. Doing a recursive grep with the format of the timestamp used in he logs I get no results (i.e. grep -R "2020-09-27 07:" ./). I tried this approach because the naming of the log files is confusing to me.

andrevmatos commented 3 years ago

@weilbith the logs on the LC side goes from 2020-09-28T06:44:14 to 2020-09-28T07:11:21. But anyway, it's very weird not to have logs available. Is there any kind of log rotation available? Isn't the hub running 24/7?

christianbrb commented 3 years ago

@andrevmatos I hope, that I haven't confused you. There two logs of the dApp for the two issues. I have attached the logs of the second issue to the description of this issue now. No update for the Hub from my side.

andrevmatos commented 3 years ago

No confusion, I'm checking the logs on this issue, and there's no time intersection between the logs from the dApp and the hub. Thore said there're no logs on the hub for this time frame, which is very weird, it's harder to debug a raiden-py issue without logs.

weilbith commented 3 years ago

@andrevmatos if you could give me the fingerprint of a SSH key you control, I could give you access to the hub and point you to the logs directory. I think that is more efficient.

palango commented 3 years ago

@andrevmatos Sorry for missing the comment. I'm looking into it and will update you as soon as I have looked into the python client

christianbrb commented 3 years ago

@palango to create an issue on the Raiden python repo to reproduce the issue in an integration test.

palango commented 3 years ago

@christianbrb Do you remember how big the withdraws were in relation to the channel balance and deposits?

palango commented 3 years ago

Issue in Python client: https://github.com/raiden-network/raiden/issues/6599

christianbrb commented 3 years ago

@palango The withdraw was 0.0001 TTT and the prior SetTotalDeposit was the same 0.0001 TTT.

I've analyzed this by looking at the transactions: https://goerli.etherscan.io/address/0xb808877b5ec745183bd60ce8c89aba83011d56f6

palango commented 3 years ago

Mh, that explains it. @andrevmatos

Channel has a balance of 1T
We're at the state before the withdraw that was interrupted. LC sends the withdraw request for 1T to the PC (python client)
The PC stores the withdraw in it's pending_withdraws list and sends the confirmation
The LC never goes on-chain, so the pending withdraw stays in the list.
A transfer is send and changes nothing to the pending withdraw (just the channel nonce)
Another withdraw request for 1T is sent.
The PC receives it
- It checks the channel balance. This return the deposits - max(sender.offchain_total_withdraw, sender.onchain_total_withdraw),
- This is 1T - 1T = 0T, as there's already a withdraw pending. This means no additional balance for withdraw is available and the request is denied.

@andrevmatos I don't see a problem in the python client here. How could this be handles differently?

andrevmatos commented 3 years ago

The problem IMO is that it shouldn't deduce pending offchain_total_withdraw from the deposits to calculate the balance: total_withdraw is a monotonic value; from what I can see, PC is including current offchain_total_withdraw in the value which it calculates as potential withdraw for partner; if I had a deposit - transferred of 5T (balance), a offchain_total_withdraw of 3T, I should be able request a new offchain_total_withdraw of 4T, since it's still <= 5, and this new request should supersede the previous of 3T.

andrevmatos commented 3 years ago

For this, balance and totalWithdrawable should be different values (and this is how we calculate it on the LC):

balance: called capacity on LC, used to calculate how much partner can transfer to/through us) must also deduce offchain_total_withdraw (pending), or else they could steal from us
totalWithdrawable (withdraw balance): used to calculate how much partner can withdraw, must not deduce current/pending/largest offchain_total_withdraw, since it's safe and acceptable for them to request a bigger monotonic value, as long as it doesn't overcome their current deposit + transfers to us

It's a sensible difference, but seems to be the issue here.

palango commented 3 years ago

That makes sense, so the PT should just sign the request again, because it doesn't do any harm. (as long as the partner hasn't lost so much balance that he cannot do it any more, but then PT shouldn't have accepted the payment before)

andrevmatos commented 3 years ago

Yes, exactly. As long as the request doesn't "steal" from PC, and this new pending request is accounted for future transfers, PC should accept and sign this request.

christianbrb commented 3 years ago

@andrevmatos @palango Thanks for the great joint analysis :)

I am closing the issue as it is getting solved within the python client (see raiden-network/raiden#6599)

raiden-network / light-client