Open vsbogd opened 2 years ago
@vsbogd
Are you trying to get a new Token ?
No
Are you manually unlocking the channel ?
Yes, that is what I need to do each time to make the application work again.
any rough idea on when you first saw this ?
First time I saw this issue in log on Sept 27 at 04:48:49,077 GMT. Since that time it happened three times AFAIK.
We haven't done a Daemon release in a very long time. I see the Java SDK still uses pay-per-use and not the concurrency/token feature; either way, the daemon should handle channel locks gracefully. I will try to reproduce this scenario.
@vsbogd, can you also point me to the client code? It might give more leads on possibly simulating the error.
Client code is here: https://github.com/vsbogd/sound-spleeter-proxy (README.md explains how to run it), but the error happens about once per 8000 client requests, so I don't think it is easily reproducible.
I think it is simpler to find the root cause by code analysis, and then one could write a unit test to reproduce the behavior. I see the error is raised in the following line: https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/escrow.go#L98 The first issue is that the underlying error is neither logged nor included in the returned error, so we don't have enough information to understand what happened. I would fix that first.
There are two places inside Lock which can lead to the error:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/lock.go#L36-L58
Taking into account that the lock actually happens, the most suspicious one is the CompareAndSwap call, which is essentially an ExecuteTransaction call:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L296-L325
Judging by "Transaction took" in the log, we can say CompleteTransaction was actually called. Btw, the transaction took 3 seconds, which looks like the timeout value:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L329-L385
As the transaction looks safe, I would suppose that the transaction was sent to the ETCD server and received by it, but the response was not received by the client because of a timeout. It looks like ETCD finished the transaction while the client got a timeout error, and that led to the inconsistency.
What do you think? Do the ETCD logs for this moment of time contain anything interesting?
Btw, I am not sure about the priority of this. Are we going to completely decommission escrow payment transactions, and are they already deprecated?
If so, maybe we can simply migrate the code to the tokens approach and finish the decommissioning. I cannot say when that will be possible, though, because we don't have the resources for doing this work.
To me this is a production issue, and I don't want anybody to have to do a manual workaround. Let's do this for now:
we can try simulating it with a test case as suggested and deep dive. I will post anything seen on etcd.
This issue started happening frequently on the sound-spleeter service instance, about twice per week.
In the client logs one can see that the client first receives the "cannot get mutex for channel: {ID: 14}" error, and after this the client receives "another transaction on channel: {ID: 14} is in progress".
In the server logs one can see "rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}" before the payment is completed.
Daemon version:
Client and server logs are below. Client log:
Server log: