Open vsbogd opened 2 years ago
@vsbogd
Are you trying to get a new Token ?
No
Are you manually unlocking the channel ?
Yes, that is what I need to do each time to make the application work again.
any rough idea on when you first saw this ?
First time I saw this issue in log on Sept 27 at 04:48:49,077 GMT. Since that time it happened three times AFAIK.
We haven't done a Daemon release in a very long time. I see the Java SDK still uses pay-per-use and not the concurrency/token feature; either way, the daemon should handle channel locks gracefully. I will try to reproduce this scenario.
@vsbogd, can you also point me to the client code? It might give more leads on possibly simulating the error.
Client code is here: https://github.com/vsbogd/sound-spleeter-proxy (README.md explains how to run it), but the error happens about once per 8000 client requests, so I don't think it is easily reproducible.
I think it is simpler to find the root cause by code analysis, and then one could write a unit test to reproduce the behavior. I see the error is raised in the following line: https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/escrow.go#L98 The first issue is that the underlying error is neither logged nor included in the returned error, so we don't have enough information to understand what happened. I would fix that first.
There are two places inside Lock which can lead to the error:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/lock.go#L36-L58
Taking into account that the lock actually happens, the most suspicious one is the CompareAndSwap call, which is essentially an ExecuteTransaction call:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L296-L325
Judging by "Transaction took" in the log, we can say CompleteTransaction was actually called. Btw, the transaction took 3 seconds, which looks like the timeout value:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L329-L385
As the transaction looks safe, I would suppose that the transaction was sent to the ETCD server and received by it, but the response was not received by the client because of a timeout. It looks like ETCD finished the transaction while the client got a timeout error, and that led to the inconsistency.
What do you think? Do the ETCD logs for this moment of time contain anything interesting?
Btw, I am not sure about the priority of this. Are we going to completely decommission escrow payment transactions, and are they already deprecated?
If so, maybe we can simply migrate the code to the tokens approach and finish the decommissioning. I cannot say when that will be possible, though, because we don't have the resources for doing this work.
To me this is a production issue, and I don't want anybody to have to do a manual workaround. Let's do this for now:
we can try simulating it with a test case as suggested and deep dive. I will post anything seen on etcd.
This issue started happening frequently on the sound-spleeter service instance, about twice per week.
In the client logs one can see that the client first receives the "cannot get mutex for channel: {ID: 14}" error, and after this the client receives "another transaction on channel: {ID: 14} is in progress".
In the server logs one can see "rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}" before the payment is completed.
Daemon version:
Client and server logs are below. Client log:
Server log: