v8.x RPC requests stall due to 4 concurrent connection limit

martinwguy commented 11 years ago

On current Debian, litecoin quickly becomes incredibly slow, taking 8 minutes to respond to a local "litecoind getinfo" request. I have 4 miner daemons on other machines: some get a fast response, others cannot even connect, since the RPC request times out after 30 seconds.

Though the machine is not fast (a 600MHz ARM with 512MB RAM), it is not out of resources: CPU is a constant 5 to 10%, there is little disk activity, it is not swapping, and there is just a little network activity. It's as if it were waiting for something, as if there were a maximum of N connections or threads, and RPC requests sometimes have to wait for one to be free, which might take up to 8 minutes.

While the unfortunate RPC clients are waiting, "debug.log" continues to report new transactions, assign work to miner daemons that had succeeded in connecting and so on.

This behaviour is new with the upgrade from 0.6 to 0.8.

wtogami commented 11 years ago

We have not done any development or testing on ARM. If this is a regression since 0.6.x then it is something that happened in Bitcoin's development between 0.6 and 0.8. Does bitcoind have the same issue on the same hardware? If so you need to report this issue at the Bitcoin github.

For the moment 0.6.9.2 is theoretically still compatible with the 0.8.3.7 network. The alert is because we cannot guarantee the safety of 0.6.x nodes and we have no intention to backport any fixes to 0.6.x.

wtogami commented 11 years ago

Also be advised that it is unwise to run litecoind on such slow hardware. Your miners are at a disadvantage with an increased likelihood of orphaned blocks due to the slowness of your node.

wtogami commented 11 years ago

Are you using getwork while this problem is happening? There is talk of Bitcoin 0.8.2+ having slow or deadlocked RPC while getwork is in use. Apparently the case is that most pools do not use getwork anymore and have all moved on to GBT. Thus getwork is slated for complete removal perhaps in Bitcoin 0.9, and if that happens litecoin will be removing it too.

If you really want to test this theory, we have a somewhat working 0.8.1 Litecoin branch that was aborted a while ago. It might be compatible with the current network, we aren't really sure though. It would help to verify this issue though. You want to try this?

martinwguy commented 11 years ago

Hi! It was observed using "litecoind getinfo" or any other command on the same host. In other time periods the return is immediate, so it seems to be phase-of-moon-dependent. I am not able to reproduce the failure at will; it seems to come and go, but when it is happening it is persistent. I see that another user reported the same symptom: https://bitcointalk.org/index.php?topic=57445.0 This was on 32-bit Debian ARM stable ("wheezy").

I have now moved to getblocktemplate using P2Pool on a different server and don't experience the problem any more. Obviously, if you can't reproduce the symptom, it will be impossible to fix. The only clues I have are that it was never experienced using 0.6.* on the same host under the same workload, that the host is not cpu-bound, memory-bound, network-bound or swapping, and that existing RPC connections using getwork seemed to continue working. The host hasn't demonstrated similar behaviour with any other programs since I started using it in 2008.

...unless you think it might be the presence of getwork capability that makes all RPC commands subject to long waits...

If I can find out how to reproduce the problem reliably I'll write again, but unless we can provoke it at will it may be too difficult to test.

If I can be of further assistance please write again

martinwguy commented 11 years ago

Sorry, to be clear: yes, it was being called by three remote miners that used getwork (pooler-cpuminer), one in switzerland, one in germany, one elsewhere in the same country as the daemon. No local miners were running (it was just the wallet hub for fast miners)

M

big-big-big-yoshi commented 11 years ago

I also see this issue.

To reproduce: open 4 RPC clients to litecoind and send getinfo on each, but do not close the connection after the response. The fifth client will hang, as will all subsequent connections. Currently established connections appear to continue to work properly.

Prior versions do not show this behavior.
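
The reproduction steps above can be sketched as a small client script. This is only an illustration, not something posted in the thread: the function names are hypothetical, and the host, port and credentials are placeholders that must match your own litecoin.conf. It assumes the daemon speaks JSON-RPC over HTTP with Basic auth and keeps HTTP/1.1 connections alive.

```python
import base64
import http.client
import json


def build_getinfo_body(request_id):
    """Serialize a JSON-RPC `getinfo` call (hypothetical helper)."""
    payload = {"jsonrpc": "1.0", "id": request_id, "method": "getinfo", "params": []}
    return json.dumps(payload).encode("utf-8")


def auth_header(user, password):
    """HTTP Basic auth value built from the rpcuser/rpcpassword pair."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def hold_open_connections(host, port, user, password, count, timeout=30):
    """Send getinfo on `count` connections and return them still open."""
    held = []
    for i in range(count):
        conn = http.client.HTTPConnection(host, port, timeout=timeout)
        conn.request("POST", "/", body=build_getinfo_body(i), headers={
            "Authorization": auth_header(user, password),
            "Content-Type": "application/json",
            "Connection": "keep-alive",
        })
        conn.getresponse().read()  # consume the reply, but do not close
        held.append(conn)
    return held
```

Calling `hold_open_connections("127.0.0.1", 9332, "user", "pass", 4)` (placeholder values) and then issuing a fifth `getinfo` from another client should show the stall if the node is affected. The fifth request would be expected to run into the client timeout rather than return.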

wtogami commented 11 years ago

"prior versions" mean the ancient 0.6.x codebase?

As noted above, there seems to be a known bug added in bitcoin-0.8.2+ where getwork causes RPC failure. If you want to test this theory, try the "devhistory-0.8.1-aborted" branch, which was an early but aborted attempt to rebase Litecoin onto bitcoin-0.8.1. I think it will be compatible with the network, although it may not be compatible with wallets created by 0.8.3.x. Note that 0.8.1 is not safe to operate. It would just be an interesting test.

big-big-big-yoshi commented 11 years ago

"prior versions" mean the ancient 0.6.x codebase? -> yes

I don't believe what I am seeing is directly related to getwork as I can recreate the issue on my test network with 4 clients issuing a simple getinfo and not closing the connection. The fifth will stall.

If I test while running pushpool - then this pushpool connection counts as one, and it only takes three more to encounter the stall.

Judging from what I see in a pcap - the fifth connection is accepted and I can see that the request is issued - there is simply no response from litecoind.

In the debug log - I see the fifth getinfo come into the system - just no response.

ThreadRPCServer method=getinfo
keypool reserve 2
keypool return 2

big-big-big-yoshi commented 11 years ago

I built and deployed devhistory-0.8.1-aborted into my test network.

It behaves like the current 0.8.3.7 release with respect to the stall.

big-big-big-yoshi commented 11 years ago

I think the code creates 4 threads for the RPC connections by default. It will accept a new connection even if all four are busy, giving the impression of stalled requests.

for (int i = 0; i < GetArg("-rpcthreads", 4); i++)

When I recompiled with the above set to:

for (int i = 0; i < GetArg("-rpcthreads", 64); i++)

as expected, it takes much longer to encounter the issue.
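
Why a larger thread count only delays the stall can be shown with a toy model. The sketch below is not the litecoind code. It assumes only what is described above: a fixed number of worker threads, each bound to one connection until that connection closes, and a listener that keeps accepting regardless. All names are hypothetical.

```python
import queue
import threading


class Connection:
    def __init__(self, keepalive):
        self.keepalive = keepalive
        self.answered = threading.Event()  # set once the first reply is sent
        self.closed = threading.Event()    # set when the client hangs up


def worker(accepted):
    """One RPC thread: serve a connection until it closes, then take the next."""
    while True:
        conn = accepted.get()
        if conn is None:        # shutdown sentinel
            return
        conn.answered.set()     # reply to the first request
        if conn.keepalive:
            conn.closed.wait()  # thread stays bound to the idle connection


def simulate(threads, keepalive_clients, wait=0.2):
    """Return True if one extra client gets a reply within `wait` seconds."""
    accepted = queue.Queue()    # accepting never blocks, like the listener
    pool = [threading.Thread(target=worker, args=(accepted,)) for _ in range(threads)]
    for t in pool:
        t.start()
    held = [Connection(keepalive=True) for _ in range(keepalive_clients)]
    for c in held:
        accepted.put(c)
    for c in held:
        c.answered.wait(wait)   # let the idle clients grab their threads
    extra = Connection(keepalive=False)
    accepted.put(extra)         # accepted, but is anyone free to answer?
    served = extra.answered.wait(wait)
    for c in held:              # clean up: hang up and stop the workers
        c.closed.set()
    for _ in pool:
        accepted.put(None)
    for t in pool:
        t.join()
    return served
```

With 4 threads, three idle keepalive clients leave one thread free and the extra request is answered. A fourth idle client starves it. Raising the pool to 64 moves the threshold without removing it.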

I tried to set rcpthreads from the command line as follows, but this didn't work for me; the recompile did.

/home/client/TEST_NETWORK/bin/litecoind -datadir=/home/client/TEST_NETWORK/datadir -testnet -addnode=testnet.litecointools.com --rcpthreads=64 --daemon

/home/client/TEST_NETWORK/bin/litecoind -datadir=/home/client/TEST_NETWORK/datadir -testnet -addnode=testnet.litecointools.com -rcpthreads=64 --daemon

wtogami commented 11 years ago

Bitcoin dev said ... "if your mining chews up your 4 threaded connections with keepalives it will run great but block your rpcs." and "older versions didn't support keepalive".

So this explains why so few people are reporting any problems. The above workaround of increasing the limit may work if you use an expected number of concurrent keepalive connections.

Not sure where you got --rcpthreads from. That or the correctly spelled --rpcthreads doesn't exist in the source at all.

big-big-big-yoshi commented 11 years ago

Yep - typo it should be rpcthreads - which exists in bitcoinrpc.cpp line 845.

I imagine that if it had been correctly specified on the command line, the initial reporter would have been able to avoid the stalls altogether.

A well-behaved server could refuse a new connection when out of resources or, better yet, return a meaningful error such as 503 Service Unavailable. It should not accept a connection, take a request and return nothing...

It's a bug, but also an avoidable issue: up the thread count if you encounter it.
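
The "answer 503 instead of going silent" behaviour suggested above might look like the following minimal sketch. It is not taken from the Bitcoin or Litecoin source. `WorkerSlots` and `handle` are hypothetical names, and a non-blocking semaphore stands in for "is an RPC thread free?".

```python
import threading


class WorkerSlots:
    """Tracks free RPC workers and rejects requests when none is available."""

    def __init__(self, total):
        self._slots = threading.BoundedSemaphore(total)

    def handle(self, serve):
        """Run `serve()` if a worker is free, else answer 503 immediately."""
        if not self._slots.acquire(blocking=False):
            return 503, "Service Unavailable"
        try:
            return 200, serve()
        finally:
            self._slots.release()
```

The point of the design is that the client always gets some reply: either the result, or an error it can act on by retrying or backing off, rather than waiting for a timeout.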

wtogami commented 11 years ago

This needs to be solved in upstream Bitcoin. As noted in the below chat, they have reasons for this not being considered a priority to fix.

Freenode #bitcoin-dev
warren: You folks getting complaints from vendor and/or pools about the 4 RPC thread keep alive limit? Let's think about a real solution for that for bitcoin-0.9 ...
gmaxwell: No. The only person whos commented on it is doublec. (only pool at least) Most modern pools don't operate in a way where it matters.
warren: gmaxwell had a simple idea along the lines of refusing keep alive for the 4th and last connection. That would seem to work, as earlier versions of bitcoin lacked keepalive. But a limit of 4 seems arbitrary... although I do recognize there must be a limit.
gmaxwell: warren: we waste a bunch of ram (and a ton of VM) for per-thread stacks and heap.
warren: would your earlier idea be good enough?
gmaxwell: probably. I haven't seen any cases where people really had to keep more than three keepalives up. I don't know that it would be trivial to implement though.
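
The idea mentioned in the chat, refusing keep-alive on the last available thread, comes down to a small decision made when a reply is sent. The function below is a hypothetical sketch of that rule only. It is not from the Bitcoin source, and the parameter names are assumptions.

```python
def connection_header(busy_threads, total_threads, client_wants_keepalive):
    """Pick the Connection header for a reply sent by one RPC thread.

    `busy_threads` counts threads currently bound to a connection,
    including the one sending this reply.
    """
    is_last_free = busy_threads >= total_threads
    if client_wants_keepalive and not is_last_free:
        return "keep-alive"
    return "close"  # frees this thread as soon as the reply is written
```

Under this rule, at most `total_threads - 1` threads can ever be held by idle keepalive clients, so one thread is always returned to the pool after each request.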

martinwguy commented 11 years ago

Naive suggestion: when a keepalive connection is instantiated, fork another thread for it, leaving 4 non-keepalive threads for regular requests. Of course, I have no idea how easy that would be to implement...

wtogami commented 11 years ago

Please make suggestions to Bitcoin.

coding-idiot commented 10 years ago

I am facing the same issue with the latest version of litecoin: after 3-4 connections, the next connection just times out.

Also, I didn't find any keepalive or rpcthreads command-line argument. Is there one? If yes, how do I set the keepalive time?

I have fairly good hardware, as reported initially: Win 7 64-bit with a Core i5 and 8GB RAM

but as far as I can see, it's hard-coded to serve up to only 4 threads.

Anyway, that was about 6 months ago. Any update since?

wtogami commented 10 years ago

This is still an issue in the latest version of Bitcoin.

ghost commented 10 years ago

Do you mean this was tested and still persists in 0.9?

wtogami commented 10 years ago

Yes.

litecoin-project / litecoin

v8.x RPC requests stall due to 4 concurrent connection limit #67