nixd dies when making nix-cli/rpc calls.

ianw1974 commented 6 years ago

nixd becomes unresponsive after a certain amount of time when nix-cli/rpc calls are made. The timescale is not always the same: it can happen as early as 20 minutes after starting nixd. In most cases it fails 1 to 3 times in a 24-hour period; in some cases it can run up to two days without issues. The following types of calls are made:

Every 15 minutes: nix-cli ghostnode list full
Every hour: nix-cli -getinfo (can also be substituted with one of, or a combination of, getblockchaininfo and getnetworkinfo)

The number of calls being made is neither unreasonable nor excessive. The following behaviour has been noted:

1. When the problem starts, nix-cli commands get no response. When run from the console (Linux), a flashing cursor is displayed as if it is waiting for nixd to return the results, but they never arrive.
2. After a while, the behaviour from point 1 changes: instead of a flashing cursor waiting for a response, the following error appears:

nix-cli -getinfo
error: couldn't parse reply from server

nix-cli getblockchaininfo
error: couldn't parse reply from server

nix-cli getnetworkinfo
error: couldn't parse reply from server
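To tell the two stages apart from a script (no reply at all versus an error reply), the call can be wrapped in a time limit. This is only a rough sketch, assuming GNU coreutils timeout is installed; the function name probe_rpc and the 30-second limit in the example are made up for illustration and are not part of NIX:

```shell
#!/bin/bash
# Hypothetical helper (not part of NIX): run a command with a time limit
# and classify the outcome.
# Usage: probe_rpc <seconds> <command> [args...]
probe_rpc() {
    local limit="$1"; shift
    local out status
    # Capture output and exit status without tripping "set -e" in callers.
    out="$(timeout "$limit" "$@" 2>&1)" && status=0 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "hung"            # timeout(1) exits with 124 when the limit is reached
    elif [ "$status" -ne 0 ]; then
        echo "error: $out"     # e.g. "couldn't parse reply from server"
    else
        echo "ok"
    fi
    return "$status"
}

# Example (assumes nix-cli is on PATH and nixd is running):
# probe_rpc 30 nix-cli getblockchaininfo
```

With a wrapper like this, a cron job exits instead of leaving a stuck nix-cli process behind each time it fires.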

In debug.log the following errors start to appear:

2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s
2018-09-14 02:20:02 socket sending timeout: 1201s

which then causes the following messages to appear when nix-cli/rpc calls are made:

2018-09-14 09:30:01 WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
2018-09-14 09:30:01 WARNING: request rejected because http work queue depth exceeded, it can be increased with the -rpcworkqueue= setting
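For anyone who wants to check whether their own debug.log shows the same two symptoms, a quick count of both messages is enough. A minimal sketch; count_symptoms is a hypothetical name, and the path to debug.log depends on your data directory:

```shell
#!/bin/bash
# Hypothetical helper: count the two log messages quoted above.
# Usage: count_symptoms <path-to-debug.log>
count_symptoms() {
    local log="$1"
    local timeouts rejected
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    timeouts=$(grep -c 'socket sending timeout' "$log" || true)
    rejected=$(grep -c 'http work queue depth exceeded' "$log" || true)
    echo "send_timeouts=$timeouts queue_rejections=$rejected"
}

# Example (the path is an assumption; adjust to your datadir):
# count_symptoms ~/.nix/debug.log
```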

At this point, there is no way to shut down nixd safely; it can only be killed with kill -9.

The errors in debug.log relating to rpcworkqueue are misleading: increasing the value of this parameter in nix.conf doesn't resolve the problem. The problem is actually a locking issue in nixd. This problem has been seen before with Fixed Trade Coin and with Syscoin, and both fixed it by addressing the locking problem. The fix can be found on Syscoin's GitHub, in commits from around May 5. The following commits are related to this issue and contain the appropriate fix (this exact same behaviour was noted with both Fixed Trade Coin and Syscoin and reported to them through their Discord/Slack channels):

https://github.com/syscoin/syscoin/commit/dbe0afd572d8a71e3333b4a9d019a9af8877d0e5

https://github.com/syscoin/syscoin/commit/6f1e10f355617ca9d5d027c038ee1e8221351e26

Attached debug.log.

Platform: Ubuntu 16.04 x86_64. Compiled as per Nix Platform instructions.

debug.log

mattt21 commented 6 years ago

This issue has not been reproducible with the latest update. Can you expand on your environment?

ianw1974 commented 6 years ago

I expect perhaps you didn't wait long enough. As explained, there is no fixed timescale for this occurring. It's random: it can happen in a few hours, one day, or two days.

How to replicate:

Run nixd -daemon
Create scripts with the following:

/usr/local/bin/ghostnode_list.sh

#!/bin/bash
nix-cli ghostnode list

/usr/local/bin/nix-info.sh

#!/bin/bash
nix-cli -getinfo

The last script can call getnetworkinfo instead of -getinfo; it's irrelevant which.

Create cron with:

*/15 * * * *    root    /usr/local/bin/ghostnode_list.sh
*/60 * * * *    root    /usr/local/bin/nix-info.sh

Then wait for it to fail. This can be replicated on any server, so it's not specific to any environment.
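Since the failure time is random, it helps to have the cron scripts record when each call ran and how it ended, so the first failure can be matched against debug.log. A rough sketch under assumptions: log_call, the log file path, and the 60-second limit are all made up for illustration, and it relies on GNU coreutils timeout:

```shell
#!/bin/bash
# Hypothetical wrapper for the cron scripts above: run a command with a
# time limit and append one line per call to a log file.
# Usage: log_call <logfile> <seconds> <command> [args...]
log_call() {
    local logfile="$1" limit="$2"; shift 2
    local start end status
    start=$(date +%s)
    timeout "$limit" "$@" >/dev/null 2>&1 && status=0 || status=$?
    end=$(date +%s)
    # exit=124 means the time limit was hit, i.e. the daemon did not answer.
    printf '%s cmd="%s" exit=%d elapsed=%ds\n' \
        "$(date -u '+%Y-%m-%d %H:%M:%S')" "$*" "$status" "$((end - start))" >> "$logfile"
    return "$status"
}

# Example body for /usr/local/bin/nix-info.sh (log path is an assumption):
# log_call /var/log/nix-rpc-calls.log 60 nix-cli -getinfo
```

The first line with a non-zero exit value gives the approximate time to look for in debug.log.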

ianw1974 commented 6 years ago

Version tested and failed

nix-cli -getinfo
{
  "version": 2000300,

If you need more info, please say exactly what you require. The problem still exists, and the links in the original post above provide the fix, so it just needs to be applied to your codebase.
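For reference, the integer in the "version" field can be turned into a dotted version number. This sketch assumes NIX keeps the version encoding Bitcoin Core used at the time (1000000 * major + 10000 * minor + 100 * revision + build); that is an assumption about NIX, not something confirmed in this thread, and decode_version is a made-up name:

```shell
#!/bin/bash
# Hypothetical helper: decode a Bitcoin-Core-style integer client version.
# ASSUMPTION: NIX uses 1000000*major + 10000*minor + 100*revision + build.
decode_version() {
    local v="$1"
    echo "$((v / 1000000)).$((v / 10000 % 100)).$((v / 100 % 100)).$((v % 100))"
}

decode_version 2000300   # prints 2.0.3.0 under the assumption above
```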

mattt21 commented 6 years ago

Not reproducible so far while running for 1 day straight with faster cronjobs. The code you linked also does not provide any fixes for NIX; what those commits address is already fixed in NIX. You need to be more specific about your environment: how you are running NIX, what your conf reads, etc.

ianw1974 commented 6 years ago

I wrote above that it can take up to two days; the fact that you ignored this by testing for only 1 day just confirms it. Increasing the frequency won't make it fail any earlier, because it happens at random. I already explained my environment above. nix.conf only has the standard rpcuser and rpcpassword.

And the above links have the fix: Syscoin fixed it by fixing the locking issues, which is exactly this problem, but you fail to acknowledge it. Just like when I reported it via Discord. A waste of my time. You can close the issue; I'm not interested in helping fix this when you can't be bothered to read the above and test appropriately and properly.

mattt21 commented 6 years ago

I challenge you to look at our code and find where there is a locking issue that is fixed by the commits you provided (hint: you can't). You are not giving enough info on your problems. The fact that there are hundreds of ghostnodes running with zero issues, and many providers tracking diagnostics, proves that this issue is minor and due to certain environments that you seem to keep creating without giving enough info. You have not opened an issue that you can attempt to help solve for your own sake (linking code that is irrelevant to our codebase doesn't help anyone; if you don't believe me, take a few seconds to compare the code you linked instead of just linking it, if you can understand what it all is saying). Closed for lack of understanding of how to properly open an issue; the problem is your environment.

MNPJason commented 5 years ago

Hello NIX team, I am bringing this back to life as I am seeing the same results. The issue happens at random, anywhere from every 2 hours to every 2 days. I currently have to kill the daemon and reindex the whole wallet to keep the stats active on MasterNodes.Pro. We are looking into processing the data another way to lower the number of API calls to the daemon, but our current approach, which works with all other wallets, has an issue with this one. I thought you would like to know this.

koifinance / NixCore

nixd dies when making nix-cli/rpc calls. #16