lightninglabs / lndmon

🔎lndmon: A drop-in monitoring solution for your lnd node using Prometheus+Grafana
MIT License

Getting ResourceExhausted errors #48

Closed torkelrogstad closed 4 years ago

torkelrogstad commented 4 years ago

When running lndmon against a testnet node, I'm seeing the following logs:

lndmon_1      | 2019-11-12 13:04:38.461 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:04:58.319 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:05:19.721 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:05:38.079 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:06:01.603 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:06:18.419 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:06:40.344 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:07:04.836 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:07:36.522 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:07:50.300 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)
lndmon_1      | 2019-11-12 13:08:16.318 [ERR] WALT: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)

Is this related to https://github.com/lightningnetwork/lnd/pull/2374? I'm not familiar with how lndmon connects to lnd, but could it be that the connecting client needs to bump the max message size?

guggero commented 4 years ago

This is exactly the same issue. But lndmon uses the same 50 MB limit that lncli uses. So far only the describegraph command hit that limit before it was increased.
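For reference, the two numbers in the error are byte counts: the size of the response lnd sent and the client's receive limit. A minimal sketch, assuming only the error text format shown in the logs above, that pulls both numbers out and shows the limit is exactly 50 MiB (parseSizes is a hypothetical helper, not part of lnd, lndmon, or grpc):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizeRe matches the "(<received> vs. <limit>)" tail of the gRPC error text.
var sizeRe = regexp.MustCompile(`\((\d+) vs\. (\d+)\)`)

// parseSizes is a hypothetical helper that extracts the received message size
// and the client's receive limit (both in bytes) from the error text.
func parseSizes(msg string) (received, limit int64, err error) {
	m := sizeRe.FindStringSubmatch(msg)
	if m == nil {
		return 0, 0, fmt.Errorf("no size pair found in %q", msg)
	}
	if received, err = strconv.ParseInt(m[1], 10, 64); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.ParseInt(m[2], 10, 64); err != nil {
		return 0, 0, err
	}
	return received, limit, nil
}

func main() {
	msg := "rpc error: code = ResourceExhausted desc = grpc: " +
		"received message larger than max (61820513 vs. 52428800)"

	received, limit, err := parseSizes(msg)
	if err != nil {
		panic(err)
	}

	const mib = 1024 * 1024
	fmt.Printf("limit:    %d bytes = %d MiB\n", limit, limit/mib)
	fmt.Printf("received: %d bytes, %d bytes over the limit\n",
		received, received-limit)
}
```

So the response here is roughly 9.4 MB over the 50 MiB limit.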

Do you have a huge number of on-chain transactions or unspent coins? Can you check whether lncli listunspent or lncli listchaintxns gives you the same error?

torkelrogstad commented 4 years ago

$ lncli --network testnet listchaintxns
[lncli] rpc error: code = ResourceExhausted desc = grpc: received message larger than max (61820513 vs. 52428800)

That seems to be it! I'm not sure how I managed to rack up such a large amount of data for on-chain TXs, though...

Is there any workaround for this?

guggero commented 4 years ago

Unfortunately, this can't be fixed very quickly. The code that connects to lnd is taken from the loop repo, so we need to increase the limit there and then update the dependency in lndmon.

If you compile from source, you can replace the call to loop.NewBasicClient with the following code:

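// maxMsgRecvSize raises the gRPC client's receive limit to 100 MiB
// (lncli and lndmon currently use 50 MB).
var maxMsgRecvSize = grpc.MaxCallRecvMsgSize(1 * 1024 * 1024 * 100)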

func NewBasicClient(lndHost, tlsPath, macDir, network string, basicOptions ...BasicClientOption) (
    lnrpc.LightningClient, error) {
    if tlsPath == "" {
        tlsPath = defaultTLSCertPath
    }

    // Load the specified TLS certificate and build transport credentials
    creds, err := credentials.NewClientTLSFromFile(tlsPath, "")
    if err != nil {
        return nil, err
    }

    // Create a dial options array.
    opts := []grpc.DialOption{
        grpc.WithTransportCredentials(creds),
    }

    if macDir == "" {
        macDir = filepath.Join(
            defaultLndDir, defaultDataDir, defaultChainSubDir,
            "bitcoin", network,
        )
    }

    // Starting with the set of default options, we'll apply any specified
    // functional options to the basic client.
    bco := defaultBasicClientOptions()
    bco.applyBasicClientOptions(basicOptions...)

    macPath := filepath.Join(macDir, bco.macFilename)

    // Load the specified macaroon file.
    macBytes, err := ioutil.ReadFile(macPath)
    if err == nil {
        // Only if file is found
        mac := &macaroon.Macaroon{}
        if err = mac.UnmarshalBinary(macBytes); err != nil {
            return nil, fmt.Errorf("unable to decode macaroon: %v",
                err)
        }

        // Now we append the macaroon credentials to the dial options.
        cred := macaroons.NewMacaroonCredential(mac)
        opts = append(opts, grpc.WithPerRPCCredentials(cred))
    }

    // Raise the receive limit whether or not a macaroon was loaded.
    opts = append(opts, grpc.WithDefaultCallOptions(maxMsgRecvSize))

    // We need to use a custom dialer so we can also connect to unix sockets
    // and not just TCP addresses.
    opts = append(
        opts, grpc.WithDialer(
            lncfg.ClientAddressDialer(defaultRPCPort),
        ),
    )
    conn, err := grpc.Dial(lndHost, opts...)
    if err != nil {
        return nil, fmt.Errorf("unable to connect to RPC server: %v", err)
    }

    return lnrpc.NewLightningClient(conn), nil
}

torkelrogstad commented 4 years ago

I had to copy a bit more code from the loop repo, but that worked like a charm! Thanks for helping out. Feel free to close this issue if you consider it resolved.

guggero commented 4 years ago

Ok, good to hear you were able to resolve it for now. Could you maybe give us some stats about your node so we can decide how likely this is to affect other people? Stuff like: age of the node, number of open channels / closed channels, type of activity (just routing node or shop backend). Thanks!

torkelrogstad commented 4 years ago

As mentioned above, this is a testnet node. Here are some stats:

$ lncli --network testnet listchannels | jq ".channels | length"
6
$ lncli --network testnet closedchannels | jq ".channels | length"
11
$ lncli --network testnet listchaintxns > chaintxns.json
$ jq '.transactions | length ' chaintxns.json
2341 

I haven't looked at the file in detail, but a lot of the TXs are from a few months ago, when I used this node to run an lndhub instance. There's also been some loop testing on it, but I'm not sure how much volume that was. I don't know exactly when the node was started, but the first on-chain TX was back in early March of this year.
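As a rough back-of-the-envelope check, the response size from the error and the transaction count above give an average size per transaction, and from that an estimate of how many transactions fit under a given limit. This is only a sketch: it assumes the 61820513-byte response and the 2341 transactions describe the same wallet state and that transactions are similar in size, and the helper names are made up for illustration.

```go
package main

import "fmt"

// avgBytesPerTx returns the mean number of response bytes per transaction
// (integer division). It returns 0 if txCount is not positive.
func avgBytesPerTx(totalBytes, txCount int64) int64 {
	if txCount <= 0 {
		return 0
	}
	return totalBytes / txCount
}

// txsUntilLimit estimates how many transactions of avgBytes each fit into a
// response of at most limitBytes. It returns 0 if avgBytes is not positive.
func txsUntilLimit(limitBytes, avgBytes int64) int64 {
	if avgBytes <= 0 {
		return 0
	}
	return limitBytes / avgBytes
}

func main() {
	const (
		responseBytes = 61820513          // size reported in the gRPC error
		txCount       = 2341              // from the jq count above
		oldLimit      = 50 * 1024 * 1024  // limit in the error message
		newLimit      = 100 * 1024 * 1024 // limit used in the workaround
	)

	avg := avgBytesPerTx(responseBytes, txCount)
	fmt.Printf("average bytes per tx: %d\n", avg)
	fmt.Printf("txs under 50 MiB:  about %d\n", txsUntilLimit(oldLimit, avg))
	fmt.Printf("txs under 100 MiB: about %d\n", txsUntilLimit(newLimit, avg))
}
```

With these numbers the average works out to about 26 KB per transaction, so the 50 MiB limit is reached at roughly 2,000 transactions and the 100 MiB limit at roughly 4,000.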

guggero commented 4 years ago

Thank you for the info. 2341 transactions doesn't sound abnormally large. I'll create a PR to fix this.

guggero commented 4 years ago

This issue was closed automatically because I referenced it incorrectly in lightninglabs/loop#111. lndmon still needs a follow-up PR to pull in the loop version that contains the fix. Reopening.