lightninglabs / lndmon

🔎lndmon: A drop-in monitoring solution for your lnd node using Prometheus+Grafana

Performance issue with /metrics endpoint #28

Open xsb opened 5 years ago

xsb commented 5 years ago

I am trying to use lnd+lndmon on a rock64 board (similar to an rpi, with arm64 and 4GB RAM), but Grafana only shows data points coming directly from lnd (the Go Runtime + Performance dashboard). Everything that is supposed to come from lndmon is missing.

I noticed that when running simple queries with PromQL I immediately got the error: "the queries returned no data for a table". Then I went to the Explore section and checked the up metric, and there I can see the lndmon process reported as down, which is not true.

After that I tried to fetch the metrics directly and realized I was getting slow response times on the metrics endpoint (usually between 10s and 12s):

$ time curl -s --output /dev/null localhost:9092/metrics

real    0m10.717s
user    0m0.022s
sys 0m0.015s

I haven't investigated this deeply yet, but the instance has more than enough RAM, and the CPU usage and load average don't look that bad.

I'll try to spend more time on this later, but I wanted to report it early in case it's happening to more people.

valentinewallace commented 5 years ago

Hm, admittedly lndmon has not been tested on rpi-type hardware.

Roasbeef commented 5 years ago

Is this you attempting to hit the /metrics endpoint on lnd?

xsb commented 5 years ago

@Roasbeef lnd uses port :8989 for its metrics. I forgot to mention that part works fine; I get the output in just a few milliseconds.

Honestly I haven't spent much time trying to debug this, but neither Prometheus nor I (from the CLI) can hit the metrics endpoint on lndmon (port :9092) fast enough.

xsb commented 5 years ago

After some time debugging, I found out that what takes so long is the GraphCollector's DescribeGraph request against lnd. The scrape frequency seems to be too high for that call.
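
For reference, a minimal sketch of timing that RPC in isolation, assuming a reachable lnd gRPC endpoint. The host, port, and TLS cert path below are placeholders, and macaroon authentication is left out for brevity, so this is an illustration rather than lndmon's actual wiring:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// TLS credentials for lnd's gRPC endpoint; the cert path is a placeholder.
	creds, err := credentials.NewClientTLSFromFile("/path/to/tls.cert", "")
	if err != nil {
		log.Fatalf("loading TLS cert: %v", err)
	}

	// Dial lnd; host and port are placeholders. Macaroon authentication is
	// omitted here, so a real setup would also need macaroon credentials.
	conn, err := grpc.Dial("localhost:10009", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatalf("dialing lnd: %v", err)
	}
	defer conn.Close()

	client := lnrpc.NewLightningClient(conn)

	// Time the DescribeGraph RPC that the GraphCollector issues on each scrape.
	start := time.Now()
	graph, err := client.DescribeGraph(context.Background(), &lnrpc.ChannelGraphRequest{})
	if err != nil {
		log.Fatalf("DescribeGraph: %v", err)
	}
	fmt.Printf("DescribeGraph returned %d nodes and %d channels in %v\n",
		len(graph.Nodes), len(graph.Edges), time.Since(start))
}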

xsb commented 5 years ago

GraphCollector is taking more than 30% of the CPU time (understandable, since this is the biggest dataset being ingested). pprof does not take I/O into account, so reality is much worse than what the flamegraph shows. The main issue then seems to be that lnd takes a few seconds to serve the whole graph. Would it be possible to make this call less often?

[Screenshot 2019-08-08 at 13 29 22: pprof flamegraph]
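
One possible way to make the call less often, sketched below: cache the DescribeGraph response inside the collector and only refresh it once a TTL has expired, so back-to-back scrapes reuse the last graph. This is only an illustration of the idea, not lndmon's actual collector code; the graphCache type and its fields are made up for this example.

package graphcache

import (
	"context"
	"sync"
	"time"

	"github.com/lightningnetwork/lnd/lnrpc"
)

// graphCache is a hypothetical wrapper that refreshes the channel graph at
// most once per ttl, so repeated Prometheus scrapes reuse the last response.
type graphCache struct {
	mu          sync.Mutex
	client      lnrpc.LightningClient
	ttl         time.Duration
	lastFetched time.Time
	graph       *lnrpc.ChannelGraph
}

func (c *graphCache) describeGraph(ctx context.Context) (*lnrpc.ChannelGraph, error) {
	c.mu.Lock()
	defer c.mu.Unlock()

	// Serve the cached graph while it is still fresh.
	if c.graph != nil && time.Since(c.lastFetched) < c.ttl {
		return c.graph, nil
	}

	// Otherwise hit lnd and remember the result for the next ttl window.
	graph, err := c.client.DescribeGraph(ctx, &lnrpc.ChannelGraphRequest{})
	if err != nil {
		return nil, err
	}
	c.graph = graph
	c.lastFetched = time.Now()
	return graph, nil
}

With, say, a 5-minute TTL, the expensive RPC would run a handful of times per hour instead of on every scrape, at the cost of slightly stale graph metrics.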

xsb commented 5 years ago

I changed my Prometheus config (slower interval + higher timeout) and I am running lndmon on mainnet without issues now 😄.

diff --git a/prometheus.yml b/prometheus.yml
index 01797c0..81d781c 100755
--- a/prometheus.yml
+++ b/prometheus.yml
@@ -1,6 +1,7 @@
 scrape_configs:
 - job_name: "lndmon"
-  scrape_interval: "20s"
+  scrape_interval: "30s"
+  scrape_timeout: "15s"
   static_configs:
   - targets: ['lndmon:9092']
 - job_name: "lnd"

I am not saying this should be merged, because the values are totally arbitrary. A bigger network and/or slower hardware would require even more conservative defaults.

menzels commented 5 years ago

Thanks for the research @xsb, I had the same problem. For me the scrape time was 30-50 seconds. I am using an rpi3 for lnd, connected to lndmon running in the cloud; the uplink bandwidth is about 2-3 Mb/s. I guess the slowdown is a combination of CPU load and the bandwidth limit. I set the scrape interval and timeout to 60s, and like this it seems to be working for now.