Open xsb opened 5 years ago
Hm, admittedly lndmon has not been tested on rpi-type hardware.
This is you attempting to hit the /metrics endpoint on lnd?
@Roasbeef lnd uses port :8989 for the metrics. I forgot to mention that that part works fine, I get the output in just a few milliseconds.
Honestly I haven't spent much time trying to debug this, but neither Prometheus nor myself (from the cli) can hit the metrics endpoint on lndmon (port :9092) fast enough.
After some time debugging I found out that what is taking so long is the GraphCollector's DescribeGraph request against lnd. The frequency seems to be too high for that call.
GraphCollector is taking more than 30% of the CPU time (understandable, this is the biggest dataset being ingested). pprof is not taking I/O into account, so reality is much worse than what is shown in the flamegraph. The main issue then seems to be that lnd is taking a few seconds to serve the whole graph. Would it be possible to make this call less often?
I changed my Prometheus config (slower interval + higher timeout) and I am running lndmon on mainnet without issues now 😄.
diff --git a/prometheus.yml b/prometheus.yml
index 01797c0..81d781c 100755
--- a/prometheus.yml
+++ b/prometheus.yml
@@ -1,6 +1,7 @@
 scrape_configs:
 - job_name: "lndmon"
-  scrape_interval: "20s"
+  scrape_interval: "30s"
+  scrape_timeout: "15s"
   static_configs:
   - targets: ['lndmon:9092']
 - job_name: "lnd"
I am not saying this should be merged, because the values are totally arbitrary. A bigger network and/or slower hardware would require even more conservative defaults.
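Since the right values depend on the hardware, a small check can help when picking them. Prometheus rejects a `scrape_timeout` that is longer than the `scrape_interval`, and a target that answers slower than the timeout is reported as down. The Python sketch below tests a candidate pair of settings against a measured response time; `parse_duration` and `check_scrape_settings` are hypothetical helpers, and the parser only handles single-unit durations such as "30s" or "2m".

```python
import re
from typing import List

# Seconds per unit for the single-unit durations this sketch supports.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def parse_duration(text: str) -> float:
    """Convert a single-unit duration such as "30s" or "2m" to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def check_scrape_settings(interval: str, timeout: str, observed_seconds: float) -> List[str]:
    """Return a list of problems; an empty list means the settings look workable."""
    problems = []
    interval_s = parse_duration(interval)
    timeout_s = parse_duration(timeout)
    if timeout_s > interval_s:
        problems.append("scrape_timeout is longer than scrape_interval")
    if observed_seconds >= timeout_s:
        problems.append("observed response time does not fit within scrape_timeout")
    return problems
```

For example, `check_scrape_settings("30s", "15s", 12.0)` returns an empty list, while a 12-second response checked against a 10-second timeout is flagged. If the original config relied on Prometheus's default 10s timeout, which is an assumption here, that would explain why the target showed as down.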
Thanks for the research @xsb, I had the same problem. For me the scrape time was 30-50 seconds. I am using an rpi3 for lnd, connected to lndmon running in the cloud. Uplink bandwidth is about 2-3Mb/s. I guess the slowdown is a combination of CPU load and the bandwidth limit. I set the scrape interval and timeout to 60s, and it seems to be working for now.
I am trying to use lnd+lndmon on a rock64 board (similar to rpi, with arm64 and 4GB RAM) but Grafana only shows data points coming directly from lnd (Go Runtime + Performance dashboard). Everything that is supposed to come from lndmon is missing.
I noticed that when running simple queries with PromQL I immediately got the error: "the queries returned no data for a table". Then I went to the Explore section and checked for up; there I can see that the lndmon process is reported to be down, which is not true. After that I tried to get the metrics directly and I realized I was getting slow response times on the metrics endpoint (usually between 10s and 12s).
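A measurement like the one described above can be reproduced with a short script. This is a minimal Python sketch using only the standard library; `timed` and `fetch_metrics` are hypothetical helper names, and the URL in the usage note assumes lndmon is reachable on localhost at the :9092 port mentioned in this thread.

```python
import time
import urllib.request
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fetch: Callable[[], T]) -> Tuple[float, T]:
    """Run `fetch` once and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fetch()
    return time.perf_counter() - start, result


def fetch_metrics(url: str, timeout: float = 60.0) -> bytes:
    """Download the full metrics page; raises on HTTP errors or on timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `timed(lambda: fetch_metrics("http://localhost:9092/metrics"))` returns the elapsed time together with the response body, so the result can be compared directly against the configured scrape timeout.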
I haven't investigated this deeply yet, but the instance has more than enough RAM, and the CPU usage and load average don't look that bad.
I will try to spend more time on this later, but wanted to report it early in case it is happening to other people.