lightningnetwork / lnd

Lightning Network Daemon ⚡️

[feature request] Provide prometheus endpoint for statistic monitoring #1243

Closed: baryluk closed this issue 5 years ago

baryluk commented 6 years ago

It could be separate daemon program, that uses lnd APIs, or built into lnd itself to provide even more statistics.

Useful statistics:

Roasbeef commented 6 years ago

Sound suggestion; we have a little monitoring+alerting toolkit planned. I took a look at Prometheus and it looks like a very viable candidate. I particularly like its query language, and also the custom alert thresholds.

andrewshvv commented 6 years ago

Currently I am doing this manually, by fetching info from lnd and using unix tools to get info about the lnd process. I use Prometheus for this, so +1 for Prometheus integration.

baryluk commented 6 years ago

Yeah, I am also writing a Python tool that every 5 minutes takes data from lncli getinfo, lncli listinvoices, listchaintxns, lncli describegraph, lncli listpeers, lncli fwdinghistory, lncli getnetworkinfo, plus stdout data and the log file, puts it into a file, computes some additional graph metrics (like median path capacity, and other weird things), and then draws them using gnuplot. But my intention is to put it in Prometheus, as it is made for exactly this and its query language and console are good.

Having it built in would 1) be faster, 2) be safer, 3) provide more internal metadata (especially things like wallet and channel database metadata and log sizes, but also network IO, goroutine stats, and various failure counters, i.e. failed payment counts, failed connections, zmq messages received, API latencies per API type, etc.), and 4) be easier to integrate, test, and extend with more metrics in the future.

baryluk commented 6 years ago

https://github.com/prometheus/client_golang#instrumenting-applications appears to be the official Go library for instrumenting Go code. There are some examples there too, and plenty of tests. It requires an HTTP server for serving metrics, but lnd already has one, and I think it should be fine not to require authentication (especially if only listening on ::1 / 127.0.0.1), as long as no private data is exposed via metric labels and metadata. Alternatively, a separate HTTP server on a separate port can be used (this has the advantage of being able to provide some metrics and status even while the wallet is still locked).
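
For illustration, a minimal sketch of serving such an endpoint with client_golang (the listen address and port are just examples, not anything lnd uses today):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Serve the default registry (which already includes the Go runtime and
	// process collectors) on localhost only, so no authentication is strictly
	// required as long as no private data ends up in labels.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe("localhost:8081", nil))
}
```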

https://prometheus.io/docs/practices/naming/ provides some good guidelines on naming variables / metrics. Some common prefix like lnd_ for crypto / lightning specific data would be useful. There might be a 3rd party library that provides instrumentation of the Go runtime, CPU and memory usage, uptime, and HTTP server stats out of the box, with standardized naming too (compatible with other languages and services).

A few additional flags will also be required, e.g. to specify the destination if the push model is used, and possibly to set the instance and job labels, so one can have multiple lnds and other services monitored by one Prometheus system and still be able to filter and select things.
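
For the push model, a rough sketch with the client_golang push package (the Pushgateway address and label values here are made up):

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Push everything in the default registry to a Pushgateway, grouped by
	// job and instance so several lnds (and other services) can share one
	// Prometheus and still be filtered apart.
	pusher := push.New("http://localhost:9091", "lnd").
		Grouping("instance", "node-1").
		Gatherer(prometheus.DefaultGatherer)
	if err := pusher.Push(); err != nil {
		log.Printf("could not push metrics: %v", err)
	}
}
```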

See https://godoc.org/github.com/prometheus/client_golang/prometheus for the full documentation and a lot of good examples.

Roasbeef commented 6 years ago

As it seems both of y'all already have experience with the tool, how about collaborating to bring the monitoring features to lnd?!

baryluk commented 6 years ago

OK. I can play with it next week and write the initial code and a tutorial on how to use it. Not that I have much real-world experience with Prometheus, but I do have a bit with Go and other monitoring systems, so I can do it.

maurycy commented 6 years ago

@baryluk I missed this issue. I've run into similar issues to the ones you're trying to solve with Prometheus, but picked StatsD because it is faster and more lightweight. Do you want to participate? There's a proof of concept at #1264

baryluk commented 6 years ago

I do have the Prometheus integration working. It exports all gRPC server metadata (count of RPCs by type and status, ok or error, and their latency), Go runtime stats (i.e. garbage collection details and goroutine counts), CPU usage, file descriptor counts and memory usage, and a few minor lnd-specific stats (for the moment just the lnd version, blockchain stats, number of channels and peers, size of the graph, some wire-level message counts, and wallet and channel balances).
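
For illustration, the gRPC side of that can be wired up with the go-grpc-prometheus interceptors; a sketch, not necessarily how my prototype does it:

```go
package monitoring // hypothetical package name

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"google.golang.org/grpc"
)

// newInstrumentedServer shows how the grpc_server_* counters and latency
// metrics could be produced for lnd's gRPC server.
func newInstrumentedServer() *grpc.Server {
	grpcServer := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// Latency histograms are opt-in; counters by method/code come by default.
	grpc_prometheus.EnableHandlingTimeHistogram()
	// Register the gRPC metrics with the default Prometheus registry; the Go
	// runtime and process collectors are part of that registry already.
	grpc_prometheus.Register(grpcServer)
	return grpcServer
}
```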

baryluk commented 6 years ago

Example:

$ curl -s "http://[::1]:8081/metrics" | grep lnd
# HELP lnd_active_channels_count Number of active channels across all peers.
# TYPE lnd_active_channels_count gauge
lnd_active_channels_count 4
# HELP lnd_best_header_timestamp What is the header timestamp of the best synced block in a chain.
# TYPE lnd_best_header_timestamp gauge
lnd_best_header_timestamp 1.527089535e+09
# HELP lnd_block_height Height of the best chain.
# TYPE lnd_block_height gauge
lnd_block_height 524015
# HELP lnd_peer_count Number of server peers.
# TYPE lnd_peer_count gauge
lnd_peer_count 5
# HELP lnd_pending_channels_count Number of pending channels.
# TYPE lnd_pending_channels_count gauge
lnd_pending_channels_count 0
# HELP lnd_synced_to_chain Is LND synced?
# TYPE lnd_synced_to_chain gauge
lnd_synced_to_chain 1
# HELP lnd_version Version of LND running.
# TYPE lnd_version gauge
lnd_version{commit="",version="0.4.1"} 1
# HELP lnd_wallet_confirmed_balance Total balance, from txs that have >= 1 confirmations.
# TYPE lnd_wallet_confirmed_balance gauge
lnd_wallet_confirmed_balance 416578
# HELP lnd_wallet_total_balance Total balance, from txs that have >= 0 confirmations.
# TYPE lnd_wallet_total_balance gauge
lnd_wallet_total_balance 416578
# HELP lnd_invoices_count Number of all invoices.
# TYPE lnd_invoices_count gauge
lnd_invoices_count 7
# HELP lnd_chain_transaction_count Number of all transactions on chain.
# TYPE lnd_chain_transaction_count gauge
lnd_chain_transaction_count 6
# HELP lnd_chain_transaction_total_fees Total of all fees across all transactions on chain.
# TYPE lnd_chain_transaction_total_fees gauge
lnd_chain_transaction_total_fees 12196
# HELP lnd_chain_transaction_received_amount_total Total amount of coins in all transactions with positive amounts.
# TYPE lnd_chain_transaction_received_amount_total gauge
lnd_chain_transaction_received_amount_total 4e+06
# HELP lnd_chain_transaction_sent_amount_total Total amount of coins in all transactions with negative amounts.
# TYPE lnd_chain_transaction_sent_amount_total gauge
lnd_chain_transaction_sent_amount_total 3.583422e+06
# HELP lnd_channel_balance_total Total .
# TYPE lnd_channel_balance_total gauge
lnd_channel_balance_total 3.562293e+06
# HELP lnd_channel_pending_open_balance_total Total .
# TYPE lnd_channel_pending_open_balance_total gauge
lnd_channel_pending_open_balance_total 0
# HELP lnd_wire_message_received_by_type_count Total number of wire messages received by type.
# TYPE lnd_wire_message_received_by_type_count counter
lnd_wire_message_received_by_type_count{type="announce_signatures"} 1
lnd_wire_message_received_by_type_count{type="channel_announcement"} 22126
lnd_wire_message_received_by_type_count{type="channel_reestablish"} 6
lnd_wire_message_received_by_type_count{type="channel_update"} 36209
lnd_wire_message_received_by_type_count{type="init"} 7
lnd_wire_message_received_by_type_count{type="node_announcement"} 9120
lnd_wire_message_received_by_type_count{type="ping"} 2
lnd_wire_message_received_by_type_count{type="pong"} 4
# HELP lnd_wire_message_sent_by_type_count Total number of wire messages (to be sent) by type.
# TYPE lnd_wire_message_sent_by_type_count counter
lnd_wire_message_sent_by_type_count{type="channel_announcement"} 26434
lnd_wire_message_sent_by_type_count{type="channel_reestablish"} 4
lnd_wire_message_sent_by_type_count{type="channel_update"} 43045
lnd_wire_message_sent_by_type_count{type="init"} 7
lnd_wire_message_sent_by_type_count{type="node_announcement"} 42372
lnd_wire_message_sent_by_type_count{type="ping"} 4
lnd_wire_message_sent_by_type_count{type="pong"} 2
# HELP lnd_graph_edge_count What is the number of known edges in the network.
# TYPE lnd_graph_edge_count gauge
lnd_graph_edge_count 8822
# HELP lnd_graph_node_count What is the size of know network.
# TYPE lnd_graph_node_count gauge
lnd_graph_node_count 1857
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
...
grpc_server_handled_total{grpc_code="OK",grpc_method="GetInfo",grpc_service="lnrpc.Lightning",grpc_type="unary"} 2
grpc_server_msg_received_total{grpc_method="GetInfo",grpc_service="lnrpc.Lightning",grpc_type="unary"} 2
grpc_server_msg_sent_total{grpc_method="GetInfo",grpc_service="lnrpc.Lightning",grpc_type="unary"} 2
grpc_server_started_total{grpc_method="GetInfo",grpc_service="lnrpc.Lightning",grpc_type="unary"} 2
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.3772e-05
go_gc_duration_seconds{quantile="0.25"} 5.0189e-05
go_gc_duration_seconds{quantile="0.5"} 5.9294e-05
go_gc_duration_seconds{quantile="0.75"} 0.00011721
go_gc_duration_seconds{quantile="1"} 0.007832316
go_gc_duration_seconds_sum 0.442792173
go_gc_duration_seconds_count 844
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 177
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.3921608e+07
...
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.605632e+06
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 2.893806e+08
...
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 61.21
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65536
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 28
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.19295744e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.52708915298e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.187811328e+09

Adding new counters is relatively easy.
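
For example, a new lnd_-prefixed gauge along the lines of the ones above takes only a few lines with client_golang (a sketch; the package name and the update site are illustrative, not necessarily what my prototype does):

```go
package telemetry // hypothetical package name

import "github.com/prometheus/client_golang/prometheus"

// peerCount mirrors the lnd_peer_count gauge shown in the output above.
var peerCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "lnd",
	Name:      "peer_count",
	Help:      "Number of server peers.",
})

func init() {
	prometheus.MustRegister(peerCount)
}

// Somewhere in the server code, whenever the peer set changes:
//   peerCount.Set(float64(len(peers)))
```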

baryluk commented 6 years ago

Per channel stats:

# HELP lnd_channel_capacity L...
# TYPE lnd_channel_capacity gauge
lnd_channel_capacity{active="false",channel_id="574380476322676737",public="true"} 214606
lnd_channel_capacity{active="false",channel_id="574560796294316032",public="true"} 100000
lnd_channel_capacity{active="true",channel_id="574244136963276801",public="true"} 1.8e+06
lnd_channel_capacity{active="true",channel_id="574379376815177729",public="true"} 958732
lnd_channel_capacity{active="true",channel_id="574379376815243265",public="true"} 597888
# HELP lnd_channel_commit_weight L...
# TYPE lnd_channel_commit_weight gauge
lnd_channel_commit_weight{active="false",channel_id="574380476322676737",public="true"} 600
lnd_channel_commit_weight{active="false",channel_id="574560796294316032",public="true"} 552
lnd_channel_commit_weight{active="true",channel_id="574244136963276801",public="true"} 600
lnd_channel_commit_weight{active="true",channel_id="574379376815177729",public="true"} 600
lnd_channel_commit_weight{active="true",channel_id="574379376815243265",public="true"} 600
# HELP lnd_channel_csv_delay Local channel config CSV delay
# TYPE lnd_channel_csv_delay gauge
lnd_channel_csv_delay{active="false",channel_id="574380476322676737",public="true"} 144
lnd_channel_csv_delay{active="false",channel_id="574560796294316032",public="true"} 144
lnd_channel_csv_delay{active="true",channel_id="574244136963276801",public="true"} 144
lnd_channel_csv_delay{active="true",channel_id="574379376815177729",public="true"} 144
lnd_channel_csv_delay{active="true",channel_id="574379376815243265",public="true"} 144
# HELP lnd_channel_external_commit_fee L...
# TYPE lnd_channel_external_commit_fee gauge
lnd_channel_external_commit_fee{active="false",channel_id="574380476322676737",public="true"} 4163
lnd_channel_external_commit_fee{active="false",channel_id="574560796294316032",public="true"} 1665
lnd_channel_external_commit_fee{active="true",channel_id="574244136963276801",public="true"} 1448
lnd_channel_external_commit_fee{active="true",channel_id="574379376815177729",public="true"} 1520
lnd_channel_external_commit_fee{active="true",channel_id="574379376815243265",public="true"} 1802
# HELP lnd_channel_fee_per_kw L...
# TYPE lnd_channel_fee_per_kw gauge
lnd_channel_fee_per_kw{active="false",channel_id="574380476322676737",public="true"} 5750
lnd_channel_fee_per_kw{active="false",channel_id="574560796294316032",public="true"} 2300
lnd_channel_fee_per_kw{active="true",channel_id="574244136963276801",public="true"} 2000
lnd_channel_fee_per_kw{active="true",channel_id="574379376815177729",public="true"} 2000
lnd_channel_fee_per_kw{active="true",channel_id="574379376815243265",public="true"} 2000
# HELP lnd_channel_local_balance_by_channel Local balance of a channel by channel
# TYPE lnd_channel_local_balance_by_channel gauge
lnd_channel_local_balance_by_channel{active="false",channel_id="574380476322676737",public="true"} 210443
lnd_channel_local_balance_by_channel{active="false",channel_id="574560796294316032",public="true"} 0
lnd_channel_local_balance_by_channel{active="true",channel_id="574244136963276801",public="true"} 1.798552e+06
lnd_channel_local_balance_by_channel{active="true",channel_id="574379376815177729",public="true"} 957212
lnd_channel_local_balance_by_channel{active="true",channel_id="574379376815243265",public="true"} 596086
# HELP lnd_channel_pending_htlcs_count Local commit HTLCs count
# TYPE lnd_channel_pending_htlcs_count gauge
lnd_channel_pending_htlcs_count{active="false",channel_id="574380476322676737",public="true"} 0
lnd_channel_pending_htlcs_count{active="false",channel_id="574560796294316032",public="true"} 0
lnd_channel_pending_htlcs_count{active="true",channel_id="574244136963276801",public="true"} 0
lnd_channel_pending_htlcs_count{active="true",channel_id="574379376815177729",public="true"} 0
lnd_channel_pending_htlcs_count{active="true",channel_id="574379376815243265",public="true"} 0
# HELP lnd_channel_remote_balance_by_channel Remote balance of a channel by channel
# TYPE lnd_channel_remote_balance_by_channel gauge
lnd_channel_remote_balance_by_channel{active="false",channel_id="574380476322676737",public="true"} 0
lnd_channel_remote_balance_by_channel{active="false",channel_id="574560796294316032",public="true"} 98335
lnd_channel_remote_balance_by_channel{active="true",channel_id="574244136963276801",public="true"} 0
lnd_channel_remote_balance_by_channel{active="true",channel_id="574379376815177729",public="true"} 71
lnd_channel_remote_balance_by_channel{active="true",channel_id="574379376815243265",public="true"} 354
# HELP lnd_channel_satoshis_received L...
# TYPE lnd_channel_satoshis_received gauge
lnd_channel_satoshis_received{active="false",channel_id="574380476322676737",public="true"} 0
lnd_channel_satoshis_received{active="false",channel_id="574560796294316032",public="true"} 0
lnd_channel_satoshis_received{active="true",channel_id="574244136963276801",public="true"} 0
lnd_channel_satoshis_received{active="true",channel_id="574379376815177729",public="true"} 0
lnd_channel_satoshis_received{active="true",channel_id="574379376815243265",public="true"} 0
# HELP lnd_channel_satoshis_sent L...
# TYPE lnd_channel_satoshis_sent gauge
lnd_channel_satoshis_sent{active="false",channel_id="574380476322676737",public="true"} 0
lnd_channel_satoshis_sent{active="false",channel_id="574560796294316032",public="true"} 0
lnd_channel_satoshis_sent{active="true",channel_id="574244136963276801",public="true"} 0
lnd_channel_satoshis_sent{active="true",channel_id="574379376815177729",public="true"} 71
lnd_channel_satoshis_sent{active="true",channel_id="574379376815243265",public="true"} 354
# HELP lnd_channel_updates_count Local commit height
# TYPE lnd_channel_updates_count gauge
lnd_channel_updates_count{active="false",channel_id="574380476322676737",public="true"} 37
lnd_channel_updates_count{active="false",channel_id="574560796294316032",public="true"} 356
lnd_channel_updates_count{active="true",channel_id="574244136963276801",public="true"} 75
lnd_channel_updates_count{active="true",channel_id="574379376815177729",public="true"} 114
lnd_channel_updates_count{active="true",channel_id="574379376815243265",public="true"} 282

My next steps are to add information about the number of peers by inbound/outbound, a histogram of ping times to peers, and total bytes sent/received and total sats sent/received. As the number of peers is usually bounded, breaking this down by peer pub_key is an option too.

After that:

These roughly correspond to what is seen in LND stderr/stdout and the log file during normal operation, just in a more organized and machine-parsable way.
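
Per-channel metrics like the ones above map naturally onto a GaugeVec keyed by the channel labels. A sketch (recordChannel is a hypothetical helper; the values would come from the ListChannels response):

```go
package telemetry // hypothetical package name

import (
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

var channelCapacity = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Namespace: "lnd",
	Name:      "channel_capacity",
	Help:      "Capacity of a channel, by channel.",
}, []string{"channel_id", "active", "public"})

func init() {
	prometheus.MustRegister(channelCapacity)
}

// recordChannel updates the per-channel gauge for one channel.
func recordChannel(chanID uint64, active, public bool, capacity int64) {
	channelCapacity.With(prometheus.Labels{
		"channel_id": strconv.FormatUint(chanID, 10),
		"active":     strconv.FormatBool(active),
		"public":     strconv.FormatBool(public),
	}).Set(float64(capacity))
}
```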

crt434 commented 6 years ago

@baryluk hey Witold, do you have an email or are you on the lnd slack? Would love to talk about this stuff further

baryluk commented 6 years ago

Small prelude of Grafana pulling data from Prometheus:

[Grafana dashboard screenshot]

@crt434, I should have a pull request ready in a day or two. It is rather simple, but there is some code duplication in a few metrics and I am following some hints from the Prometheus Go client library authors to make it cleaner (and more efficient).

baryluk commented 6 years ago

@crt434 I am on lnd ( lightningcommunity ) slack as baryluk.

Roasbeef commented 6 years ago

@baryluk that looks dope!!!

I've been chatting w/ @cfromknecht a bit on the least invasive way to integrate pluggable monitoring. We've concluded that our preferred path would be to: pull in the existing set of streaming RPCs, poll the relevant non-streaming RPCs, and finally create a new StreamingTelemetry sub-system and server-side streaming RPC. The role of this new streaming RPC would be to export simple key-values over the RPC system. The upside of this is that we don't need to fully integrate a specific metrics or time-series DB, as that would increase the set of dependencies by quite a bit and increase the binary size as well. Instead we'd create internal schemas, with keys something like channel.channel_point.settle and values something like r_hash.value.fee, and so on. This loosely couples the services pulling the metrics from those that are actually generating them, and we can even utilize build flags such that if this isn't enabled, it is just a no-op.

The next level would be to collaborate on a new external repo that packages up: prometheus (or whatever is preferred, but Prometheus seems pretty feature complete), the new daemon that parses and exports the metric schema, and finally Grafana. Ideally this would be easy to run as a docker container, such that it's a "one click install"!

So with this, we can easily add new relevant exported metrics (aside from those that are already in the streaming RPC interface), and the lndmon daemon itself can also plug in different metric backends.

Roasbeef commented 6 years ago

We can also start a new channel on the slack to discuss this on an ongoing basis.

baryluk commented 6 years ago

I find this pointless. Prometheus works very well and can easily be bridged to other monitoring systems using a generic proxy, as the metrics themselves carry rich enough metadata. The Prometheus Go library also has pluggable modules to export data to other systems: simply link in or import an additional package, with zero code changes, to export stats to InfluxDB, Graphite or StatsD.
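
For example, client_golang ships a Graphite bridge that re-exports whatever is already registered, without touching the instrumentation code. A rough sketch (the Graphite address, prefix and interval are made up, and the exact Config fields may vary between client_golang versions):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/graphite"
)

func main() {
	// Periodically push everything in the default registry to Graphite.
	bridge, err := graphite.NewBridge(&graphite.Config{
		URL:      "graphite.example.com:2003",
		Gatherer: prometheus.DefaultGatherer,
		Prefix:   "lnd",
		Interval: 60 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	bridge.Run(context.Background())
}
```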

The upside of this is that we don't need to fully integrate a specific metrics or time-series DB, as that would increase the set of dependencies by quite a bit and increase the binary size as well. [...]

That is pure speculation. The set of dependencies is extremely small. In my prototype there is almost no change in binary size or memory usage (my guess is less than 50KB of code and 50KB of memory for state keeping), or CPU usage (my guess is less than 0.1%, and for some metrics actually exactly zero, as they are only computed at scrape time). It is immeasurable, and compared to the StatsD model it does not add significant network and CPU overheads that scale as the amount of processing increases (which is silly).

Roasbeef commented 6 years ago

You find the external daemon pointless? The goal there is to reduce the surface area of lnd itself, by moving the job of responding to requests for metrics into a distinct process. If any security vulnerabilities are discovered in the Prometheus clients, then lnd itself would be isolated, as the lndmon process would be given a read-only macaroon. The RPC route would also allow us to hide special notifications behind a build flag that can be used to further synchronize our integration tests.

Roasbeef commented 6 years ago

I am in no way arguing against the usage of Prometheus. From what I've learned of it, it seems pretty amazing.

baryluk commented 6 years ago

If you invent a new set of streaming RPCs and put monitoring outside of lnd, I think security will actually be lower, not to mention that the format used to communicate between lnd and the external monitoring daemon would most likely either follow some existing monitoring system very closely, or be broken because we do not have enough experience with all the potential future corner cases of monitoring. Why not Prometheus? It is already proven, and well maintained.

[Security vulnerability in Prometheus Go client]

Very unlikely. Almost impossible. The client takes zero input from external users (other than what the HTTP server receives via GET requests), it never actively connects to any system (trusted or untrusted), and one is not supposed to leak any dynamic or private data via monitoring (if one does, it is the fault of the developer/user, not the Prometheus Go library). My code opens a separate HTTP server (the standard Go http server) on localhost, but we could integrate it into the existing HTTP server without needing a separate port (there are benefits to having it on a separate port though, with a separate request queue / threads for processing metrics). The only issues I could find with the Prometheus Go client are if one scrapes metrics many times per second (which I actually have a solution for in my prototype - i.e. the size of the Lightning network graph requires an O(n) traversal of all vertices, but I cache the results for 60 seconds, so it is not a problem), or if the port is open to the public and vulnerable to DDoS, but that is true of any monitoring solution, and sensible defaults and instructions to the user should solve that. The standard interval for monitoring lnd (the so-called scrape interval) and other systems should be 60 seconds. So not a problem. And even if some data leaks (like channel ids, peer pubkeys, or wallet balances) because somebody added a metric with sensitive labels and made it public, it is not the end of the world, as much of this metadata is already public via the network itself.
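
A sketch of what such scrape-time caching can look like with a GaugeFunc (countGraphNodes is a hypothetical stand-in for the O(n) traversal; this is not necessarily how my prototype does it):

```go
package telemetry // hypothetical package name

import (
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	graphMu       sync.Mutex
	graphNodes    float64
	graphMeasured time.Time
)

// countGraphNodes is a hypothetical stand-in for the expensive O(n)
// traversal of the channel graph.
func countGraphNodes() float64 { return 0 }

var graphNodeCount = prometheus.NewGaugeFunc(prometheus.GaugeOpts{
	Namespace: "lnd",
	Name:      "graph_node_count",
	Help:      "Number of known nodes in the network graph.",
}, func() float64 {
	graphMu.Lock()
	defer graphMu.Unlock()
	// Redo the expensive traversal at most once per minute, even if the
	// endpoint is scraped more often.
	if time.Since(graphMeasured) > 60*time.Second {
		graphNodes = countGraphNodes()
		graphMeasured = time.Now()
	}
	return graphNodes
})

func init() {
	prometheus.MustRegister(graphNodeCount)
}
```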

If any security vulnerabilities are discovered in the Prometheus clients, then lnd itself would be isolated, as the lndmon process would be given a read-only macaroon.

It is already read only.

The RPC route would also allow us to hide special notifications behind a build flag that can be used to further synchronize our integration tests.

I do not follow. What "special notifications"? Why should this feature be behind a build flag? Monitoring is monitoring. It should not be used for too many things. One can use metrics in tests to verify some things (like the monitoring counters themselves, or some internals of lnd), but that is really it.

You find the external daemon pointless?

More counterproductive, redundant, and reinventing the wheel.

The goal there is to reduce the surface area of lnd itself, ...

And I think using the Prometheus Go client achieves exactly that (compared to adding lndmon, its integration tests, maintaining multiple implementations of lndmon, and reinventing communication formats, which have a lot of subtle design issues, like buffering, sync/async, parsing, escaping of labels, no monitoring redundancy, etc.). A big plus is that the Prometheus Go client library is small (if not extremely small), well defined, and maintained by experts in monitoring. The only security concern would be the fact that this is an external dependency, and compromised code could sneak into the code base. But this is official Prometheus dev code, and we can pin the dependency to a specific release / commit hash. And it is no different than the dozen other dependencies lnd uses.

I understand that using APIs and an external daemon seems like a more flexible, extensible and more secure solution, but I do not agree with any of these points. I feel any home-brewed solution will perform more poorly, be limiting in some way, be incompatible with all monitoring systems to some degree, and be a maintenance problem in the future (especially if there are multiple implementations of lndmon).

Roasbeef commented 6 years ago

or if the port is open to the public and vulnerable to DDoS, but that is true of any monitoring solution, and sensible defaults and instructions to the user should solve that.

AFAIK, that's always the case currently? As in there's no default authentication or anything like that, so users would have to spin up their own, via a proxy or special headers perhaps?
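
For what it's worth, one way users could add their own protection is to wrap the metrics handler with e.g. basic auth. A minimal sketch (the credentials and listen address are placeholders, not anything lnd provides):

```go
package main

import (
	"crypto/subtle"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// basicAuth wraps a handler with a static username/password check; in
// practice the credentials would come from the node's config.
func basicAuth(next http.Handler, user, pass string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok ||
			subtle.ConstantTimeCompare([]byte(u), []byte(user)) != 1 ||
			subtle.ConstantTimeCompare([]byte(p), []byte(pass)) != 1 {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/metrics", basicAuth(promhttp.Handler(), "metrics", "secret"))
	log.Fatal(http.ListenAndServe("localhost:8081", nil))
}
```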

The RPC route would also allow us to hide special notifications behind a build flag that can be used to further synchronize our integration tests.

Special notifications as in once a particular operation hits a particular milestone. So like if we're forwarding an HTLC, when we actually decode the onion blob or something like that. This would allow us to further synchronize the integration tests when we do things like restarting or persistence tests.

Why should this feature be behind a build flag? Monitoring is monitoring. It should not be used for too many things.

It should be behind a build flag simply because most instances will never actually utilize the feature (for example laptops and mobile clients), so we can make it opt-in essentially. Only if users set the build flag (when compiling) and also provide the relevant args should it become active. I've read the prometheus go client and its surrounding code, and it is indeed much lighter than I originally suspected.

I've started to move away from the lndmon architecture altogether, however. Instead, I favor integrating prometheus at a closer level, but still behind a build flag. So in this case, the telemetry package would expose a certain set of interfaces that closely mirrors the histograms, gauges, counters, etc. that come with prometheus. The default implementation (no build flag specified) would simply map and proxy these metrics over the regular streaming RPC. However, with the build flag set it would parse the relevant configuration and set up the appropriate Prometheus client. The one challenge with this approach is creating a set of interfaces that closely maps to the existing Prometheus interface without deviating too much, while still providing it with all the extra options and arguments that come with all the gadgets.
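
A very rough sketch of what such a set of interfaces could look like (entirely illustrative, nothing like this exists in lnd today; a build like `go build -tags monitoring` could swap in a client_golang-backed Backend):

```go
package telemetry // hypothetical package name

// Gauge and Counter deliberately mirror a small subset of the Prometheus
// interfaces, so a Prometheus-backed implementation is just a thin wrapper.
type Gauge interface {
	Set(v float64)
}

type Counter interface {
	Inc()
	Add(v float64)
}

// Backend is what the rest of lnd would program against. The default build
// could be a no-op or proxy values over the streaming RPC; a monitoring
// build tag would select a client_golang implementation instead.
type Backend interface {
	NewGauge(name, help string) Gauge
	NewCounter(name, help string) Counter
}

// noop implementations back the default, metrics-disabled build.
type noopGauge struct{}

func (noopGauge) Set(float64) {}

type noopCounter struct{}

func (noopCounter) Inc()        {}
func (noopCounter) Add(float64) {}

type noopBackend struct{}

func (noopBackend) NewGauge(name, help string) Gauge     { return noopGauge{} }
func (noopBackend) NewCounter(name, help string) Counter { return noopCounter{} }
```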

alevchuk commented 5 years ago

As the complexity of LND grows, having monitoring will allow quickly detecting performance regressions. Also, by tracking balances, we can detect correctness regressions. Please consider making an early iteration on this.

alevchuk commented 5 years ago

friendly-ping

Roasbeef commented 5 years ago

This is on our roadmap for 0.6; it hasn't been forgotten!

alevchuk commented 5 years ago

Hi, any updates on monitoring for us (folks who test the latest changes to lnd)?

halseth commented 5 years ago

No longer on the roadmap for 0.6, but still not forgotten! 😆

blackjid commented 5 years ago

Hey, in the meantime I built this exporter: https://github.com/platanus/lightning-prometheus-exporter. I don't have access through the API to all the goodies that https://github.com/lightningnetwork/lnd/pull/1355 is exposing, but it is a first iteration.

Roasbeef commented 5 years ago

See #3134. We'll have some follow up projects in this lane as well.

alevchuk commented 5 years ago

Nice! Now we have lndmon, which addresses the monitoring need.

I was able to install it in a few hours. Looks awesome!
