Feature request: user-facing observability to identify config or performance issues, and indicate what a user could optimize

crProductGuy commented 2 years ago

Description

As a node operator and miner, I would like objective data to know:

If my machines' configurations are not optimal,
or if my network connections are insufficient for reliable and optimal syncing and submission of candidate mined blocks,

So that:

I can fix what's possible within my control,
I can be aware of things beyond my local control that might have other solutions (e.g. switch hosting location to a better-connected place).

I'd also like the Iron Fish network code to self-optimize peer connections for low and predictable latency across the worldwide network of nodes, so that my candidate mined blocks are efficiently propagated and miners have a fair chance for "their" blocks to be accepted.

The larger focus is, what should the user know, to be able to optimize their system, whether local or cloud-based. This could include last-mile networking connections (e.g. long latency or high packet loss with peers).

Summary: 1) Measure and identify troublesome latency or performance mismatches that are limiting mining success, that wouldn't otherwise be detectable by the user / mining operator.

2) Present simple corrective actions or recommendations if any are discovered. (self-explanatory to the user, to limit the number of new Support questions)

3) Identify other issues that may not be easily solved, that could frustrate users, so they are at least aware.

Common questions this enhancement can help address:

"Is my system working right? How can I know that?" "Are all my nodes and miners healthy and working together properly?" "Am I configured for success to maximize the mining power I have invested in?"
"Do I have a fair chance to mine blocks?" "Did something go wrong? I haven't mined any blocks in 2 weeks!" "Why am I mining only 1/4 as many blocks as the hashrate calculator predicts?"

Bad outcomes or situations to detect and prevent:

1) User mines a candidate block (wins the first race) but loses the second race (network consensus) due to excessive delays from the miner to the node's peers, where something could be changed to reduce those delays.

2) One of several miners (stand-alone node+miner or separate miner homed to a central node) becomes flaky or unreachable, goes off-line, or stops running.

This is a divide-and-conquer scenario for system-level and distributed network-level problems. There are 3 parts to this:

A) Internal to the machine or between cooperating distributed node & miner machines. Are the machines running well and working together right?

B) External, from the node to its peers (for distributing candidate mined blocks and receiving chain updates).

The external component has both a common part (local ISP connection) and distributed parts (Internet latencies to the current 50 peers, each with its own local ISP connection).

Ideally all paths would be instrumented well enough to identify major sources of delays that the user should know about and might be able to influence.

Example root causes to try to detect and report:

A) CPU is too busy and cannot multitask quickly enough to provide CPU cycles to the node when needed.

(This is a hypothesis. I don't know if it's a real thing. That should be checked first with a quick engineering spike, with different CPU loadings on a test machine with various numbers of mining threads and other CPU and memory bandwidth loadings, to see if native installations and Docker containers on Windows, Linux, and Mac have unresponsive node processes occur badly enough to be worth more work and "real observability").

Measurement ideas:

Establish a regular heartbeat timer between the miner and the node.
Timestamp when a mined block is sent by the miner process to the node, and when the node receives it (with sub-microsecond resolution if possible). Keep statistics and show average, median, and 90th percentile. Record in-machine latency for every block submitted to the network and whether it was accepted on the main chain. Note if there was a difference in internal latency for accepted blocks vs. orphan blocks.
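The bookkeeping described above could be sketched roughly as follows. This is only an illustration in Python, not Iron Fish code: `BlockSubmission`, `summarize`, and `accepted_vs_orphaned` are hypothetical names, and it assumes the miner and node timestamps come from a clock they share (which only holds trivially when both run on the same machine).

```python
from dataclasses import dataclass
from statistics import mean, median, quantiles


@dataclass
class BlockSubmission:
    """One mined block handed from the miner to the node (hypothetical record)."""
    sent_ns: int       # miner-side timestamp in nanoseconds
    received_ns: int   # node-side timestamp, assumed to use the same clock
    accepted: bool     # True if the block ended up on the main chain

    @property
    def latency_ms(self) -> float:
        return (self.received_ns - self.sent_ns) / 1e6


def summarize(latencies_ms):
    """Return average, median, and 90th percentile, or None if too few samples."""
    if len(latencies_ms) < 2:
        return None
    # quantiles(n=10) yields 9 cut points; the last one is the 90th percentile.
    p90 = quantiles(latencies_ms, n=10, method="inclusive")[-1]
    return {"mean": mean(latencies_ms), "median": median(latencies_ms), "p90": p90}


def accepted_vs_orphaned(submissions):
    """Summarize in-machine latency separately for accepted and orphaned blocks."""
    accepted = [s.latency_ms for s in submissions if s.accepted]
    orphaned = [s.latency_ms for s in submissions if not s.accepted]
    return {"accepted": summarize(accepted), "orphaned": summarize(orphaned)}
```

Comparing the two summaries side by side is the quickest way to see whether orphaned blocks really had higher internal latency, which is the question the spike is meant to answer.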

Reporting/alerting: If median is high or rising or bouncy, or if 90th percentile is excessive vs the median, or too many blocks are not accepted due to high in-machine latency, notify the user.
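A minimal sketch of such an alert rule, in Python for illustration. The function name and all three thresholds are placeholders I made up; real limits would have to come from measured data. "Rising" is approximated here by comparing the median of the newer half of the samples against the older half.

```python
from statistics import median, quantiles


def latency_alerts(latencies_ms, median_limit_ms=50.0,
                   tail_ratio_limit=3.0, rise_ratio_limit=1.5):
    """Return alert labels for a time-ordered list of latency samples.

    All thresholds are hypothetical defaults, not measured values.
    """
    alerts = []
    half = len(latencies_ms) // 2
    if half < 2:
        return alerts  # not enough data to say anything
    older, recent = latencies_ms[:half], latencies_ms[half:]
    med = median(recent)
    old_med = median(older)
    p90 = quantiles(recent, n=10, method="inclusive")[-1]
    if med > median_limit_ms:
        alerts.append("median_high")
    if old_med > 0 and med / old_med > rise_ratio_limit:
        alerts.append("median_rising")
    if med > 0 and p90 / med > tail_ratio_limit:
        alerts.append("tail_excessive")
    return alerts
```

For example, `latency_alerts([10.0] * 10 + [20.0] * 10)` flags only `median_rising`, because the recent median doubled while staying below the absolute limit.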

Correction: reduce mining threads or stop competing processes on the machine.

MVP: build a quick and dirty latency measurement and let the data speak for themselves: is internal latency or "absent" CPU under-serving the Node or Miners even a thing? Only build the above more elaborate observability and alerting if it seems to be a real problem that the user can affect.

B) Separated node and miner machines appear to have adequate connectivity, but the performance or consistency is too poor for successful mining and block submission.

This is a system-level latency and maybe packet loss issue which could be detected by enhanced telemetry and reported to the user. Measurements could be similar to the above, but now the target root causes are external to the machines (latency and packet loss within and between data centers). The user would want to know if any miners have bad connections to the respective node(s).

Measurements: Regular round-trip heartbeats (once per block time? Once per Mining Batch time?) between miner(s) and node.
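One way the heartbeat bookkeeping might look, sketched in Python. `HeartbeatTracker` is a hypothetical helper; it only tracks sequence numbers, round-trip times, and timeouts, and leaves the actual transport between miner and node out of scope. The 2-second timeout is an arbitrary placeholder.

```python
import time


class HeartbeatTracker:
    """Tracks round-trip time and loss for numbered heartbeats (illustrative)."""

    def __init__(self, timeout_s=2.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.pending = {}    # sequence number -> send time in seconds
        self.rtts_ms = []    # completed round trips
        self.lost = 0        # heartbeats that timed out
        self.next_seq = 0

    def send(self):
        """Record an outgoing heartbeat and return its sequence number."""
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = self.clock()
        return seq

    def on_reply(self, seq):
        """Record a reply; returns the RTT in ms, or None if unknown or expired."""
        sent = self.pending.pop(seq, None)
        if sent is None:
            return None
        rtt_ms = (self.clock() - sent) * 1000
        self.rtts_ms.append(rtt_ms)
        return rtt_ms

    def expire(self):
        """Count pending heartbeats older than the timeout as lost."""
        now = self.clock()
        expired = [s for s, t in self.pending.items() if now - t > self.timeout_s]
        for s in expired:
            del self.pending[s]
        self.lost += len(expired)
        return len(expired)

    def loss_rate(self):
        total = len(self.rtts_ms) + self.lost
        return self.lost / total if total else 0.0
```

The miner would call `send()` on each interval (per block time or per mining batch, whichever is chosen), `on_reply()` when the node echoes the sequence number back, and `expire()` periodically. A late reply to an already-expired heartbeat is ignored rather than counted twice.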

Corrections:

Provision the node and miner(s) on a high-bandwidth LAN within a single rack, data center, or availability zone, etc., to minimize latency and competing traffic.
Use wired LAN connections at home instead of WiFi. In my testing, this can save several milliseconds of latency on 5GHz AC connections with multiple devices on the WLAN and multiple neighbors' WiFi networks within 50 meters causing RF interference, router channel "hunting", and signal level fluctuations. LAN connections definitely reduce the variability. I run ping -t google.com continuously, and typically see 5-7 ms ping times even over a VPN (NordVPN OpenVPN over UDP). I have modern mesh routers with GigEth backhaul, and Ziply gig fiber ISP.

C) A combined machine (single server with node & miner) has a slow, erratic, lossy, or overloaded ISP connection, which adds excessive latency that interferes with optimum node & miner operation and communication to peers.

Measurements:

Regular pings or DNS pings to Google or other well-known, well-distributed hosts to verify the local ISP connection's speed, latency, and stability.
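As a rough sketch of such a probe in Python: ICMP ping usually needs elevated privileges, so this swaps in TCP connect time as a stand-in for ping time. That is my substitution, not something proposed above, and the measured time also includes DNS resolution when a hostname is given. The function names, port, sample count, and interval are all illustrative assumptions.

```python
import socket
import time
from statistics import median, pstdev


def tcp_connect_ms(host, port=443, timeout_s=2.0):
    """Time one TCP connect in milliseconds; returns None on failure or timeout."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.perf_counter() - start) * 1000
    except OSError:
        return None


def probe(host, port=443, count=5, interval_s=1.0):
    """Take several samples and report failures, median, and spread."""
    samples = []
    for i in range(count):
        samples.append(tcp_connect_ms(host, port))
        if i < count - 1:
            time.sleep(interval_s)
    ok = [s for s in samples if s is not None]
    return {
        "sent": count,
        "failed": count - len(ok),
        "median_ms": median(ok) if ok else None,
        # Standard deviation is used here as a simple jitter figure.
        "jitter_ms": pstdev(ok) if len(ok) > 1 else None,
    }
```

Something like `probe("google.com")` run on a schedule would give a running picture of the local ISP connection without needing a terminal window left open.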

Corrections: i) upgrade from asymmetric-bandwidth consumer DSL (with low upstream bandwidth) to fiber or business-grade connections with symmetrical speeds and low latency; ii) move the machines to another location with better connectivity.

D) Outside the user's control, but potentially within algorithmic control of the Node code by self-optimizing on peer choice.

Measurements: regular app-level round-trip times between each node and its peers. Temporarily ban peers to which there is high or variable latency. Ask peers which of their peers are well-connected. Try to build a smart self-adjusting mesh of high-bandwidth reliable peers, with less-well-connected peers as leaf nodes off those. In other words, all nodes should "demand" and iterate through peers such that some minimum percentage (30%?) of their peers are well-connected, and each node should tolerate some number (15? 20?) of peers with worse latency or loss rate.
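To make the rule concrete, here is a small Python sketch of one possible selection pass. It is not how the Iron Fish peer manager works; `PeerStats`, `is_well_connected`, `peers_to_drop`, and the RTT and loss cutoffs are all hypothetical. It returns the poorly connected peers to rotate out (worst first) until at least the minimum fraction of remaining peers is well-connected and no more than the tolerated number of poor peers remains.

```python
from dataclasses import dataclass


@dataclass
class PeerStats:
    peer_id: str
    median_rtt_ms: float   # app-level round-trip time to this peer
    loss_rate: float       # fraction of heartbeats that went unanswered


def is_well_connected(peer, max_rtt_ms=150.0, max_loss=0.01):
    """Hypothetical cutoffs; real values would need measurement."""
    return peer.median_rtt_ms <= max_rtt_ms and peer.loss_rate <= max_loss


def peers_to_drop(peers, min_good_fraction=0.30, max_poor_peers=20):
    """Return ids of poor peers to rotate out, worst first."""
    good = [p for p in peers if is_well_connected(p)]
    # Crude ordering: highest loss first, then highest RTT.
    poor = sorted(
        (p for p in peers if not is_well_connected(p)),
        key=lambda p: (p.loss_rate, p.median_rtt_ms),
        reverse=True,
    )
    drop = []
    while poor and (
        len(poor) > max_poor_peers
        or len(good) < min_good_fraction * (len(good) + len(poor))
    ):
        drop.append(poor.pop(0).peer_id)
    return drop
```

Dropped peers would then be temporarily banned and replaced with new candidates, and the pass repeated on the next measurement interval.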

Node-propagation optimization:

There must be some published papers on optimizing a mesh network for rapid bidirectional propagation and monitoring of consistency of connections even as peers come and go or Internet congestion comes and goes.

I hope there is material in the blockchain/crypto space already.

Some of that literature might also be found in network routing algorithms, like OSPF or BGP, though neither of those is optimized for short multi-hop propagation delay.

This must be a solved problem that's "only a small matter of coding" to embed known-good practices (or at least make the next iterative Iron Fish improvement and be able to measure if it's indeed better).

crProductGuy commented 2 years ago

Additional idea: enable users to enter node/miner metadata to allow subsequent large-scale data analysis to discover any significant patterns in "losing mined blocks" after submitting them to the network, etc. Analysis could help inform the answers to the "what can I optimize" questions above. Of course this would be opt-in; users could enter as much or as little data as they like.

Example metadata that could be queried by telemetry: geographical location (nearest city), ISP upstream and downstream speeds and latency, use of a VPN to another target geography, VPN protocol, typical pingtime to a well-homed service (including what service was used, like Google or Yandex, etc.), model and clockrate of CPU, threads/cores in use, etc.
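A possible shape for such an opt-in record, sketched in Python. The field names are made up for illustration and are not an existing Iron Fish telemetry schema. Every field is optional, and only the fields the user actually filled in are reported.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class NodeMetadata:
    """User-entered, opt-in metadata (hypothetical field names)."""
    nearest_city: Optional[str] = None
    isp_down_mbps: Optional[float] = None
    isp_up_mbps: Optional[float] = None
    vpn_protocol: Optional[str] = None
    ping_target: Optional[str] = None        # which service was pinged
    typical_ping_ms: Optional[float] = None
    cpu_model: Optional[str] = None
    mining_threads: Optional[int] = None


def to_report(meta):
    """Keep only the fields the user chose to provide."""
    return {k: v for k, v in asdict(meta).items() if v is not None}
```

Leaving everything at `None` yields an empty report, which keeps "as much or as little data as they like" literal.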

Analysis might reveal patterns of poor results, which could then be analyzed to try to find the "why's".

crProductGuy commented 2 years ago

Network Ping Jitter to Google before-after stopping Mining with 100pct CPU - 2022-02-17

The screenshot shows evidence that high CPU loads can affect other processes performing network operations. I always have a terminal running ping -t google.com to monitor my last-mile, ISP, and VPN latency.

The upper 2/3 of the screenshot is while mining with all 16 threads on an 8-core Ryzen 5700G with lots of browser tabs open, too. Windows 11 Task Manager shows 100% CPU usage. Note the jitter in the ping times.

Bottom 1/3 of screenshot below the red line is after I stopped the miner process. Latency and jitter both dropped.

We conclude that a high level of CPU utilization can cause 3-10 ms of excess and variable latency between a CMD window process and the network interface. We cannot tell where in the data chain the extra latency is occurring: inbound, outbound, or both. With no mining, the Task Manager Processes tab shows that Brave Browser is taking 1.3-2.0% of CPU, the Iron Fish Node process another 1.7%, and Ryzen Master another 1-1.3%, while the rest of the system is largely idle.

lwisne commented 2 years ago

We have all kinds of needs around metrics and observability - will be opening more requests like this one soon.

iron-fish / ironfish

Feature request: user-facing observability to identify config or performance issues, and indicate what a user could optimize #988
