lrettig commented 4 years ago

Telemetry (and Logging)

Overview

This SMIP proposes adding telemetry to go-spacemesh. This SMIP answers the questions: Why, When, and What, but not How.

What is Telemetry

Telemetry is the process of collecting useful data about the system. To decide what telemetry must collect, and when, we first need to know why.

Why

Telemetry can help in the following areas:

- Marketing research: to know who the typical smesher is.
- Development research: to know how the spacemesh network really works and to test protocol viability.
- Monitoring: to respond quickly to issues and incidents.
- Development: to test and debug efficiently with a test network (CI integration?).
- Products and services: e.g. public IP addresses of opted-in full nodes for geolocation.

When and What

For the telemetry topics above, there is a set of events at which telemetry is collected, and for each event a set of data that could be collected.

When the spacemesh client is starting (or is being installed?), telemetry collects:

Hardware information:
- Memory:
  - Total size
- CPU:
  - Name in Operating System (like 'Intel Core i7 9750H')
  - Vendor (like Intel/AMD)
  - Total count of cores/threads for all CPUs (like 6/12)
  - Rated frequency (like 2.6 GHz)
  - Extensions (like AVX, AVX2, AVX512)
- For every GPU:
  - Name in Operating System (like 'NVIDIA GeForce GTX 1660 Ti')
  - Vendor (like Nvidia/Intel/AMD)
  - Architecture (like TU116)
  - Driver Version (like 451.67)
  - Memory Size
  - CUDA/OpenCL/DirectML support
- For every disk in system:
  - Total free size and percentage of free space
  - Device Kind (HDD/SSD/RAIDx/...)
  - Name in Operating System (like 'Samsung SSD 970 EVO')
Operating system information:
- OS Name
- System/Kernel version
- Uptime
Network information:
- IP/Netmask/Gateway. (if full telemetry is enabled)
- Routing path to telemetry server. (if full telemetry is enabled)
- Ping to telemetry server min/max

When mining space generation is done:

Mining space generation statistics:
- Total generation time
- Method
  - Device kind (like CPU/GPU)
  - Name in Operating System (like 'NVIDIA GeForce GTX 1660 Ti')
  - Used extensions (like SSE, AVX, AVX2, OpenCL, CUDA, DirectML ...)
- Storage
  - Device kind (like HDD/SSD/RAIDx/...)
  - Name in Operating System (like 'Samsung SSD 970 EVO')
  - Total device size
  - Used partition size
  - Mining space size

When initial sync is done:

Sync statistics:
- Count of synchronized layers
- Synchronization speed (layers per minute)
- Total synchronization time
- Total databases size (?)

When the client is running the telemetry collects for every timeframe (one layer interval?):

System load:
- CPU user load min/max
- CPU system load min/max
- GPU load min/max
- Disk I/O load min/max
- Network I/O load min/max
- System memory allocated/free min/max
- Go memory allocation information
Synchronization:
- Current layer
- Last synchronized layer
- Total size of databases (?)
P2P statistics and current topology:
- Active peers per minute
- Packet count recv/sent (for every packet type?)
- Packet count recv/sent per minute (min/max/avg for every packet type?)
- For every peer connected during the timeframe:
  - Target nodeID
  - First time of connect
  - Last time of disconnect and reason
  - Count of (re)connections with this peer during the timeframe
  - All reasons for disconnects during timeframe (bitmask)
  - Target IP address (if full telemetry is enabled)
  - Routing path (up to 8 hops) (if full telemetry is enabled)
  - Ping min/max
  - Network bytes transferred during the timeframe recv/sent
Consensus (if finished during timeframe):
- Hare statistics:
  - count of rounds
  - whether the final round succeeded in achieving consensus
  - count of dropped/invalid messages
  - eligibility/participation in every step for the last (every?) round
  - common weighted parameters for consensus
- Tortoise statistics:
  - TBD
- PoET statistics:
  - TBD

When a fatal error or panic/crash occurs:

- current timeframe telemetry
- error information (call stack, error message, error type, error source protocol)
- memory contents (?)
- binary logs (?)
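
For the panic case, the error message and call stack can be captured in Go with `recover` and `runtime/debug.Stack`. The sketch below is only an illustration under assumptions: `CrashReport` and `Guard` are hypothetical names, and how the report would be sent is out of scope. Note that `recover` only catches panics raised in the same goroutine, so each long-running goroutine would need its own guard.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// CrashReport is a hypothetical container for the error information
// collected when a panic occurs.
type CrashReport struct {
	Message string
	Stack   string
}

// Guard runs fn and, if fn panics, returns a CrashReport instead of
// letting the panic propagate. It returns nil when fn completes normally.
func Guard(fn func()) (report *CrashReport) {
	defer func() {
		if r := recover(); r != nil {
			report = &CrashReport{
				Message: fmt.Sprint(r),
				Stack:   string(debug.Stack()),
			}
		}
	}()
	fn()
	return nil
}

func main() {
	report := Guard(func() {
		panic("hare: unexpected state")
	})
	if report != nil {
		fmt.Println("captured panic:", report.Message)
		fmt.Printf("stack trace is %d bytes long\n", len(report.Stack))
	}
}
```

A real node would probably send the report and then exit or re-panic rather than continue running in an unknown state.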

When a non-fatal error occurs:

just the error information

When debugging/testing assertion event occurs:

just the assertion information

Logs pulling

For development and for responding to issues and incidents, it can be very useful to have conditional log pulling, i.e. collecting logs on demand from chosen nodes.
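
One possible reading of "conditional" here is that the collector picks nodes whose reported metrics match some predicate and requests logs only from those. A minimal Go sketch under that assumption follows; `NodeMetrics`, `Condition`, `LaggingBy`, and `SelectNodes` are hypothetical names, and the actual pull mechanism is not shown.

```go
package main

import "fmt"

// NodeMetrics is a hypothetical subset of the per-timeframe metrics
// a node reports.
type NodeMetrics struct {
	NodeID          string
	CurrentLayer    uint32
	LastSyncedLayer uint32
}

// Condition decides whether logs should be pulled from a node.
type Condition func(NodeMetrics) bool

// LaggingBy matches nodes whose last synced layer is at least
// `layers` behind the current layer.
func LaggingBy(layers uint32) Condition {
	return func(m NodeMetrics) bool {
		return m.CurrentLayer > m.LastSyncedLayer &&
			m.CurrentLayer-m.LastSyncedLayer >= layers
	}
}

// SelectNodes returns the IDs of all nodes that satisfy cond.
func SelectNodes(nodes []NodeMetrics, cond Condition) []string {
	var ids []string
	for _, n := range nodes {
		if cond(n) {
			ids = append(ids, n.NodeID)
		}
	}
	return ids
}

func main() {
	nodes := []NodeMetrics{
		{NodeID: "a", CurrentLayer: 100, LastSyncedLayer: 99},
		{NodeID: "b", CurrentLayer: 100, LastSyncedLayer: 80},
		{NodeID: "c", CurrentLayer: 100, LastSyncedLayer: 100},
	}
	fmt.Println("pull logs from:", SelectNodes(nodes, LaggingBy(10)))
}
```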

lrettig commented 4 years ago

Some open questions:

- Is IP/node location information explicit (i.e., included in logs) or implicit (collected by the data collection server based on inbound connection data)?
- Where is the network ID included?

y0sher commented 4 years ago

Q: What made you decide to use InfluxDB instead of Elasticsearch and Prometheus, both of which we're already using? Are there advantages to switching completely to Influx over tuning the current infra?

In other words, why not use our existing telemetry infra?

YaronWittenstein commented 4 years ago

For logs - please also consider Loki. It's from the Grafana team.

avive commented 4 years ago

System metrics (WIP) - will be updated in-place

Here's an initial spec for system metrics.

- CPU maker, model and number of cores / native threads.
- System total RAM.
- Enumerate system GPUs (modern systems have at least 2 GPUs: motherboard + 1 card) and list the full model name for each GPU, e.g. GeForce RTX 2080. NTH: for each GPU, GPU RAM, e.g. 4GB.
- Enumerate system volumes. For each: capacity, free space. NTH: type, e.g. HDD / SSD.
- OS full name and version, e.g. Windows 10 Home build u.x.y.z.
- Computer Internet connection public IP address (so we can geo-decode to lat/long for the dashboard geo heatmap).
- NTH: some estimate of the system's Internet connection capacity, e.g. upstream and downstream bandwidth. Might not be trivial to obtain this from OSes.

y0sher commented 4 years ago

> For logs - please also consider Loki. It's from the Grafana team.

Also, it should work well on the same platform as Prometheus.

lrettig commented 4 years ago

@antonlerner asked on this morning's call, do we want to handle telemetry in-process in go-spacemesh, or do we want instead to continue to collect logs/events/metrics out of process (as we're doing now with logbeat/elastic search/kibana)? I had been operating under the assumption that we want to do it in-process, but it's a valid question and I can see arguments in both directions.

Please let us know if you have thoughts or preferences on this point.

avive commented 4 years ago

I don’t understand the separate process option. From a product perspective running a Spacemesh full node should always be one process and the user should never be asked to run or manage another system-level process. When running the app, the app should only manage 1 go-spacemesh process and never more - we are having enough issues just with managing one such as not being able to terminate it in some cases. It is quite hard to manage multiple processes on end-user machines - this is also the reason why desktop apps are always encapsulated in one. This is also one of the main reasons @moshababo and @YaronWittenstein decided to have the post commitment setup process inside go-sm and not a separate system process. So, from a product perspective having to run another external telemetry process to send go-sm telemetry is not really a viable option.

YaronWittenstein commented 4 years ago

AFAIK the tendency today is to separate the application's logging code from the physical aspect. So usually, applications write to STDOUT and then another process (a.k.a. the forwarder) is in charge of taking that data and forwarding it outside the machine/container boundaries.

The forwarder can, of course, do some basic filtering and normalization to the raw logs but overall, it will be detached from the application logic.

The forwarder can send different logs to other sinks by some basic filtering. For example: type=A goes to ElasticSearch and type=B goes to InfluxDB.
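
As a rough illustration of that kind of type-based routing (not code from any existing forwarder), here is a Go sketch. `Record`, `Sink`, `MemorySink`, and `Forwarder` are hypothetical names, and `MemorySink` is an in-memory stand-in for a real Elasticsearch or InfluxDB client.

```go
package main

import "fmt"

// Record is a single log entry tagged with a type used for routing.
type Record struct {
	Type string
	Line string
}

// Sink is anything that can accept records, e.g. a database client.
type Sink interface {
	Write(Record) error
}

// MemorySink collects records in memory; a stand-in for a real backend.
type MemorySink struct {
	Name    string
	Records []Record
}

// Write appends the record to the in-memory slice.
func (s *MemorySink) Write(r Record) error {
	s.Records = append(s.Records, r)
	return nil
}

// Forwarder routes each record to a sink chosen by the record's type.
type Forwarder struct {
	routes   map[string]Sink
	fallback Sink // used when no route matches; may be nil
}

// NewForwarder creates a forwarder with an optional fallback sink.
func NewForwarder(fallback Sink) *Forwarder {
	return &Forwarder{routes: make(map[string]Sink), fallback: fallback}
}

// Route registers the sink that receives records of the given type.
func (f *Forwarder) Route(recordType string, s Sink) { f.routes[recordType] = s }

// Forward sends r to the matching sink, the fallback, or returns an error.
func (f *Forwarder) Forward(r Record) error {
	if s, ok := f.routes[r.Type]; ok {
		return s.Write(r)
	}
	if f.fallback != nil {
		return f.fallback.Write(r)
	}
	return fmt.Errorf("no sink for record type %q", r.Type)
}

func main() {
	es := &MemorySink{Name: "elasticsearch"}
	influx := &MemorySink{Name: "influxdb"}
	f := NewForwarder(nil)
	f.Route("A", es)
	f.Route("B", influx)

	for _, r := range []Record{{"A", "layer synced"}, {"B", "cpu=0.42"}, {"C", "unrouted"}} {
		if err := f.Forward(r); err != nil {
			fmt.Println("forward error:", err)
		}
	}
	fmt.Printf("%s got %d record(s), %s got %d record(s)\n",
		es.Name, len(es.Records), influx.Name, len(influx.Records))
}
```

Because the application only ever sees the `Sink` interface, the same routing logic works whether it runs in a separate forwarder process or inside the node itself.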

Another option is to forward all logs to one centralized place and there do the matching of what goes where.

If we want end users to log data too, it seems that packaging everything into go-spacemesh is the way to go. Otherwise, logging to STDOUT should be the preferred way, I believe.

antonlerner commented 4 years ago

Separating the logging collector service from our service seems very right to me, since incorporating non-essential code in our executable can cause issues in our run that are unrelated to the node itself. Also, since collecting and sending metrics will eventually be optional, I see no reason why we should fiddle with our running process to enable or disable this feature... I also support @YaronWittenstein's suggestion, which is running a forwarding agent alongside our code. Moreover, same as @y0sher, I'd like to understand what is missing in our current logging solution that requires a redesign of the logging services.

avive commented 4 years ago

@antonlerner, Yaron actually said above: "If we want the end-users to log data too it seems that packaging everything into go-spacemesh is the way to go."

We don't have the luxury of running multiple processes (and another process to manage them - oy vei) on end-user machines. There is not even a single desktop app that works this way that I'm aware of, and most desktop apps have data reporting / feedback features. The sm full node must be 1 system executable with sub processes. Your arguments are very correct for server-side apps where you have full dev-ops control over them, but we can't do this on the client side. Every feature we decide to put in the full node is a full node feature, including telemetry and data reporting. What you say may have some pros, but the cons of managing multiple system processes, which are mentioned above in this thread, are more serious.

YaronWittenstein commented 4 years ago

> @antonlerner, yaron actually said above If we want the end-users to log data too it seems that packaging everything into go-spacemesh is the way to go.
>
> We don't have the luxury to run multiple processes (and another process to manage them - oy vei) on end user machines. There is not even a single desktop app that works this way that I'm aware of, and most desktop apps have data reporting / feedback features. The sm full node must be 1 system executable with sub processes. Your arguments are very correct for server-side apps where you have full dev-ops control over them but we can't do this on the client-side. Every feature we decide to put in the full node is a full node feature, including telemetry and data reporting. What you say may have some pros but the cons to manage multiple system processes are more serious cons which are mentioned above in this thread.

I agree - we need to strive not to have a distributed system...

lrettig commented 4 years ago

@ilans suggested this morning that, rather than shipping log data and metrics directly to a specific log engine (like influxdb, logbeat/kibana/elasticsearch/whatever), we instead ship raw logs to a backend that we write and control, and that can subsequently route those logs wherever we want.

The upside here is more flexibility and control on our part. The frontend code would be simpler.

The downside is that it's another piece of infrastructure to spec, design, build, and maintain - and it could also be a burden on third parties that want to run go-spacemesh and collect log data on their own (i.e., another piece of software they need to run).

avive commented 4 years ago

Whatever you guys decide on this feature, please no external processes to go-spacemesh on clients - it is going to be unmanageable. This is a full node feature that is going to be turned on by default for all testnet users and should be able to be turned off via a node CLI flag or the upcoming new API.

lrettig commented 4 years ago

@avive I understand your concerns from a product perspective, but I think it's a bit orthogonal to the main question here - because we can always ship the node software as a wrapper around, e.g., go-spacemesh plus another (optional) logging process. The user would still see a single executable, or, more likely, a single start/stop script. Does that make sense?

avive commented 4 years ago

I don't think that we need any log pulling features, at least not in the first version of telemetry, and I think that just sending telemetry data for several metrics will be sufficient. We have no issue getting logs from testnet users. The focus of telemetry in the short term is to get operational data that is not in the logs and to be able to adjust the collected data easily for measurements. We also need to be careful about performance here. We need to make sure that having testnet users opt in to telemetry will not lead to significant use of CPU and networking resources on an ongoing basis (another reason to exclude log shipping for now). Also, we don't want to add another process that home smeshers will need to run which is not go-spacemesh. Telemetry should be a go-spacemesh feature. I talked about these requirements with @noamnelke recently - I hope he can chime in as soon as he's available.

avive commented 4 years ago

To clarify a bit, I think that this task should focus on a specific data collection flow from inside a full node, one that does not depend on logs or log shipping to establish metrics. Once we have infrastructure to collect arbitrary data, we can go ahead and specify exactly what data we want from non-managed nodes and when we want it - I imagine this will change over time as we'll want to test different things on different testnets, e.g. self-healing performance on non-managed nodes. So I would split this into 2 milestones. Milestone 1 - infrastructure for collecting metrics from non-managed nodes, with the ability to query and display them in a managed way (e.g. a Kibana interface), tested with some node data. Milestone 2 - define exactly what data we want to collect and send to the metrics service for the open testnet, and at what frequency.

sudachen commented 4 years ago

@aviv As I understand it, telemetry is exactly about collecting metrics and logs from the network, not about internal flow. Also, I'm sure that logs and network metrics are not the same thing, and logs are optional and conditional, depending on the case. So the current way, where metrics (and derived logic like assertions) follow from logs, is really wrong. But it's just my view, and I'd like to discuss it with others.

By the way, I'm not sure it requires splitting into milestones in this way. I'd like to implement fully working telemetry this month. Of course, the metrics can and will be extended/modified, but what I specified here is undoubtedly required, as I see it. However, it should come from the cases we have, not from imagination.

Related to implementation: in my view, the telemetry protocol must not depend on the backend storage and user interface of the dashboard/console. I really think that storage and interface are second priority after the telemetry implementation in nodes. They can be changed/improved at any moment. At the start, it can be just a lambda/google-function + event-queue gateway with any hosted database and a CLI console utility.

I think we need log pulling because it allows us to resolve strange cases quickly, or at least to resolve them :-) Also, why just pulling? Because we do not need logs from every node, so there is no need for every node to send them. But we can auto-conditionally choose the nodes from which we need logs based on the current network/node metrics.

avive commented 4 years ago

@sudachen All your comments on mine make sense.

avive commented 4 years ago

Feedback on the public IP address reporting data field in the spec. There should be a separate flag for opting in to reporting personally identifiable information, separate from the node flag for reporting operational data. If the node is opted in, then set a boolean flag in the node's report indicating that the user agreed to IP address collection, and obtain the public IP address from the remote endpoint of the telemetry report's network connection, as the node may not know its public IP address. We can't store this IP address unless the user has opted in to this. On testnet it will be opt-in by default and on main-net opt-out by default to preserve and respect users' privacy.

noamnelke commented 4 years ago

@avive I think you flipped opt-in and opt-out.

opt-in == you don't report by default, but you can opt-in to reporting. opt-out == you do report by default, but you can opt-out of reporting.

We want opt-out in the testnet (everyone reports by default unless they specifically ask not to) and opt-in for mainnet (nobody reports anything unless they proactively asked to send reports).
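
To make the corrected defaults concrete, here is a minimal Go sketch. It covers two points from this exchange: reporting defaults to on for testnet and off for mainnet unless the user has chosen explicitly, and the collector takes the IP address from the connection's remote endpoint and keeps it only with consent. All names (`Network`, `ReportingEnabled`, `StoredIP`) are hypothetical, and the remote address string stands in for what a server would read from the incoming connection.

```go
package main

import (
	"fmt"
	"net"
)

// Network identifies which network the node runs on.
type Network int

const (
	Testnet Network = iota
	Mainnet
)

// ReportingEnabled applies the defaults from the discussion:
// opt-out on testnet (on unless disabled), opt-in on mainnet
// (off unless enabled). userChoice is nil when the user has not
// set the flag explicitly.
func ReportingEnabled(n Network, userChoice *bool) bool {
	if userChoice != nil {
		return *userChoice
	}
	return n == Testnet
}

// StoredIP returns the IP address the collector is allowed to store.
// remoteAddr is the "host:port" remote endpoint of the telemetry
// connection; the address is dropped unless the user consented.
func StoredIP(remoteAddr string, ipConsent bool) (string, error) {
	if !ipConsent {
		return "", nil
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return "", err
	}
	return host, nil
}

func main() {
	fmt.Println("testnet default:", ReportingEnabled(Testnet, nil))
	fmt.Println("mainnet default:", ReportingEnabled(Mainnet, nil))

	// 203.0.113.0/24 is a documentation address range.
	ip, err := StoredIP("203.0.113.7:51234", true)
	if err != nil {
		fmt.Println("bad remote address:", err)
		return
	}
	fmt.Println("stored IP with consent:", ip)
}
```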

lrettig commented 4 years ago

> the telemetry protocol must not depend on the backend storage and user interface of the dashboard/console. I really think that storage and interface have the second priority after the telemetry implementation in nodes. It can be changed/improved at any moment

This sort of loose coupling makes a lot of sense to me

> we can auto-conditionally choose nodes from where we need a log by the current network/nodes metrics.

@sudachen could you expand on this a little? It sounds interesting but I don't fully understand.

spacemeshos / SMIPS

SMIP: Telemetry and logging (WIP) #10
