netobserv / network-observability-operator

An OpenShift / Kubernetes operator for network observability
Apache License 2.0
133 stars 24 forks source link
kubernetes observability operator

NetObserv Operator

GitHub release (latest by date) Go Report Card

NetObserv Operator is a Kubernetes / OpenShift operator for network observability. It deploys a monitoring pipeline that consists in:

Flow data is then available in multiple ways, each optional:

Getting Started

You can install NetObserv Operator using OLM if it is available in your cluster, or directly from its repository.

Install with OLM

NetObserv Operator is available in OperatorHub with guided steps on how to install this. It is also available in the OperatorHub catalog directly in the OpenShift Console.

OpenShift OperatorHub search

Please read the operator description in OLM.

After the operator is installed, create a FlowCollector resource:

OpenShift OperatorHub FlowCollector

Refer to the Configuration section of this document.

Install from repository

A couple of make targets are provided in this repository to allow installing without OLM:

git clone https://github.com/netobserv/network-observability-operator.git && cd network-observability-operator
USER=netobserv make deploy deploy-loki deploy-grafana

It will deploy the operator in its latest version, with port-forwarded Loki and Grafana (they are optional).

Note: the loki-deploy script is provided as a quick install path and is not suitable for production. It deploys a single pod, configures a 10GB storage PVC, with 24 hours of retention. For a scalable deployment, please refer to our distributed Loki guide or Grafana's official documentation.

To deploy the monitoring pipeline, this make target installs a FlowCollector with default values:

make deploy-sample-cr

Alternatively, you can grab and edit this config before installing it.

You can still edit the FlowCollector after it's installed: the operator will take care about reconciling everything with the updated configuration:

kubectl edit flowcollector cluster

Refer to the Configuration section of this document.

With or without Loki?

Historically, Grafana Loki was a strict dependency but it isn't anymore. If you don't want to install it, you can still get the Prometheus metrics, and/or export raw flows to a custom collector. But be aware that some of the Console plugin features will be disabled. For instance, you will not be able to view raw flows there, and the metrics / topology will have a more limited level of details, missing information such as pods or IPs.

OpenShift Console

Pre-requisite: OpenShift 4.10 or above

If the OpenShift Console is detected in the cluster, a console plugin is deployed when a FlowCollector is installed. It adds new pages and tabs to the console:

Overview metrics

Charts on this page show overall, aggregated metrics on the cluster network. The stats can be refined with comprehensive filtering and display options. Different levels of aggregations are available: per zone, per node, per namespace, per owner or per pod/service. For instance, it allows to identify biggest talkers in different contexts: top X inter-namespace flows, or top X pod-to-pod flows within a namespace, etc.

The watched time interval can be adjusted, as well as the refresh frequency, hence you can get an almost live view on the cluster traffic. This also applies to the other pages described below.

Overview

Topology

The topology view represents traffic between elements as a graph. The same filtering and aggregation options as described above are available, plus extra display options e.g. to group element by node, namespaces, etc. A side panel provides contextual information and metrics related to the selected element.

Topology This screenshot shows the NetObserv architecture itself: Nodes (via eBPF agents) sending traffic (flows) to the collector flowlogs-pipeline, which in turn sends data to Loki. The NetObserv console plugin fetches these flows from Loki.

Flow table

The table view shows raw flows, ie. non aggregated, still with the same filtering options, and configurable columns.

Flow table

Integration with existing console views

These views are accessible directly from the main menu, and also as contextual tabs for any Pod, Deployment, Service (etc.) in their details page, with filters set to focus on that particular resource.

Contextual topology

Configuration

The FlowCollector resource is used to configure the operator and its managed components. A comprehensive documentation is available here, and a full sample file there.

To edit configuration in cluster, run:

kubectl edit flowcollector cluster

As it operates cluster-wide on every node, only a single FlowCollector is allowed, and it has to be named cluster.

A couple of settings deserve special attention:

Metrics

More information on Prometheus metrics is available in a dedicated page: Metrics.md.

Performance fine-tuning

In addition to sampling and using Kafka or not, other settings can help you get an optimal setup without compromising on the observability.

Here is what you should pay attention to:

Loki

The FlowCollector resource includes configuration of the Loki client, which is used by the processor (flowlogs-pipeline) to connect and send data to Loki for storage. They impact two things: batches and retries.

On the Loki server side, configuration differs depending on how Loki was installed, e.g. via Helm chart, Loki Operator, etc. Nevertheless, here are a couple of settings that may impact the flow processing pipeline:

With Kafka

More performance fine-tuning is possible when using Kafka, ie. with spec.deploymentModel set to Kafka:

Securing data and communications

Authorizations

NetObserv is meant to be used by cluster admins, or, when using the Loki Operator (v5.7 or above), project admins (ie. users having admin permissions on some namespaces only). Multi-tenancy is based on namespaces permissions, with allowed users able to get flows limited to their namespaces. Flows across two namespaces will be visible to them as long as they have access to at least one of these namespaces.

Since FlowCollector v1beta2, NetObserv is automatically configured with multi-tenancy enabled when spec.loki.mode is LokiStack.

To give flow logs access to a test user, run:

oc adm policy add-cluster-role-to-user netobserv-reader test

More information about multi-tenancy can be found on this page.

Note that multi-tenancy is not possible without using the Loki Operator.

Network Policy

For a production deployment, it is also highly recommended to lock down the netobserv namespace (or wherever NetObserv is installed) using network policies. An example of network policy is provided here.

Communications

By default, communications between internal components are not secured. Note that, when using the Loki Operator, securing communication with TLS is necessary. There are several places where TLS can be set up:

Development & building from sources

Please refer to this documentation for everything related to building, deploying or bundling from sources.

F.A.Q / Troubleshooting

Please refer to F.A.Q / Troubleshooting main document.

Contributions

This project is licensed under Apache 2.0 and accepts contributions via GitHub pull requests. Other related netobserv projects follow the same rules:

External contributions are welcome and can take various forms: