The0mikkel commented 10 months ago

Introduction to the problem

As HTX-LAN has been using this awesome piece of software for some years now, we wanted more insight into how the cache is utilized live during our LAN parties.

The provided logs did show some HIT and MISS but the actual throughput and performance of the tools were invisible.

Idea

We wanted to export metrics from Bind (DNS) and NGINX in a form that is easy to use in Grafana through Prometheus.

Solution

To implement this, we have updated the infrastructure in three services (repositories).

lancache-dns

The first repository we have updated is lancache-dns.
Here we have added statistics logging to the Bind configuration file; the statistics are exposed on port 8053 (TCP).

The core Bind functionality is unchanged. Collecting statistics does create some extra load, but that load is very limited in scope.

This endpoint is not intended to be exposed publicly, but to be used by an exporter, which we have included in the docker-compose repository.

monolithic

The second repository we have updated is monolithic.
Here we have added an NGINX status endpoint that returns the current status of the NGINX instance used in the monolithic setup.
This status endpoint is set up as a standalone site configuration (30_metrics.conf).

Through this site configuration, we expose the status endpoint on port 8080 (TCP), with the help of the stub_status functionality in NGINX. By doing it on a standalone site, the endpoint can easily be disabled by not opening the port in the container.

This endpoint is not intended to be exposed publicly, but to be used by an exporter, which we have included in the docker-compose repository.
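To illustrate what an exporter reads from this endpoint, here is a minimal sketch that parses a stub_status response. It assumes the standard stub_status text layout; the sample numbers are made up, and the function name is hypothetical and not part of any of the repositories.

```python
import re

# Sample response in the standard stub_status layout; the numbers are made up.
SAMPLE_STATUS = (
    "Active connections: 3 \n"
    "server accepts handled requests\n"
    " 120 120 450 \n"
    "Reading: 0 Writing: 1 Waiting: 2 \n"
)


def parse_stub_status(text: str) -> dict[str, int]:
    """Turn a stub_status response into a dict of integer counters."""
    active = re.search(r"Active connections:[ \t]+(\d+)", text)
    # The three totals sit alone on one line: accepts, handled, requests.
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:[ \t]+(\d+)[ \t]+Writing:[ \t]+(\d+)[ \t]+Waiting:[ \t]+(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(n) for n in totals.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


stats = parse_stub_status(SAMPLE_STATUS)
```

In the actual stack this parsing is done by nginx/nginx-prometheus-exporter; the sketch only shows how little information the endpoint exposes.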

docker-compose

The third repository we have updated is docker-compose, which is the repository in which this issue is created.

The update contains multiple parts:

Exporters

The first part is two new services that export the metrics from the lancache-dns and monolithic services.
The two exporters are prometheuscommunity/bind-exporter, which converts the Bind statistics to a format that Prometheus can use, and nginx/nginx-prometheus-exporter, which exports the NGINX statistics.

These two exporters are, respectively, community-built and official NGINX software. We therefore expect them to be maintained and updated in the future, which makes them a better fit for this use case than building and maintaining custom exporters.

Network segregation

To build an adequate docker stack for this purpose, and ensure security in the network, we have segregated the network into three parts.

The first part is the main default network. It binds the main services together (lancache-dns and monolithic).

The second network is the dns-metrics network. It handles the connection between lancache-dns (more specifically Bind and the service exposed on port 8053) and the prometheuscommunity/bind-exporter service.

The third network is the nginx-metrics network. It handles the connection between the monolithic service (and its status endpoint on port 8080) and the nginx/nginx-prometheus-exporter service.

With this segregated network, the exporters only have the access they absolutely need and no more, which follows the principle of least privilege.

The setup can be simplified to a single network, at the cost of no longer following the principle of least privilege.
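The effect of the segregation can be sketched as a small reachability model. This is only an illustration of the reasoning above: the exporter service names (`bind-exporter`, `nginx-exporter`) are placeholders and may differ from the names used in the compose file.

```python
# Network membership as described above. The exporter service names are
# placeholders chosen for this sketch.
NETWORKS = {
    "default": {"lancache-dns", "monolithic"},
    "dns-metrics": {"lancache-dns", "bind-exporter"},
    "nginx-metrics": {"monolithic", "nginx-exporter"},
}


def can_reach(a: str, b: str, networks: dict[str, set[str]] = NETWORKS) -> bool:
    """Two containers can talk only if they share at least one network."""
    return any(a in members and b in members for members in networks.values())


# The simplified alternative: every service on one shared network.
SINGLE_NETWORK = {"default": set().union(*NETWORKS.values())}
```

With the three networks, `can_reach("bind-exporter", "monolithic")` is false; with `SINGLE_NETWORK` every service can reach every other one, which is the trade-off mentioned above.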

Healthcheck

To ensure the exporters start correctly, we have implemented health checks on 3 out of 4 of the services in the stack.
The health checks are simple curl (for the two Lancache services) and wget (for prometheuscommunity/bind-exporter) commands that check that the service is available and running correctly.

The nginx/nginx-prometheus-exporter does not include a health check, because the container image used for it does not support one.

Health checks in general give better knowledge of the system status. In this case, they also allow us to configure the exporters to start only once the services they export from are running correctly, which limits the need for restarts.
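The gating idea can be sketched independently of Docker: a dependent service is started only after a probe of its upstream succeeds. In the stack itself Docker handles this through the health checks; the function name, parameters, and retry policy below are hypothetical.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool], retries: int = 5, delay: float = 1.0
) -> bool:
    """Call probe() up to `retries` times; return True as soon as it succeeds."""
    for attempt in range(retries):
        if probe():
            return True
        # Do not sleep after the final failed attempt.
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

An exporter would be started only when `wait_until_healthy(...)` returns true for the service it scrapes, instead of starting immediately and restarting until the upstream is ready.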

`METRIC_BIND_IP`

To help segregate the metrics, and to some degree limit their availability, we have added a new environment variable to the stack. METRIC_BIND_IP can be set to a specific IP address that the exporters will bind to. This moves the metrics endpoints to that address instead of the DNS IP, which is their default.

This is documented in both the README.md and the .env file.
In the example file, it is left empty, such that it falls back to DNS_BIND_IP.
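The fallback rule can be written out as a short sketch. This only mirrors the behaviour described above (an empty or missing METRIC_BIND_IP falls back to DNS_BIND_IP); the function name is hypothetical, the addresses are examples, and the compose file may implement the fallback differently.

```python
def resolve_metric_bind_ip(env: dict[str, str]) -> str:
    """Return METRIC_BIND_IP if set and non-empty, otherwise DNS_BIND_IP."""
    metric_ip = env.get("METRIC_BIND_IP", "").strip()
    return metric_ip or env["DNS_BIND_IP"]


# Example values only; use the addresses from your own .env file.
default_case = resolve_metric_bind_ip({"DNS_BIND_IP": "10.0.0.2", "METRIC_BIND_IP": ""})
separate_case = resolve_metric_bind_ip(
    {"DNS_BIND_IP": "10.0.0.2", "METRIC_BIND_IP": "10.0.0.9"}
)
```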

Prometheus and Grafana

To then use the metrics, Prometheus and Grafana are recommended. This is not included in any of the updates, but is a recommendation for the use of the metrics.

This has been briefly documented in the README.md of the docker-compose repository.

Through the exporters, two new endpoints are available that can be scraped by Prometheus, one for each Lancache service.

The endpoints are:

Service	Port
Bind (DNS)	9119
Monolithic	9113
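As a quick way to check what Prometheus would scrape, here is a minimal sketch that builds the two scrape URLs and parses the exporters' plain-text output. It assumes both exporters keep their default `/metrics` path and that sample lines carry no timestamp; the metric names in the sample are illustrative, and the helper names are hypothetical.

```python
# Ports taken from the table above.
EXPORTER_PORTS = {"bind": 9119, "monolithic": 9113}


def scrape_urls(host: str) -> dict[str, str]:
    """Build the scrape URL for each exporter (assumes the default /metrics path)."""
    return {name: f"http://{host}:{port}/metrics" for name, port in EXPORTER_PORTS.items()}


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {metric: value}."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated field (no timestamp assumed).
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


# Illustrative sample; real output contains many more metrics.
SAMPLE_METRICS = """\
# HELP nginx_connections_active Active client connections
# TYPE nginx_connections_active gauge
nginx_connections_active 3
nginx_http_requests_total 450
"""

urls = scrape_urls("10.0.0.2")  # example address
metrics = parse_metrics(SAMPLE_METRICS)
```

The text for `parse_metrics` would normally come from fetching one of the URLs, for example with `urllib.request.urlopen`, from a host that is allowed to reach the metrics IP.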

As an example of how this data can be used, we have created a simple Grafana dashboard that shows the data from the two endpoints.

[Image: grafana game screenshot]

We are still working on a more complete dashboard that shows more data. If it is wanted, we can include it in a later update.

Note: The example dashboard also uses cAdvisor to show the resource usage of the containers. cAdvisor is not included in the update and is only shown as an example of how the data can be used.

Pull requests

We have created the following pull requests to the repositories mentioned above, from our fork at HTX-LAN.

Pull requests:

Further improvement

The system may not be perfect, and there may be improvements that can be made. We have already thought of a few and bring them up as part of this issue, as they may depend on the wishes of the maintainers.

List of further improvements:

Patch the Docker Compose version to support Docker Compose profiles. This would allow the stack to be launched by profile, so that a simple setup or a more complex setup can be chosen with a single command, depending on the needs of the user.
Update the exporters to newer versions. The current versions are tested to work.
Documentation on the lancache.net website. This is not part of the update, but would be a good addition to the website, to show how the metrics can be used and to document this new feature of the stack.

We are open to any changes that may be needed to conform with the standards of the project and the needs of the maintainers. We do note that the current setup is going to be used in some form at the next HTX-LAN, and we would therefore like to keep at least the core functionality it adds.

Special thanks go to William Børresen, for being one of the main contributors to this update.

stale[bot] commented 5 months ago

This issue has been automatically marked as inactive because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The0mikkel commented 5 months ago

This is still awaiting review.
Since original creation, it has been tested and used at multiple LAN parties.

There seems to be an interest in getting these features. If not all, then some.

xodaaaa commented 3 months ago

I have been following the project. Are new updates for the exporters and the addition of Prometheus expected? Will a dashboard be released at some point? Sorry for my bad English. Thanks for the great work.

lancachenet / docker-compose

Expose metrics from NGINX and Bind to Prometheus #38