netdata / netdata-cloud

The public repository of Netdata Cloud. Contribute with bug reports and feature requests.
GNU General Public License v3.0
41 stars 16 forks source link

[Feat]: Provide add node instructions for Docker Swarm #567

Closed cakrit closed 2 years ago

cakrit commented 2 years ago

Problem

Our instructions to install a containerized Netdata and connect it to the cloud are not quite there for Docker Swarm. Specifically:

I'm asking the user who encountered the issues to assist us with the exact text and template we should add to the existing instructions.

As a reminder, we're talking about the section here and the relevant installation documentation in learn.

image

Description

See above

Importance

really want

Value proposition

Easier installation for Docker Swarm use cases

Proposed implementation

Add a paragraph below the existing paragraphs. Will wait for specific user input.

cakrit commented 2 years ago

cc @Ferroin

bluepuma77 commented 2 years ago

I had the issue that I only saw cryptic container IDs and no names. Forum helped to add docker.sock. I would just add the line to the CLI template pictured above (and to compose) so others don't need to start the same troubleshooting:

  -v /var/run/docker.sock:/var/run/docker.sock:ro

Some people consider exposing docker.sock as security risk and rather have a proxy in between. You could just add a sentence below, like "If you consider binding docker.sock to a container a security risk, you can remove binding docker.sock. Instead of container names you will see container IDs."

Would be nice to mention Docker Swarm, not sure if it needs an extra section. For Docker Swarm the deploy section is missing in the compose file. While at it, you could add resource limits to the file in case netdata goes crazy and doesn't overwhelm the node (not sure what reasonable values are). Docker Swarm should be able to deal with the volumes that are included in the file.

version: '3.8'
services:
  netdata:
    ...
    hostname: '{{.Node.Hostname}}'
    ...
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ...
    deploy:
      mode: global # single instance on every node
      resources:
        limits:
          cpus: '0.5'
          memory: 512M
        reservations:
          cpus: '0.2'
          memory: 256M 

PS: Little trick to set the container hostname automatically to the host's hostname. We use bare metal servers in a countable number, so our servers have usable hostnames (proxy-1, app-2, db-3), so this makes it easier for us.

    hostname: '{{.Node.Hostname}}'

Prometheus node-exporter does it differently, it simply binds the hostname file to the container

    volumes:
      - /etc/hostname:/etc/nodename 
cakrit commented 2 years ago

@Ferroin can you check if you agree with all of these suggestions so we can move ahead to make the changes?

Ferroin commented 2 years ago

For the Docker socket, I would argue it’s better to clearly mention the Docker socket requirement on the page itself, instead of explicitly including a line for it in the sample command. Anything that has full access to the socket can pretty easily get access to the host system as whatever user the Docker daemon is running as unless steps are explicitly taken to limit access. We can’t take those steps for the user, and we can’t know what steps they may have taken, so it’s risky to just blindly expose users like that, even if we lose some functionality by not doing so.

For the Docker Swarm thing, I generally agree, though it’s not really possible for us to provide sane defaults for resource limits as our resource usage is too dependent on the exact configuration of the system being monitored.

Entirely agreed on the hostname thing, though I would list it as a suggestion and not just put it directly in the example, as it kind of breaks if a user runs more than one container on the same host. The Prometheus Node Exporter approach is a bit more resilient, though it should be a read-only mount.

cakrit commented 2 years ago

@hugovalente-pm , @Ferroin will describe exactly what will need to change. Once he does, plz plan the change to the text as a low hanging fruit, so we can close it.

Ferroin commented 2 years ago

@hugovalente-pm

Taken together, my suggestion would be to:

We might also consider adding a link on that particular panel to our top-level documentation about running Netdata in Docker (https://learn.netdata.cloud/docs/agent/packaging/docker) directing users there if they run into any issues.

hugovalente-pm commented 2 years ago

thanks @Ferroin , to check I'll ask the help from someone on the FE team to tackle 1st and 3rd bullet and you will take on this one?

Add a sentence on the Docker claiming page in the Cloud along the lines of: ‘Note that Netdata will use the hostname from the container in which it is run instead of that of the host system.’, and include a link directly to https://learn.netdata.cloud/docs/agent/packaging/docker#change-the-default-hostname. I’ll also add some more info in that particular documentation about how to pull the hostname directly from the node the container is running on based on the suggestions above by @bluepuma77.

hugovalente-pm commented 2 years ago

@AlexNti as we discussed let's do:

Add a sentence on the Docker claiming page in the Cloud along the lines of: ‘Note that to correctly display container names when running Netdata as a Docker container, you must expose the Docker socket from the host inside the Netdata container.’, and include a link directly to https://learn.netdata.cloud/docs/agent/packaging/docker#docker-container-names-resolution.

add a note after the "Install..." image

Add a section after the Docker Compose section explaining that the Docker Compose example can be used with Docker Swarm by adding an appropriate deploy section, and that we recommend using the global deployment mode to have an instance running on each Docker Swarm node.

the difference is on services: netdata: area where we add at the bottom of it

...
    deploy:
      mode: global
...

let's put this when user clicks on a checkbox (or something) mentioning he is using Docker Swarm

version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    hostname: example.com # set to fqdn of host
    ports:
      - 19999:19999
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    environment:
      - NETDATA_CLAIM_TOKEN=wzxd7LeMHHiAhjipg3f5ZAT8fbOwN3mfCa4IKhqofehHK0zFKSJrwQjYbM_ZQekWhlarXpQSlPo9pDfbsY9sJka5ColpVVW-UJ6f8Br7m4zeSzLm65-0T7T2lOU3sVFrt4wHJ4M
      - NETDATA_CLAIM_URL=https://testing.netdata.cloud
      - NETDATA_CLAIM_ROOMS=
    deploy:
      mode: global

volumes:
  netdataconfig:
  netdatalib:
  netdatacache:

Also as we spoke let's:

Ferroin commented 2 years ago

@hugovalente-pm The second bullet point needs changes both on the FE side and in the agent docs. I can cover the agent documentation, but someone on the FE team will need to add the info for the Cloud.

ilyam8 commented 2 years ago

I would argue it’s better to clearly mention the Docker socket requirement on the page itself, instead of explicitly including a line for it in the sample command.

IMO should be completely opposite. It just doesn't work, otherwise (prooved by time). The default should include everything we need (e.g. docker container name resolution (mounting socket), docker container network iface monitoring (host pid mode, SYS_ADMIN)) and add a note (make it warning if you want) - X needed for Y, remove if you have security concerns.

Providing default with broken docker container monitoring (and who knows what else, I bet some external collector don't work because we don't add required capabilities) is super strange if you ask me.

I think it has been discussed, but not sure we agreed on that.

Ferroin commented 2 years ago

@ilyam8 I contend that most of the reason users keep running into this problem is mostly that we do not make it clear that there is an issue at all.

I generally agree that it’s a bit strange that we don’t provide full functionality with the default suggestion, but I would argue it’s even stranger (and far more concerning) for us to provide a default that completely circumvents any protections provided by Docker (we’ve already stripped most security out as is, but directly exposing the Docker API socket is just about the most dangerous thing you can do).

hugovalente-pm commented 2 years ago

@AlexNti after the chat with @Ferroin let's review the structure for Docker and Kubernetes I've done the suggested changes on here https://docs.google.com/spreadsheets/d/1qTZpItEVta3a0loqfiva0J4W4l2FOVrH53pHE9wRQgc/edit#gid=0

ilyam8 commented 2 years ago

@Ferroin my point is what you are suggesting sounds correct in theory but completely fails in practice (assuming the goal is to have docker Netdata with ~ exp to Netdata installed on the host). I personally have a hard time adding everything I need to start Netdata docker container with full capabilities. I need to copy one thing, jump here and there and copy some parts, don't miss anything, and don't make an error. For me it is very annoying, I just want to be able copy and paste.

Also, I don't understand your concerns to be honest:

cakrit commented 2 years ago

May I suggest an alternative? Provide two options, one with full access and one without, both with copy-able stuff. Easiest (but clunky) would be to show them one below the other, best would be to add another binary under the "Docker tab" that changes the commands based on a user option.

image
hugovalente-pm commented 2 years ago

probably we should deal with that on another ticket? the scope was to add the Docker Swarm part and some notes.

what we are doing here doesn't contradict a future iteration, meaning that users that don't want the full pre-setup configuration still need those "Notes:" callout to know what do to if they need

idea is to not further disrupt the work already being done and already deliver some value to the users