matusnovak / prometheus-zfs

Prometheus exporter for (some) ZFS statistics via zpool iostat and zfs get space
The Unlicense

Segmentation fault on docker container (even when running as privileged) #4

Open taw123 opened 1 year ago

taw123 commented 1 year ago

First, thank you for creating exactly what I was looking for after moving my NAS to ZFS. I need to rework my Docker dashboard in Grafana to use the ZFS shares/pools rather than the old EXT3 mounts.

Unfortunately I am getting a segmentation fault on launch of your container (it just thrashes). I verified the container is running as privileged and don't see anything unusual in my compose file (a straight copy/paste from your readme)...

Any thoughts?

Container list:

CONTAINER ID   IMAGE                                        COMMAND                   CREATED          STATUS                           PORTS                                                                                                          NAMES
13229d380541   matusnovak/prometheus-zfs:latest             "/bin/sh -c \"./zfspr…"   51 minutes ago   Restarting (139) 3 seconds ago                                                                                                                  grafana_stack-zfs-metrics-1
acc89a09c03b   grafana/grafana:latest                       "/run.sh"                 56 minutes ago   Up 56 minutes                    0.0.0.0:3000->3000/tcp                                                                                         Grafana
f4cea04f509f   prom/prometheus:latest                       "/bin/prometheus --c…"    56 minutes ago   Up 56 minutes                    0.0.0.0:9090->9090/tcp                                                                                         Prometheus

Verification that your ZFS exporter container is/was trying to run as privileged:

# docker inspect --format='{{.HostConfig.Privileged}}' 13229d380541 acc89a09c03b f4cea04f509f
true
false
false

FYI, I'm also running cAdvisor (privileged) and of course Portainer is also running privileged, so there shouldn't be an issue with privilege escalation, nor any need to run Prometheus as privileged either. I'm kind of pulling my hair out here, as there isn't much to work with in the container log beyond the seg fault.

Extract from my compose stack:

  zfs-metrics:
    image: matusnovak/prometheus-zfs:latest
    hostname: zfs-metrics.local
    restart: unless-stopped
    privileged: true
    ports:
      - 9901:9901

Log:

Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault

Thanks again for the efforts here and any insights you might have! --T

matusnovak commented 1 year ago

Hi. Sorry for the late reply.

Huh, interesting. I suspect the fault might be with the call to ZFS that happens from the Python script. This has never happened to me before.

Could you try re-building the Docker image? The matusnovak/prometheus-zfs:latest image comes from Docker Hub and was built in 2021.

You could build it yourself:

git clone https://github.com/matusnovak/prometheus-zfs.git
cd prometheus-zfs
docker build -t matusnovak/prometheus-zfs:latest .

Then delete your container and run it again.

This is just to test whether a simple update to Python and its dependencies solves the problem.
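
If the stack is managed with docker compose, recreating just that one service should be enough. A sketch (the service name is taken from your compose extract; since the locally built image uses the same tag, compose will not pull it from Docker Hub):

docker compose up -d --force-recreate zfs-metrics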

taw123 commented 1 year ago

No worries about the delay; I got buried here too, sorry about that. Should be better now...

So, some progress.....

I built the image as you suggested, and while I don't get a segmentation fault now, I still have a bit of an issue... it doesn't look like any meaningful data is being emitted...

I have a compose stack with a number of other monitors (cAdvisor, Prometheus, node-exporter, etc.) to which I added your ZFS exporter. When it failed, I just commented the service out of my compose file. I then cloned and built as you suggested, nuked the container, and checked Portainer against the container ID I had noted before updating the stack.
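
For reference, a quick way to double-check that the recreated container is actually on the locally built image (container name taken from my earlier listing; adjust if it changed):

# the two image IDs should match
docker inspect --format='{{.Image}}' grafana_stack-zfs-metrics-1
docker images --no-trunc --format '{{.ID}}' matusnovak/prometheus-zfs:latest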

Output on port 9901:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 102.0
python_gc_objects_collected_total{generation="1"} 289.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 44.0
python_gc_collections_total{generation="1"} 3.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="6",version="3.10.6"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.84520704e+08
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 9.797632e+06
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.68143066046e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.7
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 6.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 65535.0
# HELP zfsprom_active Active state
# TYPE zfsprom_active gauge
# HELP zfsprom_size Size (bytes)
# TYPE zfsprom_size gauge
# HELP zfsprom_alloc Allocated space (bytes)
# TYPE zfsprom_alloc gauge
# HELP zfsprom_free Free space (bytes)
# TYPE zfsprom_free gauge
# HELP zfsprom_op_read Operations read
# TYPE zfsprom_op_read gauge
# HELP zfsprom_op_write Operations write
# TYPE zfsprom_op_write gauge
# HELP zfsprom_bw_read Bandwidth read (bytes)
# TYPE zfsprom_bw_read gauge
# HELP zfsprom_bw_write Bandwidth write (bytes)
# TYPE zfsprom_bw_write gauge
# HELP zfsprom_errors_read Read errors
# TYPE zfsprom_errors_read gauge
# HELP zfsprom_errors_write Write errors
# TYPE zfsprom_errors_write gauge
# HELP zfsprom_errors_cksum Checksum errors
# TYPE zfsprom_errors_cksum gauge
# HELP zfsprom_disk_status Disk status
# TYPE zfsprom_disk_status gauge

The only line of console output from the container doesn't seem helpful: No log line matching the '' filter

Log from image build:

[~] # cd test/prometheus-zfs/
[~/test/prometheus-zfs] # docker build -t matusnovak/prometheus-zfs:latest .
Sending build context to Docker daemon  94.21kB
Step 1/10 : FROM ubuntu:latest
 ---> 08d22c0ceb15
Step 2/10 : WORKDIR /usr/src
 ---> Using cache
 ---> 2f865a2ddd9a
Step 3/10 : RUN apt-get update &&     apt-get install --yes --no-install-recommends build-essential git python3 python3-dev python3-pip libzfslinux-dev
 ---> Using cache
 ---> cd3ee0d91761
Step 4/10 : RUN python3 -m pip install setuptools prometheus_client Cython
 ---> Using cache
 ---> e2f13a7b3d80
Step 5/10 : RUN git clone https://github.com/truenas/py-libzfs.git /tmp/py-libzfs
 ---> Using cache
 ---> 7d46436d8a60
Step 6/10 : RUN cd /tmp/py-libzfs &&     ./configure --prefix=/usr &&     make build &&     python3 setup.py install
 ---> Using cache
 ---> 00f72a9d3f8e
Step 7/10 : RUN apt-get remove --yes build-essential git python3-dev python3-pip libzfslinux-dev && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 6ee5b3ef690c
Step 8/10 : ADD zfsprom.py .
 ---> 168e11def62d
Step 9/10 : EXPOSE 9901
 ---> Running in 0414998777d6
Removing intermediate container 0414998777d6
 ---> 4e13bb3464c0
Step 10/10 : ENTRYPOINT "./zfsprom.py"
 ---> Running in 6535397b822c
Removing intermediate container 6535397b822c
 ---> 9893650c9a27
Successfully built 9893650c9a27
Successfully tagged matusnovak/prometheus-zfs:latest

Thanks again for all the help, seems like we are moving forward....👍 --T

matusnovak commented 1 year ago

Good to know rebuilding fixes the segmentation fault. I will trigger a new latest build.

The output you are getting is mostly the standard metrics from the prometheus_client library. The ZFS part should contain metrics for your pools, and those are definitely missing.

As for the No log line matching the '' filter message, I do not think that is related to this Python script. Could you let me know the ZFS version you are using on your system? Perhaps the problem is with the ZFS Python library.
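
For example, on most Linux hosts with OpenZFS either of these should print it:

zfs version
cat /sys/module/zfs/version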

However, since you have mentioned Portainer, I have found the exact same problem here: https://github.com/portainer/portainer/issues/6119

Could you also try running the container manually, not through Portainer?
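
Something like this should do; a sketch that mirrors the privileged flag and port mapping from your compose extract:

docker run --rm --privileged -p 9901:9901 matusnovak/prometheus-zfs:latest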

If that does not work either, could you try running it directly on the host without Docker? There are instructions here: https://github.com/matusnovak/prometheus-zfs#install but you don't need to run it as a service. Simply running it like this is sufficient:

# Needed for libzfs for Python
sudo apt-get install --yes --no-install-recommends libzfslinux-dev

# Needed for exporting metrics and Cython is needed by libzfs
sudo -H python3 -m pip install prometheus_client Cython

# Run the script (sudo does not pass exported variables through by default,
# so set ZPOOL_SCRIPTS_AS_ROOT on the sudo command line)
sudo ZPOOL_SCRIPTS_AS_ROOT=1 ./zfsprom.py

I am just trying to isolate the problem.
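
Once it is running, something like this should tell us whether the pools are visible at all. A sketch; it assumes the default port 9901 and that py-libzfs is installed on the host (as it is in the Dockerfile):

# the zfsprom_* metrics should have per-pool samples if libzfs can see the pools
curl http://localhost:9901/

# sanity-check the py-libzfs binding directly
sudo python3 -c "import libzfs; print([p.name for p in libzfs.ZFS().pools])"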