powerapi-ng / hwpc-sensor

Hardware Performance Counters monitoring agent for containers.
BSD 3-Clause "New" or "Revised" License

Sensor does not reconnect to the formula after disconnection #11

Closed PierreRustOrange closed 2 years ago

PierreRustOrange commented 2 years ago

Test scenario:

The following commands were used when testing:

Sensor:

docker run --privileged --rm --name sensor --network="host" powerapi/hwpc-sensor \
    -v /sys:/sys \
    -v /var/lib/docker/containers:/var/lib/docker/containers:ro \
    -n sensor \
    -f 2000 \
    -r socket -U 127.0.0.1 -P 12000 \
    -s "rapl" -o -e "RAPL_ENERGY_PKG" \
    -s "msr" -e "TSC" -e "APERF" -e "MPERF" \
    -c "core" -e "CPU_CLK_THREAD_UNHALTED:REF_P" \
              -e "CPU_CLK_THREAD_UNHALTED:THREAD_P" \
              -e "LLC_MISSES" \
              -e "INSTRUCTIONS_RETIRED"

BTW, the sensor in the latest docker image does not report its own version: I: 21-10-26 15:28:23 build: version undefined (rev: undefined) (Sep 28 2021 - 14:40:24)

Formula:

python -m smartwatts --debug --config-file config_file.json

version: today's pull on master.

configuration file:

{
  "verbose": true,
  "stream": true,
  "input": {
    "puller": {
      "model": "HWPCReport",
      "type": "socket",
      "uri": "127.0.0.1",
      "port": 12000
    }
  },
  "output": {
    "pusher_power": {
      "type": "csv",
      "uri": ".",
      "model" : "PowerReport"
    }
  },
  "cpu-frequency-base": 2300,
  "cpu-frequency-min": 800,
  "cpu-frequency-max": 5100,
  "cpu-error-threshold": 2.0,
  "disable-dram-formula": true,
  "sensor-report-sampling-interval": 1000
}

PierreRustOrange commented 2 years ago

I had a quick look at this issue, but I can't find what's causing the sensor to stop when the socket is closed on the server side. In reporting_actor, errors when writing out reports seem to be silently ignored, and I don't see any code reacting to connection loss. Obviously I'm missing something here. Any idea where I should look? Thanks!

PierreRustOrange commented 2 years ago

I've finally found the root cause: when writing to the closed socket we get a SIGPIPE signal, which is not handled and stops the sensor. We should probably ignore this signal and handle the resulting EPIPE error manually by reconnecting the socket.