openv / vcontrold

:fire: vcontrold Daemon for control and logging of Viessmann® type heating devices
https://github.com/openv/openv/wiki
GNU General Public License v3.0
102 stars 55 forks source link

Bug/Fault: vcontrold stops working daily at 6:25 #119

Closed oriziena closed 4 months ago

oriziena commented 1 year ago

Hi all,

I am totally lost and cant find the root cause. Running vcontrold since years but after upgrading it to 0.98.12 tag it suddently started to "stop" working daily at 6:25. I already had a look to all daily cron jobs but at least I doubt that some of them are responsible for the current behaviour.

[42941] Mon Jan  9 06:25:51 2023 : Typ: short (in float: 72.000000)
[42941] Mon Jan  9 06:25:51 2023 : (FLOAT) Exp: V/10 [B0:48 B1:00 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 BA:00 BB:00 BC:00 BD:00 BE:00 BF:00 ]
[42941] Mon Jan  9 06:25:51 2023 : 7.200000 Grad Celsius
[42941] Mon Jan  9 06:25:51 2023 : Command: getTempSolarZulauf
[42941] Mon Jan  9 06:25:51 2023 : >SENT: 04
[42941] Mon Jan  9 06:25:51 2023 : >FRAMER: no preset result
[42941] Mon Jan  9 06:25:51 2023 : Waiting for 05
[42941] Mon Jan  9 06:25:53 2023 : <RECV: 05 (1990.0 ms)
[42941] Mon Jan  9 06:25:53 2023 : >SENT: 01
[42941] Mon Jan  9 06:25:53 2023 : >SENT: F7
[42941] Mon Jan  9 06:25:53 2023 : >SENT: 65
[42941] Mon Jan  9 06:25:53 2023 : >SENT: 66
[42941] Mon Jan  9 06:25:53 2023 : >SENT: 02
[42941] Mon Jan  9 06:25:53 2023 : >FRAMER: no preset result
[42941] Mon Jan  9 06:25:53 2023 : <RECV: len=1 09 (20.0 ms)
[42941] Mon Jan  9 06:25:53 2023 : <RECV: len=1 01 (0.0 ms)
[42941] Mon Jan  9 06:25:53 2023 : <RECV: received 09 01
[42941] Mon Jan  9 06:25:53 2023 : Typ: short (in float: 265.000000)
[42941] Mon Jan  9 06:25:53 2023 : (FLOAT) Exp: V/10 [B0:09 B1:01 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 BA:00 BB:00 BC:00 BD:00 BE:00 BF:00 ]
[42941] Mon Jan  9 06:25:53 2023 : 26.500000 Grad Celsius
[42941] Mon Jan  9 06:25:53 2023 : Command: quit
[42941] Mon Jan  9 06:25:53 2023 : Process 42941 released lock
[42941] Mon Jan  9 06:25:53 2023 : Closed connection (fd:1)
[42941] Mon Jan  9 06:25:53 2023 : Child process with PID 42941 terminated
[676] Mon Jan  9 06:26:53 2023 : Client connected 127.0.0.1:49772 (FD:1)
[676] Mon Jan  9 06:26:53 2023 : Child process started with pid 42977
[42977] Mon Jan  9 06:26:53 2023 : Command: getBrennerStarts
[42977] Mon Jan  9 06:26:53 2023 : Process 42977 tries to aquire lock
[676] Mon Jan  9 06:27:53 2023 : Client connected 127.0.0.1:59510 (FD:1)
[676] Mon Jan  9 06:27:53 2023 : Child process started with pid 43000
[43000] Mon Jan  9 06:27:53 2023 : Command: getBrennerStarts
[43000] Mon Jan  9 06:27:53 2023 : Process 43000 tries to aquire lock
[676] Mon Jan  9 06:28:53 2023 : Client connected 127.0.0.1:35648 (FD:1)
[676] Mon Jan  9 06:28:53 2023 : Child process started with pid 43021
[43021] Mon Jan  9 06:28:53 2023 : Command: getBrennerStarts
[43021] Mon Jan  9 06:28:53 2023 : Process 43021 tries to aquire lock
[676] Mon Jan  9 06:29:53 2023 : Client connected 127.0.0.1:57312 (FD:1)
[676] Mon Jan  9 06:29:53 2023 : Child process started with pid 43047
[43047] Mon Jan  9 06:29:53 2023 : Command: getBrennerStarts
[43047] Mon Jan  9 06:29:53 2023 : Process 43047 tries to aquire lock

Any idea/help is highly appreciated. Thanks!

System Ubuntu 22.04.1 LTS with latest patches Kernel 5.15.0-1022-raspi aarch64 (rpi4)

BR Oriziena

oriziena commented 1 year ago

and btw...

if versions are tagged. shouldnt files also contain these tag version? as you can see I checked out v0.98.12 but it still contains 0.98.10-9-g977e6f5 which seems hardcoded?

root@cloud /opt/vcontrold (git)-[tags/v0.98.12] # cat src/version.h
/* Copyright (c) 2017 Tobias Leupold <tobias.leupold@gmx.de>

   This program is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

#ifndef VERSION_H
#define VERSION_H

#define VERSION "0.98.10-9-g977e6f5"

#endif // VERSION_H

define VERSION "0.98.10-9-g977e6f5" ==> should be 0.98.12 ?

l3u commented 1 year ago

Hi Andreas!

This is a generated file that has been created in src/ quite some time ago. Meanwhile, it is created inside the build directory (to keep the sources clean). In src/, you only find the cmake template file version.h.in cmake uses to create the actual file.

TL;DR: This is some remnant of an old version of vcontrold you still have in your sources, because it was created at some point in the past and isn't versioned. It's not used anymore and you can simply delete it.

Cheers, Tobias

oriziena commented 1 year ago

Hi @l3u,

I now removed my local copy of vcontrold and made a complete new checkout ... rebuild all ...

 vcontrold -V
vcontrold version 0.98.12

Why ever a git checkout v.0.98.12 and building from scracht previously didn't work...

I will now monitor tomorrow morning if the issue around 6:25 is gone. I still have no idea why it is happening every day at this time :-(

BR Andreas

l3u commented 1 year ago

What problems do you have?! Works without a problem here:

$ git clone https://github.com/openv/vcontrold.git
$ cd vcontrold/
$ mkdir build
$ cd build/
$ git checkout v0.98.12
$ cmake -G Ninja -DMANPAGES=OFF ..
$ ninja -j6
oriziena commented 1 year ago

Hi Tobias,

for whatever reason I checked out v.0.98.12 and still had those old version.c file

I execute the installation as documented in docs/INSTALL.md

mkdir build
cd build
cmake ..
make
make install

how ever.. the real issue for this ticket is, that vcontrold stops working daily at 6:25 and I have no idea why. It tries to acquire lock and then nothing happens. I already checked cron.daily files which are executed at 6:25 but cant find any hint . Also executed them manually and vcontrold resumes running :-/

l3u commented 1 year ago

Sadly, I neither can reproduce your timing issue, nor your build problems … :-(

oriziena commented 1 year ago

okay.. no worries.. will resume search the rootcause in my system..

closing this :)

oriziena commented 1 year ago

Hi,

leider gehen mir die Ideen aus. Ich habe nun Raspian latest 64bit frisch installiert.

Noch immer geht vcontrold jeden morgen in folgende Problematik:

[421451] Wed Jan 18 06:24:57 2023 : 48.000000 Grad Celsius
[421451] Wed Jan 18 06:24:57 2023 : Command: getTempWWIst
[421451] Wed Jan 18 06:24:57 2023 : >SENT: 04
[421451] Wed Jan 18 06:24:57 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:24:57 2023 : Waiting for 05
[421451] Wed Jan 18 06:24:59 2023 : <RECV: 05 (2000.0 ms)
[421451] Wed Jan 18 06:24:59 2023 : >SENT: 01
[421451] Wed Jan 18 06:24:59 2023 : >SENT: F7
[421451] Wed Jan 18 06:24:59 2023 : >SENT: 08
[421451] Wed Jan 18 06:24:59 2023 : >SENT: 12
[421451] Wed Jan 18 06:24:59 2023 : >SENT: 02
[421451] Wed Jan 18 06:24:59 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:24:59 2023 : <RECV: len=1 76 (20.0 ms)
[421451] Wed Jan 18 06:24:59 2023 : <RECV: len=1 01 (0.0 ms)
[421451] Wed Jan 18 06:24:59 2023 : <RECV: received 76 01
[421451] Wed Jan 18 06:24:59 2023 : Typ: short (in float: 374.000000)
[421451] Wed Jan 18 06:24:59 2023 : (FLOAT) Exp: V/10 [B0:76 B1:01 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 BA:00 BB:00 BC:00 BD:00 BE:00 BF:00 ]
[421451] Wed Jan 18 06:24:59 2023 : 37.400002 Grad Celsius
[421451] Wed Jan 18 06:24:59 2023 : Command: getTempSolarDach
[421451] Wed Jan 18 06:24:59 2023 : >SENT: 04
[421451] Wed Jan 18 06:24:59 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:24:59 2023 : Waiting for 05
[421451] Wed Jan 18 06:25:01 2023 : <RECV: 05 (1990.0 ms)
[421451] Wed Jan 18 06:25:01 2023 : >SENT: 01
[421451] Wed Jan 18 06:25:01 2023 : >SENT: F7
[421451] Wed Jan 18 06:25:01 2023 : >SENT: 65
[421451] Wed Jan 18 06:25:01 2023 : >SENT: 64
[421451] Wed Jan 18 06:25:01 2023 : >SENT: 02
[421451] Wed Jan 18 06:25:01 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:25:01 2023 : <RECV: len=1 F2 (50.0 ms)
[421451] Wed Jan 18 06:25:01 2023 : <RECV: len=1 FF (0.0 ms)
[421451] Wed Jan 18 06:25:01 2023 : <RECV: received F2 FF
[421451] Wed Jan 18 06:25:01 2023 : Typ: short (in float: -14.000000)
[421451] Wed Jan 18 06:25:01 2023 : (FLOAT) Exp: V/10 [B0:F2 B1:FF B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 BA:00 BB:00 BC:00 BD:00 BE:00 BF:00 ]
[421451] Wed Jan 18 06:25:01 2023 : -1.400000 Grad Celsius
[421451] Wed Jan 18 06:25:01 2023 : Command: getTempSolarZulauf
[421451] Wed Jan 18 06:25:01 2023 : >SENT: 04
[421451] Wed Jan 18 06:25:01 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:25:01 2023 : Waiting for 05
[421451] Wed Jan 18 06:25:03 2023 : <RECV: 05 (1990.0 ms)
[421451] Wed Jan 18 06:25:03 2023 : >SENT: 01
[421451] Wed Jan 18 06:25:03 2023 : >SENT: F7
[421451] Wed Jan 18 06:25:03 2023 : >SENT: 65
[421451] Wed Jan 18 06:25:03 2023 : >SENT: 66
[421451] Wed Jan 18 06:25:03 2023 : >SENT: 02
[421451] Wed Jan 18 06:25:03 2023 : >FRAMER: no preset result
[421451] Wed Jan 18 06:25:03 2023 : <RECV: len=1 16 (30.0 ms)
[421451] Wed Jan 18 06:25:03 2023 : <RECV: len=1 01 (0.0 ms)
[421451] Wed Jan 18 06:25:03 2023 : <RECV: received 16 01
[421451] Wed Jan 18 06:25:03 2023 : Typ: short (in float: 278.000000)
[421451] Wed Jan 18 06:25:03 2023 : (FLOAT) Exp: V/10 [B0:16 B1:01 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 BA:00 BB:00 BC:00 BD:00 BE:00 BF:00 ]
[421451] Wed Jan 18 06:25:03 2023 : 27.799999 Grad Celsius
[421451] Wed Jan 18 06:25:03 2023 : Command: quit
[421451] Wed Jan 18 06:25:03 2023 : Process 421451 released lock
[421451] Wed Jan 18 06:25:03 2023 : Closed connection (fd:1)
[421451] Wed Jan 18 06:25:03 2023 : Child process with PID 421451 terminated
[333552] Wed Jan 18 06:26:03 2023 : Client connected 127.0.0.1:41646 (FD:1)
[333552] Wed Jan 18 06:26:03 2023 : Child process started with pid 421531
[421531] Wed Jan 18 06:26:03 2023 : Command: getBrennerStarts
[421531] Wed Jan 18 06:26:03 2023 : Process 421531 tries to aquire lock
[333552] Wed Jan 18 06:27:03 2023 : Client connected 127.0.0.1:42798 (FD:1)
[333552] Wed Jan 18 06:27:03 2023 : Child process started with pid 421577
[421577] Wed Jan 18 06:27:03 2023 : Command: getBrennerStarts
[421577] Wed Jan 18 06:27:03 2023 : Process 421577 tries to aquire lock
[333552] Wed Jan 18 06:28:03 2023 : Client connected 127.0.0.1:35262 (FD:1)
[333552] Wed Jan 18 06:28:03 2023 : Child process started with pid 421597
[421597] Wed Jan 18 06:28:03 2023 : Command: getBrennerStarts
[421597] Wed Jan 18 06:28:03 2023 : Process 421597 tries to aquire lock
[333552] Wed Jan 18 06:29:03 2023 : Client connected 127.0.0.1:33584 (FD:1)
[333552] Wed Jan 18 06:29:03 2023 : Child process started with pid 421621
[421621] Wed Jan 18 06:29:03 2023 : Command: getBrennerStarts
[421621] Wed Jan 18 06:29:03 2023 : Process 421621 tries to aquire lock
[333552] Wed Jan 18 06:30:03 2023 : Client connected 127.0.0.1:54956 (FD:1)
[333552] Wed Jan 18 06:30:03 2023 : Child process started with pid 421669
anzimber@cloud ~ % lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye
anzimber@cloud ~ % uname -a
Linux cloud 5.15.87-v8+ #1616 SMP PREEMPT Fri Jan 13 17:01:45 GMT 2023 aarch64 GNU/Linux

Ich verzweifel!

BR Andreas

speters commented 1 year ago

See /etc/crontab to look for stuff being launched at 6h25. Most likely you'll see an anacron entry for the stuff in /etc/cron.daily.

With this confirmed you can hunt down the cron job which causes this issue. A suspect is the logrotate job. You should also have a look if there is some clean-up taking place in in-memory filesystems. If in doubt move your log and lock file to a different location. If one of these is the culprit, you can try out changes to the vcontrold sourcecode to make sure file location of a file descriptor still exists (please supply patch in case you fixed it).

A simple workaround would be to just start/stop the vcontrold automatically before/after the problem-causing job is run.

oriziena commented 1 year ago

Hi, for testing purposes I removed logrotate job. lets see tomorrow. Thanks Andreas

oriziena commented 1 year ago

Good morning, the issue persists. I now deactivated all daily cron tabs .. lets wait for tomorrow. BR Andreas

oriziena commented 1 year ago

hi,

sorry for my late follow up.

after I removed all systemd timers and daily cron jobs the vcontrold is working ...

will now resume to find the root causing systemd timer/cron job.

br

andreas

speters commented 1 year ago

vcontrold has a bad smell in some places.

The semaphore stuff is one of these places, as its key relies on ftok() which may or may not (depending on implementation) give different keys when the file is the same, but not identic.

Please try this small patch:

diff --git i/src/semaphore.c w/src/semaphore.c
index 33d8ffd..f57b082 100644
--- i/src/semaphore.c
+++ w/src/semaphore.c
@@ -110,7 +110,7 @@ int vcontrol_seminit()
     sb.sem_op = -1; // set to allocate resource
     sb.sem_flg = SEM_UNDO;

-    if ((key = ftok(tmpfilename, 'V')) == -1) {
+    if ((key = ftok("/dev/null", 'V')) == -1) {
         perror("ftok");
         exit(1);
     }
oriziena commented 4 months ago

I moved to homeassistant and cannot see that behaviour anymore... therfore closing...

oriziena commented 4 months ago

closing