openv / vcontrold

:fire: vcontrold Daemon for control and logging of Viessmann® type heating devices
https://github.com/openv/openv/wiki
GNU General Public License v3.0
102 stars 57 forks source link

vcontrold becomes unresponsive after some failed requests #30

Closed l3u closed 6 years ago

l3u commented 6 years ago

I just noticed the following:

Today, the heating engineer did some maintaing on my heating. He removed the optolink adapter for an hour or so. I have a Raspberry Pi running that requests some data each five minutes, so some requests wil have failed in that time.

Even after replugging the adapter, I could not request any data anymore, although the daemon was apparently still running. After restarting the daemon, everything worked normally again.

So I think something goes wrong when a lot of requests fail? Surely not a severe issue, but probably something we want to fix.

Cheers, Tobias

anse1 commented 6 years ago

I tried switching from homebrew expect scripts to vcontrold today, but ran into a freezing issue as well. I have a high bit error rate on my serial line, and if one happens at the right time, vcontrold becomes unresponsive with the following symptoms:

After a couple dozen successful data transfers the clients starts logging:

[25790] Thu May 10 23:17:54 2018 : timeout wait:vctrld>
[25790] Thu May 10 23:17:54 2018 : Error communicating with the server
[25847] Thu May 10 23:18:20 2018 : timeout wait:vctrld>
[25847] Thu May 10 23:18:20 2018 : Error communicating with the server

The last words from the server are:

SRV ERR: >FRAMER: unexpected data 0xFFFFFF81
>FRAMER: unexpected data 0xFFFFFF81
>FRAMER: unexpected data 0xFFFFFF81
>FRAMER: could not close (3 attempts)
>FRAMER: could not set start condition
Error opening /dev/ttyUSB1

ttyUSB1 was fine except for a burst of bit errors. Once this happens, each connection attempt creates a new vcontrold that is stuck waiting for a semaphore and will not go away when the client closes the connection due to timeout:

ssh/fusion:~$ ps auxwww|grep -c vcontrold
47
(gdb) bt
#0  0x00007f5e81c5ce07 in semop () at ../sysdeps/unix/syscall-template.S:81
#1  0x000000000040cc94 in vcontrol_semget () at /home/andreas/vcontrold/src/semaphore.c:158
#2  0x00000000004127bb in interactive (socketfd=1, device=0x1fb66a0 "/dev/ttyUSB1") at /home/andreas/vcontrold/src/vcontrold.c:437
#3  0x0000000000413f6c in main (argc=3, argv=0x7fff01c6d678) at /home/andreas/vcontrold/src/vcontrold.c:995

I'm switching back to talking to the vitotronic directly again since I don't have time to debug vcontrold further to make it robust against less than perfect serial links…

hmueller01 commented 6 years ago

After upgrading my Raspberry (just using apt-get on Jessie) I run into the same problems. I get lots of unexpected data and finally the USB tty device gets an open error. After that I get lots of started vcontrold processes which do not terminate (like @anse1 reported). Is someone out there to solve the problem that vcontrold does not exit in case of communication problems? Thanks!

anse1 commented 6 years ago

Hi Holger,

since it started with the upgrade, it could be that systemd also started using your /dev/ttyAMA0 and stealing random chunks of the input, leading to the protocol errors. You can verify this using the following command:

systemctl status serial-getty@ttyAMA0.service

If that's active, I think the following should make systemd stop messing up the communication:

systemctl mask serial-getty@ttyAMA0.service systemctl disable serial-getty@ttyAMA0.service systemctl stop serial-getty@ttyAMA0.service

regards, Andreas

l3u commented 6 years ago

I don't think that this is systemd related. At least, my Raspberry Pi runs Gentoo on OpenRC, not systemd.

hmueller01 commented 6 years ago

Hi Andreas,

thanks a lot for your help. But the service is dead (not active). Maybe I've got a hardware problem or something like that. If I look into the debug logs I see

[7473] Mon Sep 10 21:31:04 2018 : Command: getGwgBetriebsart_M2
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 33
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 23
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 5D
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (30.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_set_actaddr framer_current_addr = 2333 (was FFFF)
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Command send
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: no preset result
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 41 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 41
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (20.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=5 01 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=2 00 (10.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 01 01 33 23 01 00 5F
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_NO_ADDR (was 2333)
[7473] Mon Sep 10 21:31:05 2018 : Received: 00 
[7473] Mon Sep 10 21:31:05 2018 : Command: unit on
[7473] Mon Sep 10 21:31:05 2018 : Command: getStatusUmschaltVentil
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 0A
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 10
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 21
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 15 (120.0 ms)
7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Error 0x15 != 0x06 (P300_INIT_OK)
[7473] Mon Sep 10 21:31:05 2018 : Error in send, terminating
[7473] Mon Sep 10 21:31:05 2018 : Error executing getStatusUmschaltVentil

... two more FRAMER Errors and then it works again ...

7473] Mon Sep 10 21:31:05 2018 : Command: getTempAbgas
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 08
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 08
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 02
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 18
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (30.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_set_actaddr framer_current_addr = 0808 (was FFFF)
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Command send
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: no preset result
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=2 41 (20.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 41 07
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=4 01 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=4 02 (10.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 01 01 08 08 02 27 01 43
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_NO_ADDR (was 0808)
[7473] Mon Sep 10 21:31:05 2018 : Typ: short (in float: 295.000000)
[7473] Mon Sep 10 21:31:05 2018 : (FLOAT) Exp: V/10 [B0:27 B1:01 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 ]
[7473] Mon Sep 10 21:31:05 2018 : 29.500000 Grad Celsius

Vito answers 0x15 (P300_ERROR) instead of 0x06 (P300_INIT_OK). Maybe I have this for some time already. But now after some time vcontrold is unable to open the device (Error opening /dev/vitoir0). I assume that for some reason the semaphore it not released so that vcontrold gets blocked. I can not find a "semop" message in my logs (does perror("semop") write to the log file?). I will see if I can reproduce the bug with debugging (-g) enabled ...

hmueller01 commented 6 years ago

After looking into the log

[10520] Mon Sep 10 22:31:01 2018 : Command: getTempRaumSollReduziert_M2
[10520] Mon Sep 10 22:31:01 2018 : Process 10520 tries to aquire lock
[10520] Mon Sep 10 22:31:01 2018 : Process 10520 got lock
[10520] Mon Sep 10 22:31:01 2018 : >FRAMER: open device /dev/vitoir0 ProtocolID 41
[10520] Mon Sep 10 22:31:01 2018 : Configuring serial interface /dev/vitoir0
[10520] Mon Sep 10 22:31:01 2018 : >SEND: 04
[10520] Mon Sep 10 22:31:03 2018 : <RECV: len=1 05 (1560.0 ms)
[10520] Mon Sep 10 22:31:03 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:03 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(05) (was FFFF)
[10520] Mon Sep 10 22:31:03 2018 : >FRAMER: closed
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:05 2018 : <RECV: len=1 05 (2010.0 ms)
[10520] Mon Sep 10 22:31:05 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:07 2018 : <RECV: len=1 05 (2020.0 ms)
[10520] Mon Sep 10 22:31:07 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:09 2018 : <RECV: len=1 05 (2250.0 ms)
[10520] Mon Sep 10 22:31:09 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:09 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(15) (was FE05)
[10520] Mon Sep 10 22:31:09 2018 : >FRAMER: could not close (3 attempts)
[10520] Mon Sep 10 22:31:09 2018 : Error opening /dev/vitoir0
[10520] Mon Sep 10 22:31:09 2018 : Command: getTempRaumSoll_M2
[10520] Mon Sep 10 22:31:09 2018 : Process 10520 tries to aquire lock
[7160] Mon Sep 10 22:32:02 2018 : Client connected 127.0.0.1:44676 (FD:1)

process 10520 tries to aquire lock and waits. At 22:32:02 the next query is started by my cron job. I think the bug is here:

            // We only open the device if we have something to do. But only if it's not open yet.
            if (fd < 0) {
                /* As one vclient call opens the link once, all is seen a transaction
                 * This may cause trouble for telnet sessions, as the whole session is
                 * one link activity, even more commands are given within.
                 * This is related to a accept/close on a server socket
                 */
                // everything on link is a transaction - all commands
                vcontrol_semget(); // TODO semjfi

                if ((fd = framer_openDevice(device, cfgPtr->devPtr->protoPtr->id)) == -1) {
                    logIT(LOG_ERR, "Error opening %s", device);
                    sendErrMsg(socketfd);
                    if (!Writen(socketfd, prompt, strlen(prompt))) {
                        sendErrMsg(socketfd);
                        framer_closeDevice(fd);
                        vcontrol_semrelease();  // TODO semjfi
                        return 0;
                    }
                    continue;
                }
            }

The semaphore is locked and framer_openDevice() failes for some reason. Writen() will return 7 (“vctrld>”) so the if case is not executed and we end up in continue. Trying to send the next command, fd is still -1. We do another semaphore lock without releasing before and are catched in a dead lock … I do not understand why the semaphore release is bound to a unsuccessful write of the prompt. I would propose to write:

                if ((fd = framer_openDevice(device, cfgPtr->devPtr->protoPtr->id)) == -1) {
                    logIT(LOG_ERR, "Error opening %s", device);
                    sendErrMsg(socketfd);
                    framer_closeDevice(fd);
                    vcontrol_semrelease();  // TODO semjfi
                    if (!Writen(socketfd, prompt, strlen(prompt))) {
                        sendErrMsg(socketfd);
                        return 0;
                    }
                    continue;
                }

Testing this at the moment and for the first hours it looks good.

hmueller01 commented 6 years ago

Ok, it looks like this patch really works. Take a look at the log ...

[23556] Tue Sep 11 19:59:07 2018 : Command: getTempRaumSoll_M2
[23556] Tue Sep 11 19:59:07 2018 : Process 23556 tries to aquire lock
[23556] Tue Sep 11 19:59:07 2018 : Process 23556 got lock
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: open device /dev/vitoir0 ProtocolID 41
[23556] Tue Sep 11 19:59:07 2018 : Configuring serial interface /dev/vitoir0
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 04
[23556] Tue Sep 11 19:59:07 2018 : <RECV: len=1 06 (20.0 ms)
[23556] Tue Sep 11 19:59:07 2018 : <RECV: received 06
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(05) (was FE15)
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: closed
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:09 2018 : <RECV: len=1 05 (1990.0 ms)
[23556] Tue Sep 11 19:59:09 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:11 2018 : <RECV: len=1 05 (1990.0 ms)
[23556] Tue Sep 11 19:59:11 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:13 2018 : <RECV: len=1 05 (2250.0 ms)
[23556] Tue Sep 11 19:59:13 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:13 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(15) (was FE05)
[23556] Tue Sep 11 19:59:13 2018 : >FRAMER: could not close (3 attempts)
[23556] Tue Sep 11 19:59:13 2018 : Error opening /dev/vitoir0
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 released lock
[23556] Tue Sep 11 19:59:13 2018 : Command: getTempRaum_M2Tiefpass
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 tries to aquire lock
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 got lock

After Error opening /dev/vitoir0 the lock is released and we can get a new lock.