Closed l3u closed 6 years ago
I tried switching from homebrew expect scripts to vcontrold today, but ran into a freezing issue as well. I have a high bit error rate on my serial line, and if one happens at the right time, vcontrold becomes unresponsive with the following symptoms:
After a couple dozen successful data transfers the clients starts logging:
[25790] Thu May 10 23:17:54 2018 : timeout wait:vctrld>
[25790] Thu May 10 23:17:54 2018 : Error communicating with the server
[25847] Thu May 10 23:18:20 2018 : timeout wait:vctrld>
[25847] Thu May 10 23:18:20 2018 : Error communicating with the server
The last words from the server are:
SRV ERR: >FRAMER: unexpected data 0xFFFFFF81
>FRAMER: unexpected data 0xFFFFFF81
>FRAMER: unexpected data 0xFFFFFF81
>FRAMER: could not close (3 attempts)
>FRAMER: could not set start condition
Error opening /dev/ttyUSB1
ttyUSB1 was fine except for a burst of bit errors. Once this happens, each connection attempt creates a new vcontrold that is stuck waiting for a semaphore and will not go away when the client closes the connection due to timeout:
ssh/fusion:~$ ps auxwww|grep -c vcontrold
47
(gdb) bt
#0 0x00007f5e81c5ce07 in semop () at ../sysdeps/unix/syscall-template.S:81
#1 0x000000000040cc94 in vcontrol_semget () at /home/andreas/vcontrold/src/semaphore.c:158
#2 0x00000000004127bb in interactive (socketfd=1, device=0x1fb66a0 "/dev/ttyUSB1") at /home/andreas/vcontrold/src/vcontrold.c:437
#3 0x0000000000413f6c in main (argc=3, argv=0x7fff01c6d678) at /home/andreas/vcontrold/src/vcontrold.c:995
I'm switching back to talking to the vitotronic directly again since I don't have time to debug vcontrold further to make it robust against less than perfect serial links…
After upgrading my Raspberry (just using apt-get on Jessie) I run into the same problems. I get lots of unexpected data and finally the USB tty device gets an open error. After that I get lots of started vcontrold processes which do not terminate (like @anse1 reported). Is someone out there to solve the problem that vcontrold does not exit in case of communication problems? Thanks!
Hi Holger,
since it started with the upgrade, it could be that systemd also started using your /dev/ttyAMA0 and stealing random chunks of the input, leading to the protocol errors. You can verify this using the following command:
systemctl status serial-getty@ttyAMA0.service
If that's active, I think the following should make systemd stop messing up the communication:
systemctl mask serial-getty@ttyAMA0.service systemctl disable serial-getty@ttyAMA0.service systemctl stop serial-getty@ttyAMA0.service
regards, Andreas
I don't think that this is systemd related. At least, my Raspberry Pi runs Gentoo on OpenRC, not systemd.
Hi Andreas,
thanks a lot for your help. But the service is dead (not active). Maybe I've got a hardware problem or something like that. If I look into the debug logs I see
[7473] Mon Sep 10 21:31:04 2018 : Command: getGwgBetriebsart_M2
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 33
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 23
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:04 2018 : >SEND: 5D
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (30.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_set_actaddr framer_current_addr = 2333 (was FFFF)
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Command send
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: no preset result
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 41 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 41
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (20.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=5 01 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=2 00 (10.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 01 01 33 23 01 00 5F
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_NO_ADDR (was 2333)
[7473] Mon Sep 10 21:31:05 2018 : Received: 00
[7473] Mon Sep 10 21:31:05 2018 : Command: unit on
[7473] Mon Sep 10 21:31:05 2018 : Command: getStatusUmschaltVentil
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 0A
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 10
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 21
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 15 (120.0 ms)
7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Error 0x15 != 0x06 (P300_INIT_OK)
[7473] Mon Sep 10 21:31:05 2018 : Error in send, terminating
[7473] Mon Sep 10 21:31:05 2018 : Error executing getStatusUmschaltVentil
... two more FRAMER Errors and then it works again ...
7473] Mon Sep 10 21:31:05 2018 : Command: getTempAbgas
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 41
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 05
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 00
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 01
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 08
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 08
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 02
[7473] Mon Sep 10 21:31:05 2018 : >SEND: 18
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=1 06 (30.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 06
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_set_actaddr framer_current_addr = 0808 (was FFFF)
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: Command send
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: no preset result
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=2 41 (20.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 41 07
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=4 01 (0.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: len=4 02 (10.0 ms)
[7473] Mon Sep 10 21:31:05 2018 : <RECV: received 01 01 08 08 02 27 01 43
[7473] Mon Sep 10 21:31:05 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_NO_ADDR (was 0808)
[7473] Mon Sep 10 21:31:05 2018 : Typ: short (in float: 295.000000)
[7473] Mon Sep 10 21:31:05 2018 : (FLOAT) Exp: V/10 [B0:27 B1:01 B2:00 B3:00 B4:00 B5:00 B6:00 B7:00 B8:00 B9:00 ]
[7473] Mon Sep 10 21:31:05 2018 : 29.500000 Grad Celsius
Vito answers 0x15 (P300_ERROR) instead of 0x06 (P300_INIT_OK). Maybe I have this for some time already. But now after some time vcontrold is unable to open the device (Error opening /dev/vitoir0
). I assume that for some reason the semaphore it not released so that vcontrold gets blocked. I can not find a "semop" message in my logs (does perror("semop") write to the log file?). I will see if I can reproduce the bug with debugging (-g) enabled ...
After looking into the log
[10520] Mon Sep 10 22:31:01 2018 : Command: getTempRaumSollReduziert_M2
[10520] Mon Sep 10 22:31:01 2018 : Process 10520 tries to aquire lock
[10520] Mon Sep 10 22:31:01 2018 : Process 10520 got lock
[10520] Mon Sep 10 22:31:01 2018 : >FRAMER: open device /dev/vitoir0 ProtocolID 41
[10520] Mon Sep 10 22:31:01 2018 : Configuring serial interface /dev/vitoir0
[10520] Mon Sep 10 22:31:01 2018 : >SEND: 04
[10520] Mon Sep 10 22:31:03 2018 : <RECV: len=1 05 (1560.0 ms)
[10520] Mon Sep 10 22:31:03 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:03 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(05) (was FFFF)
[10520] Mon Sep 10 22:31:03 2018 : >FRAMER: closed
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:03 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:05 2018 : <RECV: len=1 05 (2010.0 ms)
[10520] Mon Sep 10 22:31:05 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:05 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:07 2018 : <RECV: len=1 05 (2020.0 ms)
[10520] Mon Sep 10 22:31:07 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 16
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:07 2018 : >SEND: 00
[10520] Mon Sep 10 22:31:09 2018 : <RECV: len=1 05 (2250.0 ms)
[10520] Mon Sep 10 22:31:09 2018 : <RECV: received 05
[10520] Mon Sep 10 22:31:09 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(15) (was FE05)
[10520] Mon Sep 10 22:31:09 2018 : >FRAMER: could not close (3 attempts)
[10520] Mon Sep 10 22:31:09 2018 : Error opening /dev/vitoir0
[10520] Mon Sep 10 22:31:09 2018 : Command: getTempRaumSoll_M2
[10520] Mon Sep 10 22:31:09 2018 : Process 10520 tries to aquire lock
[7160] Mon Sep 10 22:32:02 2018 : Client connected 127.0.0.1:44676 (FD:1)
process 10520 tries to aquire lock and waits. At 22:32:02 the next query is started by my cron job. I think the bug is here:
// We only open the device if we have something to do. But only if it's not open yet.
if (fd < 0) {
/* As one vclient call opens the link once, all is seen a transaction
* This may cause trouble for telnet sessions, as the whole session is
* one link activity, even more commands are given within.
* This is related to a accept/close on a server socket
*/
// everything on link is a transaction - all commands
vcontrol_semget(); // TODO semjfi
if ((fd = framer_openDevice(device, cfgPtr->devPtr->protoPtr->id)) == -1) {
logIT(LOG_ERR, "Error opening %s", device);
sendErrMsg(socketfd);
if (!Writen(socketfd, prompt, strlen(prompt))) {
sendErrMsg(socketfd);
framer_closeDevice(fd);
vcontrol_semrelease(); // TODO semjfi
return 0;
}
continue;
}
}
The semaphore is locked and framer_openDevice()
failes for some reason. Writen()
will return 7 (“vctrld>”) so the if
case is not executed and we end up in continue
. Trying to send the next command, fd
is still -1. We do another semaphore lock without releasing before and are catched in a dead lock …
I do not understand why the semaphore release is bound to a unsuccessful write of the prompt. I would propose to write:
if ((fd = framer_openDevice(device, cfgPtr->devPtr->protoPtr->id)) == -1) {
logIT(LOG_ERR, "Error opening %s", device);
sendErrMsg(socketfd);
framer_closeDevice(fd);
vcontrol_semrelease(); // TODO semjfi
if (!Writen(socketfd, prompt, strlen(prompt))) {
sendErrMsg(socketfd);
return 0;
}
continue;
}
Testing this at the moment and for the first hours it looks good.
Ok, it looks like this patch really works. Take a look at the log ...
[23556] Tue Sep 11 19:59:07 2018 : Command: getTempRaumSoll_M2
[23556] Tue Sep 11 19:59:07 2018 : Process 23556 tries to aquire lock
[23556] Tue Sep 11 19:59:07 2018 : Process 23556 got lock
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: open device /dev/vitoir0 ProtocolID 41
[23556] Tue Sep 11 19:59:07 2018 : Configuring serial interface /dev/vitoir0
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 04
[23556] Tue Sep 11 19:59:07 2018 : <RECV: len=1 06 (20.0 ms)
[23556] Tue Sep 11 19:59:07 2018 : <RECV: received 06
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(05) (was FE15)
[23556] Tue Sep 11 19:59:07 2018 : >FRAMER: closed
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:07 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:09 2018 : <RECV: len=1 05 (1990.0 ms)
[23556] Tue Sep 11 19:59:09 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:09 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:11 2018 : <RECV: len=1 05 (1990.0 ms)
[23556] Tue Sep 11 19:59:11 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 16
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:11 2018 : >SEND: 00
[23556] Tue Sep 11 19:59:13 2018 : <RECV: len=1 05 (2250.0 ms)
[23556] Tue Sep 11 19:59:13 2018 : <RECV: received 05
[23556] Tue Sep 11 19:59:13 2018 : >FRAMER: framer_reset_actaddr framer_current_addr = FRAMER_LINK_STATUS(15) (was FE05)
[23556] Tue Sep 11 19:59:13 2018 : >FRAMER: could not close (3 attempts)
[23556] Tue Sep 11 19:59:13 2018 : Error opening /dev/vitoir0
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 released lock
[23556] Tue Sep 11 19:59:13 2018 : Command: getTempRaum_M2Tiefpass
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 tries to aquire lock
[23556] Tue Sep 11 19:59:13 2018 : Process 23556 got lock
After Error opening /dev/vitoir0
the lock is released and we can get a new lock.
I just noticed the following:
Today, the heating engineer did some maintaing on my heating. He removed the optolink adapter for an hour or so. I have a Raspberry Pi running that requests some data each five minutes, so some requests wil have failed in that time.
Even after replugging the adapter, I could not request any data anymore, although the daemon was apparently still running. After restarting the daemon, everything worked normally again.
So I think something goes wrong when a lot of requests fail? Surely not a severe issue, but probably something we want to fix.
Cheers, Tobias