Hi Tysonite,
What sort of load is the device under? How much are you transmitting, approximately how much are you receiving, and are you using both channels on the device? If you have a bus capture, we could take a look for you as well to get a better idea of what's going on. You can send it to support@intrepidcs.com and mention me.
How long is the device running before you see these messages?
Some debugging steps:
If the device is connected to a USB hub or USB extension, try connecting it directly to the USB port on the host computer. Since you're using a single board computer, there's actually a chance that the USB ports on the computer itself are marginally underpowered. Check that the OrangePi has a suitable power supply (it appears that they recommend 5V 3A). Pin 9 on the ValueCAN 3 is a 6-40V DC input, so you could connect it to vehicle/bench power as well.
Check that everything is sharing a common ground. Pin 6 on the ValueCAN 3 is for ground, and this should be connected to the vehicle ground (or the same ground as the ECU, if testing on a bench).
Ensure that the firmware is up to date. You can download our device drivers and settings application (NeoVI Explorer) to a Windows computer from this link https://cdn.intrepidcs.net/updates/files/RP1210KitInstall.zip. Once this is installed, open NeoVI Explorer and connect to the device. The device will automatically update the firmware if there are any updates.
Reset to default settings. Again from the NeoVI Explorer application, connect to the device and select "Load Default Settings". From here, you can set any settings you need (baud rates, etc) and try again.
--Paul
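For the bus capture Paul requests above, a minimal sketch using python-can (which this setup already uses further down the thread) could look like the following; the channel name and output filename are assumptions:

```python
import time
import can

# Assumes the device is already exposed as SocketCAN interface 'can1'.
bus = can.interface.Bus(channel='can1', bustype='socketcan')
logger = can.Logger('capture.log')  # writer format is chosen by file extension
notifier = can.Notifier(bus, [logger])  # pumps received frames into the logger
try:
    time.sleep(60)  # capture for one minute
finally:
    notifier.stop()
    logger.stop()
    bus.shutdown()
```

The resulting capture.log could then be sent to support as requested.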
Hi @hollinsky-intrepid,
Thanks for the detailed response.
Based on it, we have updated the firmware on the impacted device, and it looks to be working well now.
We are seeing one more error message on another device:
ValueCAN 3 121397 Error: There was a checksum error while decoding a packet. The packet was dropped.
However, that device does not have up-to-date firmware. We will try to update the firmware; just in advance, if that does not help, what could be the reason for this error?
Glad that updating the firmware helped. Both of the error messages you list are related and likely stem from the same underlying issue, so I suspect that updating the firmware on that one would help as well. Otherwise, I'd apply the same debugging steps as before.
--Paul
We still observe the checksum error after the firmware update, a reset to defaults, and some experimenting with power/ground.
Is it possible to print the frame/packet that has the checksum error to the logs, as-is?
One more question: how do we get a bus capture?
@hollinsky-intrepid, I can't reach you either here or via support@intrepidcs.com. Any chance of getting your support?
I have created the `checksum-failure-logging` branch at https://github.com/intrepidcs/icsscand/tree/checksum-failure-logging, which will cause the daemon to log these checksum errors directly to stderr.
Also ensure that libicsneo gets updated when you switch to this branch of icsscand: `git submodule update --recursive`
Once running, the reported revisions should be:
The libicsneo SocketCAN Usermode Daemon
Copyright Intrepid Control Systems, Inc. 2019
Daemon v2.0.2 checksum-failure-logging
libicsneo v0.2.0 checksum-failure-logging @
Perhaps by running this revision we can get to the bottom of the checksum errors.
Thank you. I hope we will get the new logs in journalctl, as we run this daemon in the background. I will let you know the results.
Since we switched to the `checksum-failure-logging` branch, the checksum error has not happened for 3 days already. Are there any other changes besides the additional logging?
The version of the `icsscand` daemon where we see the checksum error is:
The libicsneo SocketCAN Usermode Daemon
Copyright Intrepid Control Systems, Inc. 2019
Daemon v2.0.2 HEAD
libicsneo v0.1.2 HEAD @
The `icsscand` version with the additional logging is:
The libicsneo SocketCAN Usermode Daemon
Copyright Intrepid Control Systems, Inc. 2019
Daemon v2.0.2 checksum-failure-logging
libicsneo v0.2.0 HEAD @
The output for `libicsneo` is not as you mentioned; however, the `libicsneo` submodule points to:
icsscand/third-party/libicsneo$ git status
HEAD detached at f16ee63
nothing to commit, working directory clean
which looks to be correct.
One issue came up on the `checksum-failure-logging` branch: at some point during the weekend, CAN stopped working with the error `CanError: can.socketcan.ctypes failed to transmit`. This error comes from Python, but Linux tools (e.g. `candump any`) did still work. This was resolved by restarting the `icsscand` daemon. No other errors in the log.
The issue with being unable to send/receive any CAN messages happened again. Some observations are below.
The error that shows this issue:
$ cansend can1 001#0000
write: No buffer space available
The recovery method is to restart the `icsscand` daemon. Please let us know how to debug this issue further.
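One low-cost thing to watch while debugging (a suggestion of ours; the same file is quoted later in this thread) is the kernel's global CAN statistics:

```python
# Print the kernel-wide CAN frame counters, the same data as
# `cat /proc/net/can/stats`; TXF/RXF are total transmitted/received frames.
with open('/proc/net/can/stats') as f:
    for line in f:
        line = line.strip()
        if line.endswith('(TXF)') or line.endswith('(RXF)'):
            print(line)
```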
And some outputs:
$ ifconfig
can0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
UP RUNNING NOARP MTU:72 Metric:1
RX packets:1393 errors:0 dropped:0 overruns:0 frame:0
TX packets:1393 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10
RX bytes:11144 (11.1 KB) TX bytes:8 (8.0 B)
can1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
UP RUNNING NOARP MTU:72 Metric:1
RX packets:5380670 errors:0 dropped:0 overruns:0 frame:0
TX packets:2034510 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10
RX bytes:31664185 (31.6 MB) TX bytes:8 (8.0 B)
$ strace cansend can1 001#0000
execve("/usr/bin/cansend", ["cansend", "can1", "001#0000"], [/* 32 vars */]) = 0
brk(NULL) = 0x14e5000
uname({sysname="Linux", nodename="ptc_lab_collab3_node15", ...}) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6fc8000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=63024, ...}) = 0
mmap2(NULL, 63024, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb6f92000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/arm-linux-gnueabihf/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0\331i\1\0004\0\0\0"..., 512) = 512
lseek(3, 894128, SEEK_SET) = 894128
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2960) = 2960
lseek(3, 888324, SEEK_SET) = 888324
read(3, "A2\0\0\0aeabi\0\1(\0\0\0\0057-A\0\6\n\7A\10\1\t\2\n\3\f"..., 51) = 51
fstat64(3, {st_mode=S_IFREG|0755, st_size=897088, ...}) = 0
mmap2(NULL, 963928, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb6ea6000
mprotect(0xb6f7c000, 65536, PROT_NONE) = 0
mmap2(0xb6f8c000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xd6000) = 0xb6f8c000
mmap2(0xb6f8f000, 9560, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb6f8f000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb6fc7000
set_tls(0xb6fc74c0, 0xb6fc7b98, 0xb6fca050, 0xb6fc74c0, 0xb6fca050) = 0
mprotect(0xb6f8c000, 8192, PROT_READ) = 0
mprotect(0x21000, 4096, PROT_READ) = 0
mprotect(0xb6fc9000, 4096, PROT_READ) = 0
munmap(0xb6f92000, 63024) = 0
socket(PF_CAN, SOCK_RAW, 1) = 3
access("/proc/net", R_OK) = 0
access("/proc/net/unix", R_OK) = 0
socket(PF_LOCAL, SOCK_DGRAM|SOCK_CLOEXEC, 0) = 4
ioctl(4, SIOCGIFINDEX, {ifr_name="can1", }) = 0
close(4) = 0
setsockopt(3, 0x65 /* SOL_?? */, 1, NULL, 0) = 0
bind(3, {sa_family=AF_CAN, sa_data="\203\276\7\0\0\0\0\0\0\0<\361\362\276"}, 16) = 0
write(3, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 16) = -1 ENOBUFS (No buffer space available)
dup(2) = 4
fcntl64(4, F_GETFL) = 0x2 (flags O_RDWR)
brk(NULL) = 0x14e5000
brk(0x1506000) = 0x1506000
fstat64(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
write(4, "write: No buffer space available"..., 33write: No buffer space available
) = 33
close(4) = 0
exit_group(1) = ?
+++ exited with 1 +++
$ ip -details -statistics link show can1
7: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state STOPPED restart-ms 0
clock 0
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
RX: bytes packets errors dropped overrun mcast
31664185 5380670 0 0 0 0
TX: bytes packets errors dropped carrier collsns
8 2034510 0 0 0 0
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3172 root 20 0 53156 1184 724 S 18.2 0.2 950:31.71 icsscand
The stack trace from a core dump of the running `icsscand` process, taken while the issue above was observed:
$ sudo gdb /usr/bin/icsscand core.3172 -ex "thread apply all bt" -ex "quit"
[sudo] password for orangepi:
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "arm-linux-gnueabihf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/bin/icsscand...(no debugging symbols found)...done.
[New LWP 3172]
[New LWP 3173]
[New LWP 3174]
[New LWP 3175]
[New LWP 3176]
[New LWP 3177]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/usr/bin/icsscand'.
#0 0xb6f4a454 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
[Current thread is 1 (Thread 0xb6fc5000 (LWP 3172))]
Thread 6 (Thread 0xb44ff450 (LWP 3177)):
#0 0xb6f4a454 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1 0xb6f479a8 in do_futex_wait () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2 0xb6f47b64 in __new_sem_wait_slow () from /lib/arm-linux-gnueabihf/libpthread.so.0
#3 0xb6f47bd4 in sem_timedwait () from /lib/arm-linux-gnueabihf/libpthread.so.0
#4 0x00054052 in icsneo::FTDI::writeTask() ()
#5 0xb6eb9dc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#6 0xb6f415b4 in start_thread () from /lib/arm-linux-gnueabihf/libpthread.so.0
#7 0xb6dacc4c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 5 (Thread 0xb4cff450 (LWP 3176)):
#0 0xb6da6bf0 in poll () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 4 (Thread 0xb56ff450 (LWP 3175)):
#0 0xb6f4a454 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1 0xb6f479a8 in do_futex_wait () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2 0xb6f47b64 in __new_sem_wait_slow () from /lib/arm-linux-gnueabihf/libpthread.so.0
#3 0xb6f47bd4 in sem_timedwait () from /lib/arm-linux-gnueabihf/libpthread.so.0
#4 0x0005dfec in icsneo::ICommunication::readWait(std::vector<unsigned char, std::allocator<unsigned char> >&, std::chrono::duration<long long, std::ratio<1ll, 1000ll> >, unsigned int) ()
#5 0x0003cf7e in icsneo::Communication::readTask() ()
#6 0xb6eb9dc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
---Type <return> to continue, or q <return> to quit---
#7 0xb6f415b4 in start_thread () from /lib/arm-linux-gnueabihf/libpthread.so.0
#8 0xb6dacc4c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 3 (Thread 0xb5eff450 (LWP 3174)):
#0 0xb6da6bf0 in poll () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0xb6875450 (LWP 3173)):
#0 0xb6f48dbe in nanosleep () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1 0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0xb6fc5000 (LWP 3172)):
#0 0xb6f4a454 in __libc_do_syscall () from /lib/arm-linux-gnueabihf/libpthread.so.0
#1 0xb6f45dba in pthread_cond_wait@@GLIBC_2.4 () from /lib/arm-linux-gnueabihf/libpthread.so.0
#2 0xb6eb65a0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#3 0x0005e8aa in icsneo::ICommunication::write(std::vector<unsigned char, std::allocator<unsigned char> > const&) ()
#4 0x0004b9a2 in icsneo::Device::transmit(std::shared_ptr<icsneo::Message>) ()
#5 0x00032e10 in main ()
Hi tysonite,
I can see where the issue is occurring. One thing that will help narrow down the problem: are you unable to receive any CAN during this time? i.e. not counting the CAN you are transmitting, can you see other CAN coming in?
--Paul
According to `candump any`, there were no CAN messages coming in or out at the time the `write: No buffer space available` issue was observed. But all other devices/software were turned off, so I can't confirm whether CAN messages coming from the remote end to the OrangePi could be dumped or not. This will actually be quite problematic to confirm, as the other device boots (and sends CAN messages) at the same time that we send CAN messages out from the OrangePi.
Actually, we tried `cansend can0 ...` and there was no error like the above.
Okay. I will have to look further into the issue. Your cansend is failing due to the backpressure system when the write queues fill up. It would be expected that your writes would eventually fail if you were consistently sending more than the bus could handle, though I'd expect this to resolve itself after your application stopped transmitting at that rate.
Is your application sending a lot of traffic on this bus? Is it possible that you are sending in a while loop without a sleep?
I think something may be wrong there and the application is getting stuck in the "queue full" state. However, there are other things I should look into if your application normally does not transmit much on this channel.
--Paul
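For what it's worth, a sender that tolerates the backpressure Paul describes might look like the sketch below (our own code, not from the thread; `send_with_backoff` is a hypothetical helper): catch the transmit error and back off briefly while the write queue drains.

```python
import time
import can

def send_with_backoff(bus, msg, retries=10, delay=0.1):
    # On a full write queue, python-can raises CanError wrapping
    # [Errno 105] No buffer space available; back off and retry.
    for attempt in range(retries):
        try:
            bus.send(msg, timeout=0.1)
            return True
        except can.CanError:
            time.sleep(delay * (attempt + 1))  # linear backoff while the queue drains
    return False
```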
Thanks.
It does not resolve itself after our application stops transmitting. Only an OrangePi reboot or an `icsscand` restart helps.
There are about 115,000 CAN messages in total transferred within 2 hours in both directions, according to the log (roughly 16 messages per second on average). Not sure whether this is much or not.
Also, when we are sending CAN messages out to the system under test from our application, there is a 100 ms delay after each message. However, this loop usually includes no more than 20-50 messages in a row. Then there are some pre/post steps that don't involve exercising the system under test with CAN messages; only general periodic traffic is present.
Just for the sake of history: this started happening after we upgraded to the branch that includes the additional logging for the checksum error. Before that, we never observed such behavior with `write: No buffer space available`. Also, there were no noticeable changes in our application.
I am wondering whether the additional changes made on the logging branch cause this; it looks like there are changes unrelated to the checksum error. BTW, the checksum error is not observed anymore, or it is hidden by a different issue.
Some new observations: when the issue discussed above happens, we restart `icsscand`; after that, there are a lot of messages in syslog like:
Nov 22 03:56:26 icsscand[9755]: The libicsneo SocketCAN Usermode Daemon
Nov 22 03:56:26 icsscand[9755]: Copyright Intrepid Control Systems, Inc. 2019
Nov 22 03:56:26 icsscand[9755]: Daemon v2.0.2 checksum-failure-logging
Nov 22 03:56:26 icsscand[9755]: libicsneo v0.2.0 HEAD @
Nov 22 03:56:26 icsscand[9755]: Driver v2.0.2
Nov 22 03:56:26 icsscand[9755]: Waiting for connections...
Nov 22 03:56:26 icsscand[9755]: Message dropped, could not find the device the kernel referenced
...
Nov 22 03:56:26 kernel: [319906.887052] intrepid: Dropping message on can1, dropped by kernel
And
Nov 22 03:56:27 icsscand[9755]: message repeated 152 times: [ Message dropped, could not find the device the kernel referenced]
Before interfaces come up:
Nov 22 03:56:29 icsscand[9755]: Creating network interface hscan_121413
Nov 22 03:56:29 icsscand[9755]: Message dropped, could not find the device the kernel referenced
Nov 22 03:56:29 icsscand[9755]: message repeated 93 times: [ Message dropped, could not find the device the kernel referenced]
Nov 22 03:56:29 icsscand[9755]: Created network interface hscan_121413
Nov 22 03:56:29 icsscand[9755]: Creating network interface mscan_121413
Nov 22 03:56:29 icsscand[9755]: Message dropped, could not find the device the kernel referenced
Nov 22 03:56:29 icsscand[9755]: message repeated 96 times: [ Message dropped, could not find the device the kernel referenced]
Nov 22 03:56:29 icsscand[9755]: Created network interface mscan_121413
Nov 22 03:56:29 icsscand[9755]: ValueCAN 3 121413 connected
Nov 22 03:56:29 icsscand[9755]: Message dropped, could not find the device the kernel referenced
Hi @hollinsky-intrepid, did you get a chance to look further?
We are actively looking into the issue and running some tests to see if we can reproduce your issue here. Due to the fact that the errant behavior only appears after several days straight of running, it may take some time to reproduce and fix.
--Paul
Thank you. If you need any other info, please feel free to ask. If possible, I will share it.
A few more logs: on one of our OrangePis that sometimes experiences the issue, the following log records appeared. However, the `icsscand` daemon continued to work properly.
Nov 27 11:05:51 systemd-journald[387]: Suppressed 13083 messages from /system.slice/icsscand.service
Nov 27 11:05:51 icsscand[9755]: Discarding byte 22
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 02
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte ac
Nov 27 11:05:51 icsscand[9755]: Discarding byte 04
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 01
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 03
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 10
Nov 27 11:05:51 icsscand[9755]: Discarding byte 40
Nov 27 11:05:51 icsscand[9755]: Discarding byte 1f
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 43
Nov 27 11:05:51 icsscand[9755]: Discarding byte 11
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 42
Nov 27 11:05:51 icsscand[9755]: Discarding byte 18
Nov 27 11:05:51 icsscand[9755]: Discarding byte 97
Nov 27 11:05:51 icsscand[9755]: Discarding byte 3d
Nov 27 11:05:51 icsscand[9755]: Discarding byte 4a
Nov 27 11:05:51 icsscand[9755]: Discarding byte 1c
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 01
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Nov 27 11:05:51 icsscand[9755]: Discarding byte 00
Also, `ifconfig` reports dropped packets:
can1 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00
UP RUNNING NOARP MTU:72 Metric:1
RX packets:8464235 errors:0 dropped:437 overruns:0 frame:0
TX packets:3299428 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:10
RX bytes:50615283 (50.6 MB) TX bytes:8 (8.0 B)
@tysonite I have still not been able to reproduce here. In many cases when I simulated an error though, these "discarding byte" messages were preceded by a line "A checksum error has occurred!" along with more information about the error. Did you see anything like that in your logs?
@hollinsky-intrepid, "checksum error" message went away after we switched to branch with additional logs. Only "discarding bytes" present sometimes.
I already asked twice, but didn't receive a reply. It looks to me like the `checksum-failure-logging` branch contains more changes than just the additional logging, compared to master:
> Checksum failure logging to stderr for debugging
> Clarify FlexRay Slot ID
> add CMake options to disable certain targets and fix a few compiler warnings
> fix MSVC build errors and warnings
> Quiet down warnings on Windows
> Merge branch 'v0.2.0-dev' of github.com:intrepidcs/libicsneo into v0.2.0-dev
> Begin work on FlexRay support
> Resolve merge conflicts
> Added more mutexes in eventmanager, fixed exiting destruction bug
> Fixed wait check in device online/offline
> fix building for macos
> Reordered error downgrading, order of going offline, and renamed locks
> Replaced formal parameter filter in addmessagecallback and addeventcallback
> Changed getpollingmessagelimit to return an int
> Added waits to ensure going online and offline succeed before returning
> Added documentation for event callbacks
> Testing successful removal of events from within event callbacks
> Moved calling of event callbacks to after the event is added. Now using unique_locks on the list of events to allow event callbacks to safely modify the list of events
> Added event callbacks multithreaded test and added documentation
> Fixed runtime error casting placeholder void* to void in message and event callbacks
> Added event callback functionality in icsneoc
> Added event callback functionality. EventManager now uses multiple mutexes to lock events, errors, callbacks, and downgradedThreads separately. Wrote single-threaded test for event callbacks.
> Added error flag when packetizer and multichannelcommunication fail to read
> Added check to see if message polling is enabled before disabling it in device destructor
> Removed cancellation of error downgrading after closing, was causing read access violations.
> Removed radpluto override of setupDecoder() in order to fix timestamp
> Added placeholder void* filter parameter to icsneo_addMessageCallback and added more documentation for message callback functions
> Cleaned up icsneolegacy returns that use icsneo_settingsReadStructure
> Added message callback functionality in c and canceled error downgrading on user callbacks in communication as well
> Standardized int-returning functions in icsneoc library to return -1 on failure. Updated icsneolegacy accordingly, and added headers for message callback functionality in c
> Error downgrading now uses a map<threadid, bool>, and error downgrading is canceled during message callbacks
> Added flag BUILD_DOCS for documentation in CMakeLists.txt
> Added overview section for API concepts and added more detailed explanation of message callbacks and polling
> Ignoring _build
> POSIX FTDI: Improve performance by setting the latency timer
> Added official documentation for device concepts in usage
> Wrote official documentation for the events and error system
> Updated some documentation in eventmanager to reflect chronological pruning of events
> Updated copyrights from 2018 to 2019
> Updated library version numbers to 0.2.0
> Added initial documentation files for c#, java, and usage
> Added functionality to set blocking behavior of write
> Device::GetMessages now returns a pair of {vector, bool} instead of vector
> Downgraded non-user thread errors to warnings
> Implemented thread specific error-downgrading to events and wrote corresponding unit test
> Renamed test suite in CMakeLists.txt
> Reverted refactor changes to icsneolegacy.h
> Error checking for Encoder::encode UpdateLEDState
> Revise documentation for icsneo_getLastError
> Removed casts in network.h constants for SWIG compatibility
> Cleaned up some formatting
> Updated documentation, added multithreaded test with both events and errors for eventmanager
> Events are now removed purely in chronological order if overflowing
> Reordered tests
> Added functionality for filtering by both type and severity. Wrote tests for eventmanager::get
> Went back to normal mutex because I forgot I reverted those
> Wrote more tests, fixed bug with eventmanager::get(size), optimized eventmanager adding and shrinking, prevented trying to add TooManyEvents
> Revert "Changed list of events to vector, switched back to using lock_guard instead of shared/unique_locks"
> Changed list of events to vector, switched back to using lock_guard instead of shared/unique_locks
> Fixed multithreading issue with eventmanager
> Wrote more tests and fixed locking bug in eventmanager
> CMake properly uses -DBUILD_TESTS flag, added CmakeSettings.json to gitignore
> Separated tests into different files, added ResetInstance() method to EventManager
> Test framework tentatively working on windows
> Reversed nullptr device check
> Added gtest folder and building doesn't throw errors
> Builds and runs on both windows and ubuntu
> Updated README for refactor
> Runs on windows
> Event refactor builds on Windows
> Implemented threaded error handling in errormanager and fixed compile error
> Upgraded the severity of some warnings to errors and cleaned up internal API error adding
> Initial rework of internal management for overflowing errors
> C++ error descriptions now distinguish between severity levels
> icsneolegacy icsneoOpenNeoDevice return now checks success for setting polling message limit and enabling message polling too
> Added error checking for polling
> Added error checking to Device::getMessages()
> Added additional documentation
> Reworked updating LEDState to use sendCommand() instead of transmit()
> Added icsneo_isMessagePollingEnabled(), removed polling check from transmitting, and disabled error reporting on device destruction
> Fix for error at pcap and vcp level on Windows causes crash
> Removed extraneous error checking
> Finished adding error checking
> Added more error logging
> Commands can now only be sent when the device is open
> Transmit and receiving messages now fails properly when the device is closed. Device closes when serial number initialization fails after open.
> Added error checking to functions called from Communication
> Added error reporting to idevicesettings
> Added error checking and removed some redundancy from device isOpen()
> Added errors for DeviceCurrentlyOpen and DeviceCurrentlyClosed
> Added .vs to gitignore
> Added icsneo_isOpen() functionality
> Fixed typo in documentation of icsneo_getLastError
> Removed hard coded .dll filepaths
> Fixed icsneoc.h to work with c
> Lock before adding or removing message callbacks
Can you please double-check whether all of this is really necessary, and whether we are really running the same versions on both sides? I feel we are testing changes that are still in development.
This is how we switched to the `checksum-failure-logging` branch:
git clone --recursive https://github.com/intrepidcs/icsscand.git icsscand
cd icsscand && git checkout checksum-failure-logging && git submodule update --recursive
Did we do anything wrong?
@hollinsky-intrepid Any feedback on previous message?
Hi @tysonite,
I see the line `Nov 27 11:05:51 systemd-journald[387]: Suppressed 13083 messages from /system.slice/icsscand.service`; this is journald hitting its rate-limiting threshold and dropping log messages. These are log messages that we need, since they will tell us about the checksum error.
You can remove/change the rate limiting on journald:
# in /etc/systemd/journald.conf
RateLimitInterval=0
RateLimitBurst=0
$ systemctl restart systemd-journald
(see more here)
For system performance reasons, once you capture the logs we need, I'd recommend setting this back to the default.
After this, all messages should be collected and we can look further.
About the new version: `checksum-failure-logging` is based on `v0.2.0-dev`. The only reason it has not been merged into master is that we've made some breaking API changes, and we would like to make all breaking API changes at once. The branch is stable and we use it internally. We had made improvements to the checksum error handling on `v0.2.0-dev` (such that it does not just say "An unknown internal error occurred" like `master` does), hence using it as the base for your `checksum-failure-logging` branch.
Hi @hollinsky-intrepid,
Thank you for the clarifications and the nice catch on the systemd log rate limit. We've also found some inconsistencies in how our system is used by consumers, and we are trying to prevent this in the future as well as testing our defense mechanisms. In short, periodic messages sent by our system never terminate, new periodic messages queue up in addition, and this probably leads to the malfunction of icsscand after some period of time. We plan to fix this misbehavior first, enable systemd logs for icsscand, and wait for a reproduction, if it happens.
One more question, if you don't mind: there is a so-called "bus-off" state of the CAN hardware, and recovery from this state can be enabled with `sudo ip link set canX type can restart-ms 100`. Is this supported by your driver/icsscand?
Hi @tysonite,
That all sounds good; let us know if you encounter any other problems once your end has the fix applied.
Our hardware is programmed to avoid bus-off. Therefore, recovery will never be necessary. The TEC/REC are exposed to the user currently as a CAN message, though I'd like to make a better API for this in the near future.
--Paul
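Given that answer, one way to sanity-check the interface state from a script is to parse the same `ip -details` output quoted earlier in this thread; a small sketch of ours, assuming the `ip` tool is available:

```python
import subprocess

def can_state(ifname):
    # `ip -details link show canX` prints a line like 'can state STOPPED restart-ms 0'.
    out = subprocess.check_output(['ip', '-details', 'link', 'show', ifname])
    out = out.decode('ascii', 'replace')
    for state in ('BUS-OFF', 'ERROR-PASSIVE', 'ERROR-WARNING',
                  'ERROR-ACTIVE', 'STOPPED', 'SLEEPING'):
        if 'can state ' + state in out:
            return state
    return 'UNKNOWN'

print(can_state('can1'))
```

If the hardware indeed never goes bus-off, this should never report BUS-OFF for these interfaces.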
Hi @hollinsky-intrepid,
After some more experiments, I have come to the conclusion that there is something weird happening with the ValueCAN3 device when it is used for a long amount of time. This method does not exactly reproduce the complete freeze of the icsscand daemon, but it at least gives a similar error while sending CAN messages continuously. Please see my findings below.
Below is a Python 2.7 script that after some time produces a `Failed to transmit: [Errno 105] No buffer space available` error on the standard output.
```python
#!/usr/bin/env python
from __future__ import print_function
import time
import threading
import can

try:
    bus = can.ThreadSafeBus(interface='socketcan', channel='ics0can1')
except:
    bus = can.ThreadSafeBus(interface='socketcan', channel='can1')

def send_periodic():
    msg = can.Message(arbitration_id=0x122, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
    bus.send_periodic(msg, float(0.1))
    print("periodic thread")
    for i in range(100):
        msg = can.Message(arbitration_id=i,
                          data=[0, 25, 0, 1, 3, 1, 4, 1])
        bus.send_periodic(msg, float(0.1))
    print("periodic thread done")

def par_thread():
    print("send thread")

    def func():
        msg = can.Message(arbitration_id=0x999, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
        i = 0
        while True:
            try:
                i = i + 1
                if i % 100 == 0:
                    print("i = {}".format(i))
                bus.send(msg, 0.1)
                time.sleep(0.5)
            except Exception as e:
                print("{} - {} - {}".format(threading.active_count(), i, e))

    thread = threading.Thread(target=func)
    thread.start()
    print("send thread started")

if __name__ == '__main__':
    par_thread()
    send_periodic()
    print("sending messages")
    time.sleep(10000)
```
In order to run it, you need to have the `python-can` package installed:
pip install python-can==3.3.2
This script creates a huge load that is not a real case in our production environment, but it at least moves us a bit toward understanding the issue (hopefully). It sends 100 periodic messages using the native Linux SocketCAN interface (`bus.send_periodic`) and in parallel sends another CAN message (`bus.send`).
After about 5-10 minutes, it starts printing `2 - 3692 - Failed to transmit: [Errno 105] No buffer space available` messages, but recovers after a few seconds, then starts printing them again, and so on. This is probably expected, as `txqueuelen` is equal to 10 on the `can1` interface.
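If the short queue is the proximate cause of the ENOBUFS bursts, one mitigation to try (our assumption, not something suggested in the thread, and it would only mask a genuine daemon stall) is raising the interface's transmit queue length:

```python
import subprocess

# Equivalent to: ip link set can1 txqueuelen 1000
# Assumes the intrepid netdev honors txqueuelen like other SocketCAN drivers.
subprocess.check_call(['ip', 'link', 'set', 'can1', 'txqueuelen', '1000'])
```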
While the script runs, I observe the TX/RX counters using the `ip` command (all zeros are good in this example, as the driver was restarted):
$ ip -details -statistics link show can1
7: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state STOPPED restart-ms 0
clock 0
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
RX: bytes packets errors dropped overrun mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
At the very first run, they increase fast enough, but over time, and after several restarts of the script, the speed at which they increase degrades badly. Finally they increase by 10-20 per second, while at the beginning it was around 300-400.
Even if the OrangePi is rebooted, to exclude any potential issues in the CAN drivers/daemon and to free up any stale resources, the counters still increase very slowly, and this makes me think that the ValueCAN3 is busy with some unexpected tasks.
After some time (around 40-60 minutes), the ValueCAN3 gets back to normal. The same happens on either CAN driver/daemon (v1 or v2). No errors are reported by `icsscand` on stdout/stderr or by the `ip` command.
I would like to get your thoughts/suggestions on this, as I feel that probably the same kind of thing is happening in our production environment, but since it runs for a very long time (days/weeks), the circumstances look different.
Hi @hollinsky-intrepid,
I am able to reproduce an issue with `write: No buffer space available` using the script below. It takes around 48 hours (or probably less), as I kept it running over the weekend. I feel that this is something in the software: after the issue is reproduced, the system doesn't come back up even if I wait for a reasonable period of time, and the `icsscand` daemon does not consume much CPU when this is observed (usually it consumes about 20% in idle mode with no CAN traffic).
The network topology is as follows: the OrangePi is connected to a ValueCAN3/4, and another device that produces about 5-8 periodic messages with a 0.2-second period is connected to the same ValueCAN3/4. There are 2 interfaces (can0, can1); can1 has traffic, can0 has none at all.
The only recovery is to restart the `icsscand` daemon. Nothing is in the logs. If you want me to provide some info for debugging, please let me know; I'll keep it in this error state.
```python
#!/usr/bin/env python
from __future__ import print_function
import time
import threading
import can

try:
    bus = can.ThreadSafeBus(interface='socketcan', channel='ics0can1')
except:
    bus = can.ThreadSafeBus(interface='socketcan', channel='can1')

def send_periodic():
    msg = can.Message(arbitration_id=0x122, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
    bus.send_periodic(msg, float(0.1))
    print("periodic thread")
    for i in range(80):
        msg = can.Message(arbitration_id=i,
                          data=[0, 25, 0, 1, 3, 1, 4, 1])
        bus.send_periodic(msg, float(0.1))
    print("periodic thread done")

def send_msg(msg):
    retries = 5
    last_exception = RuntimeError('send_msg failed')
    while retries:
        try:
            bus.send(msg, 0.1)
            return
        except can.CanError as e:
            last_exception = e
            if retries > 0:
                sleep_time = 5 - (retries * 0.5)
                print("CanError: {}, {} retries left. sleeping for {}".format(e, retries, sleep_time))
                time.sleep(sleep_time)
            retries = retries - 1
    raise last_exception

def par_thread():
    print("send thread")

    def func():
        msg = can.Message(arbitration_id=0x999, data=[0x04, 0x02, 0, 0, 0, 0, 0, 0])
        i = 0
        while True:
            try:
                i = i + 1
                if i % 100 == 0:
                    print("i = {}".format(i))
                send_msg(msg)
                time.sleep(0.5)
            except Exception as e:
                print("{} - {} - {}".format(threading.active_count(), i, e))

    thread = threading.Thread(target=func)
    thread.start()
    print("send thread started")

if __name__ == '__main__':
    par_thread()
    send_periodic()
    print("sending messages")
    time.sleep(10000)
```
Very peculiar. I left mine running over the weekend on my workstation as well and have 242704818 packets sent thus far without error. There are no logs in the Python console other than from i = 0 to i = 479700. I'm running on the `checksum-failure-logging` branch and kernel 5.3.0-24.
I also have yet to observe the CPU usage you mentioned before, so I'm starting to wonder whether the OrangePi-specific kernel (or possibly just an older kernel) is doing something different. It is also curious since you are only loading the bus to ~30%, assuming a 500k bus.
The ValueCAN is configured with a 125k can1 bus. However, the same happens with a very small amount of traffic (7-10% bus load). The kernel version is `Linux 4.13.15-sunxi #1 SMP Tue Nov 21 23:35:46 MSK 2017 armv7l armv7l armv7l GNU/Linux`.
I have the following statistics now:
$ cat /proc/net/can/stats
24648205 transmitted frames (TXF)
25448625 received frames (RXF)
25443417 matched frames (RXMF)
99 % total match ratio (RXMR)
117 frames/s total tx rate (TXR)
121 frames/s total rx rate (RXR)
100 % current match ratio (CRXMR)
0 frames/s current tx rate (CTXR)
0 frames/s current rx rate (CRXR)
211 % max match ratio (MRXMR)
894 frames/s max tx rate (MTXR)
1707 frames/s max rx rate (MRXR)
0 current receive list entries (CRCV)
4 maximum receive list entries (MRCV)
5 statistic resets (STR)
3 user statistic resets (USTR)
$ ip -details -statistics link show can1
13: can1: <NOARP,UP,LOWER_UP,ECHO> mtu 72 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state STOPPED restart-ms 0
clock 0
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
RX: bytes packets errors dropped overrun mcast
168119960 21264165 0 0 0 0
TX: bytes packets errors dropped carrier collsns
8 20589948 0 0 0 0
When it reproduces, the script prints continuously:
CanError: Failed to transmit: [Errno 105] No buffer space available, 5 retries left. sleeping for 2.5
CanError: Failed to transmit: [Errno 105] No buffer space available, 4 retries left. sleeping for 3.0
CanError: Failed to transmit: [Errno 105] No buffer space available, 3 retries left. sleeping for 3.5
CanError: Failed to transmit: [Errno 105] No buffer space available, 2 retries left. sleeping for 4.0
CanError: Failed to transmit: [Errno 105] No buffer space available, 1 retries left. sleeping for 4.5
2 - 58620 - Failed to transmit: [Errno 105] No buffer space available
Also, `cansend` fails with the same error.
I think we will try to get the same hardware setup that you have in order to track down the issue further.
You're using the OrangePi One with 512MB of RAM? And what type/speed of SD card are you using for it?
Are you sure? We have several other devices connected via different interfaces; it will be hard to mimic our setup.
We are actually open to having a call for a live debugging session. Would that work?
> You're using the OrangePi One with 512MB of RAM? And what type/speed of SD card are you using for it?
Yes, an OrangePi One with 512MB of RAM. I will share a picture of the SD card soon, as well as a pre-built image to flash onto it, as my team says it is possible to share it without a signed NDA.
The SD card is this one: https://www.amazon.com/SanDisk-Extreme-MicroSDXC-UHS-I-Adapter/dp/B07G3GMRYF/ref=sr_1_3?keywords=sandisk+64gb+extreme+pro&qid=1576681133&sr=8-3
And a few more updates: on `Linux 5.0.0-37-generic #40~18.04.1-Ubuntu SMP Thu Nov 14 12:06:39 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux`, the issue has not reproduced for about 2 days already, with the SUT connected. So I suspect that this is a combination of 2 reasons:
Hi @hollinsky-intrepid, did you get a chance to reproduce this on OPI? Anything we could do to help?
Hi @tysonite,
I didn't expect it to take quite this long to get hardware. In the meantime, could you get us a core dump where it's stuck? You can do `gcore [pid]` to get one.
Perhaps you could send this, along with the binaries you've built, to the support team so we can take a closer look while we're waiting for ours to arrive and exhibit the behavior you're talking about.
--Paul
I've sent the data through David. The `icsscand` daemon was in a state where it rejects CAN messages and does not recover automatically after a while, assuming a high bus load prior to `No buffer space available`.
Hi @hollinsky-intrepid,
I am sad to say it, but we can't replicate the issue for your reproduction using just the OrangePi. However, the icsscand daemon still freezes in our regular environment, and can be brought back to normal only by restarting it.
We've run the icsscand daemon through valgrind, and the results are available here. If there is some flakiness in memory access, it could potentially explain this weird behavior. Would you please take a look?
I noticed that you are using the lock-free concurrentqueue third-party library; are there real benefits compared to all the disadvantages? icsscand has just single consumer/producer threads. Also, this library received some fixes (e.g. this) at the end of 2019; however, even with it updated, the "freeze" happens.
Here is one more backtrace with debug info included:
[New LWP 28279]
[New LWP 28284]
[New LWP 28285]
[New LWP 28286]
[New LWP 28287]
[New LWP 28288]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `/opt/test-automation/ta-node-icsscand/icsscand'.
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
[Current thread is 1 (Thread 0xb6f56000 (LWP 28279))]
Thread 6 (Thread 0xb44ff450 (LWP 28288)):
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1 0xb6ed89a8 in futex_abstimed_wait_cancelable (private=0, abstime=0xb44fecb4, expected=1, futex_word=0xb5f41724) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#2 do_futex_wait (sem=sem@entry=0xb5f41724, abstime=abstime@entry=0xb44fecb4) at sem_waitcommon.c:115
#3 0xb6ed8b64 in __new_sem_wait_slow (sem=0xb5f41724, abstime=0xb44fecb4) at sem_waitcommon.c:282
#4 0xb6ed8bd4 in sem_timedwait (sem=<optimized out>, abstime=<optimized out>) at sem_timedwait.c:36
#5 0x00134310 in moodycamel::details::Semaphore::timed_wait (this=0xb5f41724, usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:224
#6 0x0013b156 in moodycamel::LightweightSemaphore::waitWithPartialSpinning (this=0xb5f41720, timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:280
#7 0x0013b25a in moodycamel::LightweightSemaphore::wait (this=0xb5f41720, timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:362
#8 0x0013cb6e in moodycamel::BlockingConcurrentQueue<icsneo::ICommunication::WriteOperation, moodycamel::ConcurrentQueueDefaultTraits>::wait_dequeue_timed<icsneo::ICommunication::WriteOperation> (this=0xb5f16324, item=..., timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/blockingconcurrentqueue.h:378
#9 0x0013c006 in moodycamel::BlockingConcurrentQueue<icsneo::ICommunication::WriteOperation, moodycamel::ConcurrentQueueDefaultTraits>::wait_dequeue_timed<icsneo::ICommunication::WriteOperation, long long, std::ratio<1ll, 1000ll> > (this=0xb5f16324, item=..., timeout=...) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/blockingconcurrentqueue.h:394
#10 0x001427cc in icsneo::FTDI::writeTask (this=0xb5f161c8) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/platform/posix/ftdi.cpp:196
#11 0x001450fa in std::_Mem_fn_base<void (icsneo::FTDI::*)(), true>::operator()<, void>(icsneo::FTDI*) const (this=0xb5f48414, __object=0xb5f161c8) at /usr/include/c++/5/functional:600
#12 0x001450b2 in std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)>::_M_invoke<0u>(std::_Index_tuple<0u>) (this=0xb5f48410) at /usr/include/c++/5/functional:1531
#13 0x00144fbc in std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)>::operator()() (this=0xb5f48410) at /usr/include/c++/5/functional:1520
#14 0x00144f2a in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)> >::_M_run() (this=0xb5f48404) at /usr/include/c++/5/thread:115
#15 0xb6e4adc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#16 0xb6ed25b4 in start_thread (arg=0x0) at pthread_create.c:335
#17 0xb6d3dc4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 5 (Thread 0xb4cff450 (LWP 28287)):
#0 0xb6d37bf2 in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0xb6ef7334 in poll (__timeout=60000, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/arm-linux-gnueabihf/bits/poll2.h:46
#2 handle_events (ctx=ctx@entry=0xb5f290f0, tv=tv@entry=0xb4cfec94) at ../../libusb/io.c:2103
#3 0xb6ef7db6 in libusb_handle_events_timeout_completed (ctx=0xb5f290f0, tv=tv@entry=0xb4cfeccc, completed=0xb4cfed10) at ../../libusb/io.c:2285
#4 0xb6ef7ea8 in libusb_handle_events_completed (ctx=<optimized out>, completed=<optimized out>) at ../../libusb/io.c:2384
#5 0xb6ef84d8 in sync_transfer_wait_for_completion (transfer=transfer@entry=0xb4e06050) at ../../libusb/sync.c:51
#6 0xb6ef8572 in do_sync_bulk_transfer (dev_handle=0xb5f11d20, endpoint=<optimized out>, buffer=0xb5f20bd0 "1`\252\f ", length=4096, transferred=transferred@entry=0xb4cfed7c, timeout=timeout@entry=100, type=type@entry=2 '\002') at ../../libusb/sync.c:180
#7 0xb6ef87a4 in libusb_bulk_transfer (dev_handle=<optimized out>, endpoint=<optimized out>, data=<optimized out>, length=<optimized out>, transferred=0xb4cfed7c, timeout=100) at ../../libusb/sync.c:257
#8 0x0015fde0 in ftdi_read_data (ftdi=0xb5f28fc0, buf=0xb4cfeddc "b\002", size=8) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/third-party/libftdi/src/ftdi.c:1948
#9 0x00142b5a in icsneo::FTDI::FTDIContext::read (this=0xb5f164bc, data=0xb4cfeddc "b\002", size=8) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/platform/posix/ftdi.h:41
#10 0x0014270e in icsneo::FTDI::readTask (this=0xb5f161c8) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/platform/posix/ftdi.cpp:184
#11 0x001450fa in std::_Mem_fn_base<void (icsneo::FTDI::*)(), true>::operator()<, void>(icsneo::FTDI*) const (this=0xb5f290e4, __object=0xb5f161c8) at /usr/include/c++/5/functional:600
#12 0x001450b2 in std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)>::_M_invoke<0u>(std::_Index_tuple<0u>) (this=0xb5f290e0) at /usr/include/c++/5/functional:1531
#13 0x00144fbc in std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)>::operator()() (this=0xb5f290e0) at /usr/include/c++/5/functional:1520
#14 0x00144f2a in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (icsneo::FTDI::*)()> (icsneo::FTDI*)> >::_M_run() (this=0xb5f290d4) at /usr/include/c++/5/thread:115
#15 0xb6e4adc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#16 0xb6ed25b4 in start_thread (arg=0x0) at pthread_create.c:335
#17 0xb6d3dc4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 4 (Thread 0xb56ff450 (LWP 28286)):
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1 0xb6ed89a8 in futex_abstimed_wait_cancelable (private=0, abstime=0xb56fec4c, expected=1, futex_word=0xb5f4170c) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
#2 do_futex_wait (sem=sem@entry=0xb5f4170c, abstime=abstime@entry=0xb56fec4c) at sem_waitcommon.c:115
#3 0xb6ed8b64 in __new_sem_wait_slow (sem=0xb5f4170c, abstime=0xb56fec4c) at sem_waitcommon.c:282
#4 0xb6ed8bd4 in sem_timedwait (sem=<optimized out>, abstime=<optimized out>) at sem_timedwait.c:36
#5 0x00134310 in moodycamel::details::Semaphore::timed_wait (this=0xb5f4170c, usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:224
#6 0x001344f6 in moodycamel::LightweightSemaphore::waitManyWithPartialSpinning (this=0xb5f41708, max=4, timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:321
#7 0x0013480a in moodycamel::LightweightSemaphore::waitMany (this=0xb5f41708, max=4, timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/lightweightsemaphore.h:385
#8 0x0015356a in moodycamel::BlockingConcurrentQueue<unsigned char, moodycamel::ConcurrentQueueDefaultTraits>::wait_dequeue_bulk_timed<unsigned char*&> (this=0xb5f161e4, itemFirst=@0xb56fed40: 0xb4d00d30 "", max=4, timeout_usecs=100000) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/blockingconcurrentqueue.h:467
#9 0x00153478 in moodycamel::BlockingConcurrentQueue<unsigned char, moodycamel::ConcurrentQueueDefaultTraits>::wait_dequeue_bulk_timed<unsigned char*, long long, std::ratio<1ll, 1000ll> > (this=0xb5f161e4, itemFirst=0xb4d00d30 "", max=4, timeout=...) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/third-party/concurrentqueue/blockingconcurrentqueue.h:482
#10 0x001530b4 in icsneo::ICommunication::readWait (this=0xb5f161c8, bytes=std::vector of length 4, capacity 7152 = {...}, timeout=..., limit=4) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/communication/icommunication.cpp:34
#11 0x00114de6 in icsneo::Communication::readTask (this=0xb5f11df4) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/communication/communication.cpp:167
#12 0x0011b492 in std::_Mem_fn_base<void (icsneo::Communication::*)(), true>::operator()<, void>(icsneo::Communication*) const (this=0xb5f1827c, __object=0xb5f11df4) at /usr/include/c++/5/functional:600
#13 0x0011b44a in std::_Bind_simple<std::_Mem_fn<void (icsneo::Communication::*)()> (icsneo::Communication*)>::_M_invoke<0u>(std::_Index_tuple<0u>) (this=0xb5f18278) at /usr/include/c++/5/functional:1531
#14 0x0011b2ec in std::_Bind_simple<std::_Mem_fn<void (icsneo::Communication::*)()> (icsneo::Communication*)>::operator()() (this=0xb5f18278) at /usr/include/c++/5/functional:1520
#15 0x0011b22a in std::thread::_Impl<std::_Bind_simple<std::_Mem_fn<void (icsneo::Communication::*)()> (icsneo::Communication*)> >::_M_run() (this=0xb5f1826c) at /usr/include/c++/5/thread:115
#16 0xb6e4adc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#17 0xb6ed25b4 in start_thread (arg=0x0) at pthread_create.c:335
#18 0xb6d3dc4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 3 (Thread 0xb5eff450 (LWP 28285)):
#0 0xb6d37bf0 in poll () at ../sysdeps/unix/syscall-template.S:84
#1 0xb6efbf4c in poll (__timeout=-1, __nfds=2, __fds=0xb5efee34) at /usr/include/arm-linux-gnueabihf/bits/poll2.h:46
#2 linux_udev_event_thread_main (arg=<optimized out>) at ../../libusb/os/linux_udev.c:175
#3 0xb6ed25b4 in start_thread (arg=0x0) at pthread_create.c:335
#4 0xb6d3dc4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 2 (Thread 0xb6806450 (LWP 28284)):
#0 0xb6ed9dbe in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1 0x001039ec in std::this_thread::sleep_for<long long, std::ratio<1ll, 1000ll> > (__rtime=...) at /usr/include/c++/5/thread:292
#2 0x000fe61c in deviceSearchThread () at /home/orangepi/can_issue/icsscand-source/src/main.cpp:341
#3 0x0010a6b8 in std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (this=0x2069f68) at /usr/include/c++/5/functional:1531
#4 0x0010a5b0 in std::_Bind_simple<void (*())()>::operator()() (this=0x2069f68) at /usr/include/c++/5/functional:1520
#5 0x0010a47e in std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (this=0x2069f5c) at /usr/include/c++/5/thread:115
#6 0xb6e4adc8 in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#7 0xb6ed25b4 in start_thread (arg=0x0) at pthread_create.c:335
#8 0xb6d3dc4c in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:89 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Thread 1 (Thread 0xb6f56000 (LWP 28279)):
#0 __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1 0xb6ed6dba in __pthread_cond_wait (cond=0xb5f16480, mutex=0xb5f16464) at pthread_cond_wait.c:186
#2 0xb6e475a0 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#3 0x00153162 in icsneo::ICommunication::write (this=0xb5f161c8, bytes=std::vector of length 22, capacity 25 = {...}) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/communication/icommunication.cpp:50
#4 0x001158b2 in icsneo::Communication::rawWrite (this=0xb5f11df4, bytes=std::vector of length 22, capacity 25 = {...}) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/include/icsneo/communication/communication.h:38
#5 0x0011455a in icsneo::Communication::sendPacket (this=0xb5f11df4, bytes=std::vector of length 22, capacity 25 = {...}) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/communication/communication.cpp:58
#6 0x00133014 in icsneo::Device::transmit (this=0xb5f3ed68, message=std::shared_ptr (count 2, weak 0) 0x206a040) at /home/orangepi/can_issue/icsscand-source/third-party/libicsneo/device/device.cpp:300
#7 0x000ff3ba in main (argc=1, argv=0xbefc3dd4) at /home/orangepi/can_issue/icsscand-source/src/main.cpp:490
I feel that the issue is in thread 1 and thread 6, which probably deadlock each other. Thread 1 tries to write a message in `icsneo::ICommunication::write (this=0xb5f161c8, bytes=std::vector of length 22, capacity 25 = {...})`, but blocks waiting on a condition variable. At the same time, thread 6 times out in a loop in `writeQueue.wait_dequeue_timed`, as `writeQueue` was probably empty at the time the message arrived, so the `FTDI::onWrite()` method is never called; this leads to a deadlock, as the condition variable is never notified. Does that make sense?
Hi @tysonite
I think you may well be correct in your analysis.
The valgrind results you posted are quite odd to me, since if you look at `EventManager::isDowngradingErrorsOnCurrentThread` you'll see it's just looking for a bool in an `std::map`. There could be a bug in the version of libstdc++ you're using, but I don't see anything wrong with that code. It may also be a false positive/red herring on valgrind's part.
I can definitely see what your second post is talking about with the condition variables. I suppose I had not expected that `size_approx()` would be so far off, but it's not impossible in high system load situations. Maybe your OrangePi is seeing a spike in system load that is throwing everything off there.
I want to push a fix for this and see where we end up. My fix will be two-fold. `onWrite` should be called even if nothing was dequeued, and the infinite wait should become a `wait_for` inside of a while loop so that we recover even if the CV fails for any reason.
As to your question of why we decided on concurrentqueue, we've had a great experience with it so far and have found it to be very performant. It doesn't make much of a difference with a ValueCAN, but some devices can accept megabits (or even gigabits) of data with networks like Ethernet and FlexRay so the performance makes a big difference. We may also decide to use thread-pools for decoding if necessary, making multi-producer/consumer a nice feature to have waiting in the wings. If we do find it to be a source of problems, it would not be too difficult to swap it out for another solution or something homegrown that performs a similar task.
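The second half of that fix, a bounded `wait_for` in a loop instead of an unconditional wait, transposed to Python purely for illustration (the predicate below is a placeholder for the real `writeQueue.size_approx() <= writeQueueSize` check):

```python
import threading

cv = threading.Condition()

def queue_has_room():
    return True  # placeholder predicate

def blocking_write():
    with cv:
        # Re-check the predicate on a bounded wait: even if a notification
        # is lost (the race analyzed above), the writer wakes up every
        # 100 ms instead of sleeping forever.
        while not queue_has_room():
            cv.wait(0.1)
```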
> I want to push a fix for this and see where we end up. My fix will be two-fold. onWrite should be called even if nothing was dequeued, and the infinite wait should become a wait_for inside of a while loop so that we recover even if the CV fails for any reason.
I did it a bit simpler for the sake of proof during our testing. The blocking logic you have with a condition variable can be replaced with `try_enqueue` in a loop, and `onWrite` is not required at all in that case, as synchronization takes place on the queue itself.
See this patch:
diff --git a/communication/icommunication.cpp b/communication/icommunication.cpp
index 9b1680a..ead1b0a 100644
--- a/communication/icommunication.cpp
+++ b/communication/icommunication.cpp
@@ -45,19 +45,18 @@ bool ICommunication::write(const std::vector<uint8_t>& bytes) {
}
if(writeBlocks) {
- std::unique_lock<std::mutex> lk(writeMutex);
- if(writeQueue.size_approx() > writeQueueSize)
- writeCV.wait(lk);
+ while (!writeQueue.try_enqueue(WriteOperation(bytes))) {}
} else {
if(writeQueue.size_approx() > writeQueueSize) {
report(APIEvent::Type::TransmitBufferFull, APIEvent::Severity::Error);
return false;
+ } else {
+ bool ret = writeQueue.enqueue(WriteOperation(bytes));
+ if(!ret)
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
+
+ return ret;
}
}
-
- bool ret = writeQueue.enqueue(WriteOperation(bytes));
- if(!ret)
- report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
-
- return ret;
+ return true;
}
\ No newline at end of file
diff --git a/platform/posix/ftdi.cpp b/platform/posix/ftdi.cpp
index 5f29fd6..d920f89 100644
--- a/platform/posix/ftdi.cpp
+++ b/platform/posix/ftdi.cpp
@@ -195,6 +195,5 @@ void FTDI::writeTask() {
continue;
ftdi.write(writeOp.bytes.data(), (int)writeOp.bytes.size());
- onWrite();
}
}
\ No newline at end of file
It will be running for a few days on our side to gain confidence that the issue goes away, unless you have a better patch.
BTW, `enqueue` allocates memory in case the queue is full, while `try_enqueue` does not. It would be feasible to replace it, as you limit the size of the queue anyway. And of course it's better to get rid of `size_approx` and rely only on the queue's internal limit.
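For illustration, the difference with moodycamel's queue (the capacity and values here are arbitrary):

```cpp
#include "concurrentqueue.h"

int main() {
    // Initial capacity of 32 elements.
    moodycamel::ConcurrentQueue<int> q(32);
    bool a = q.enqueue(1);     // allocates a new block when full; fails only on allocation failure
    bool b = q.try_enqueue(2); // never allocates; returns false when there is no room
    return (a && b) ? 0 : 1;
}
```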
> As to your question of why we decided on concurrentqueue, we've had a great experience with it so far and have found it to be very performant. It doesn't make much of a difference with a ValueCAN, but some devices can accept megabits (or even gigabits) of data with networks like Ethernet and FlexRay so the performance makes a big difference. We may also decide to use thread-pools for decoding if necessary, making multi-producer/consumer a nice feature to have waiting in the wings. If we do find it to be a source of problems, it would not be too difficult to swap it out for another solution or something homegrown that performs a similar task.
I see, understood. The queue implementation also has to be upgraded, as it contains the known issue I noted earlier. According to the original author, it can happen in the case of long-running enqueue/dequeue operations.
Okay that sounds good. I have a patch here I planned to give you which still uses the condition_variable to prevent excessive spinning, but we can wait on that and check that your solution works. Then we can apply it to v0.2.0 and get you guys back onto the mainline.
We'll also get the concurrentqueue implementation updated to the latest.
It runs well so far. However, I forgot to note that I've added additional logging of USB errors via the following patch, as suggested on the libftdi help page:
diff --git a/platform/posix/ftdi.cpp b/platform/posix/ftdi.cpp
index 5f29fd6..49f3911 100644
--- a/platform/posix/ftdi.cpp
+++ b/platform/posix/ftdi.cpp
@@ -184,6 +184,8 @@ void FTDI::readTask() {
auto readBytes = ftdi.read(readbuf, READ_BUFFER_SIZE);
if(readBytes > 0)
readQueue.enqueue_bulk(readbuf, readBytes);
+ if(readBytes < 0)
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
}
}
@@ -194,7 +196,8 @@ void FTDI::writeTask() {
if(!writeQueue.wait_dequeue_timed(writeOp, std::chrono::milliseconds(100)))
continue;
- ftdi.write(writeOp.bytes.data(), (int)writeOp.bytes.size());
- onWrite();
+ auto writeBytes = ftdi.write(writeOp.bytes.data(), (int)writeOp.bytes.size());
+ if (writeBytes < 0)
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
}
-}
\ No newline at end of file
+}
During the last 12 hours, the following error appeared in the log:
Feb 04 11:52:54 ptc_lab_collab3_node15 icsscand[8398]: ValueCAN 3 132830 connected
Feb 04 16:01:46 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 62
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 02
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 5a
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: ValueCAN 3 132830 Warning: An unknown internal error occurred
But I am not sure where it comes from, as there are several places where the unknown error may be raised. Would it also be possible to somehow enhance the logging, e.g. by adding a separate category/type for USB errors as well as the return code of the ftdi read/write call? Also, would it make sense to retry ftdi calls in case of some errors? I am a bit worried about missed messages.
> The valgrind results you posted are quite odd to me, since if you look at `EventManager::isDowngradingErrorsOnCurrentThread` you'll see it's just looking for a bool in an `std::map`. There could be a bug in the version of libstdc++ you're using, but I don't see anything wrong with that code. It may also be a false positive/red herring on valgrind's part.
I don't think it is a false positive. The `EventManager` destructor does not protect `downgradedThreads` from concurrent access, yet it de-allocates the related memory. This could be the root cause of the issue. Likewise, the `destructing` variable is neither protected by a lock nor atomic, yet it can be accessed by multiple threads at the same time. But I'm not sure about this, as I haven't done C++ programming in a long time, and things could have changed in modern standards.
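For illustration, one way it could be guarded; the member names here are my assumptions, not the actual libicsneo ones:

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <thread>

class EventManager {
public:
    ~EventManager() {
        destructing = true; // visible to readers without tearing
        std::lock_guard<std::mutex> lk(downgradedThreadsMutex);
        downgradedThreads.clear();
    }

    bool isDowngradingErrorsOnCurrentThread() {
        std::lock_guard<std::mutex> lk(downgradedThreadsMutex);
        if(destructing)
            return false; // the map is (or is about to be) torn down
        auto it = downgradedThreads.find(std::this_thread::get_id());
        return it != downgradedThreads.end() && it->second;
    }

private:
    std::atomic<bool> destructing{false};
    std::mutex downgradedThreadsMutex;
    std::map<std::thread::id, bool> downgradedThreads;
};
```

This still wouldn't help if a thread calls into the manager after the object itself is destroyed, but it would remove the data race on the map.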
> Okay that sounds good. I have a patch here I planned to give you which still uses the condition_variable to prevent excessive spinning, but we can wait on that and check that your solution works. Then we can apply it to v0.2.0 and get you guys back onto the mainline.
I would still propose getting rid of `size_approx()`, `enqueue()`, and the condition variable used to wake up a thread, in order to make the logic less complex and error-prone. Regarding excessive spinning, would it help if `while (!writeQueue.try_enqueue(WriteOperation(bytes))) {}` contained a small delay, say 10 milliseconds, between tries?
The `write` function can be as simple as:
bool ICommunication::write(const std::vector<uint8_t>& bytes) {
    if(!isOpen()) {
        report(APIEvent::Type::DeviceCurrentlyClosed, APIEvent::Severity::Error);
        return false;
    }

    if(writeBlocks) {
        // Block by retrying until the queue has room again
        while (!writeQueue.try_enqueue(WriteOperation(bytes))) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
        }
    } else {
        // Non-blocking: fail immediately when the queue is full
        if (!writeQueue.try_enqueue(WriteOperation(bytes))) {
            report(APIEvent::Type::TransmitBufferFull, APIEvent::Severity::Error);
            return false;
        }
    }
    return true;
}
The daemon's responsiveness might even be higher in this case.
The `ValueCAN 3 132830 Warning: An unknown internal error occurred` message happened 3 more times. What is the best way to debug that? Assuming the error comes from libftdi, I would say the error code needs to be printed to the log, or `ftdi_get_error_string` used, but I don't see a way to do that as there is no such API in libicsneo. Any suggestions?
Also, I tried to google it; some people invoke `ftdi_read_data` continuously to get rid of some errors, e.g. like here. But I'm not sure if that is the case now.
Sorry for spamming, but the reliability of icsscand is really critical for us.
> would it help if `while (!writeQueue.try_enqueue(WriteOperation(bytes))) {}` contained a small delay, say 10 milliseconds, between tries?
Unfortunately this would drastically affect latency. For some applications, round trip latency is extremely important, and the current system allows for latency in the hundreds of microseconds in some cases.
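If the spinning itself ever becomes a problem, a bounded spin-then-yield would preserve the fast path without a fixed 10 ms penalty on every blocked write. A hypothetical sketch (requires `<thread>`), not something we ship:

```cpp
int spins = 0;
while(!writeQueue.try_enqueue(WriteOperation(bytes))) {
    if(++spins > 1000)               // bound the pure busy-wait
        std::this_thread::yield();   // back off without a fixed sleep
}
```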
> Feb 04 16:01:46 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 62
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 02
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 5a
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
> Feb 04 16:01:53 ptc_lab_collab3_node15 icsscand[8398]: Discarding byte 00
These mean that somehow we missed something (maybe a single byte, maybe more) in the bytestream coming from the device. You're seeing these messages because we somehow missed the "start of packet" identifier. This doesn't necessarily mean you're missing data, as it's likely one of the "heartbeat" status messages from the device that happen very often while connected.
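Conceptually, the resync behaves like this (a simplified sketch, not the actual libicsneo decoder):

```cpp
#include <cstdint>
#include <deque>
#include <iostream>

// After losing sync, drop and log bytes until the next start-of-packet
// marker, then resume normal decoding from there.
void resync(std::deque<uint8_t>& bytes, uint8_t startOfPacket) {
    while(!bytes.empty() && bytes.front() != startOfPacket) {
        std::cout << "Discarding byte " << std::hex << int(bytes.front()) << std::endl;
        bytes.pop_front();
    }
}
```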
Nevertheless, it should not be happening and we should get to the bottom of it. I'm still very curious why we're not seeing this on any of our testing machines. It may be down to the device or kernel, so I have finally acquired an OrangePi and will be doing further testing on that.
> Also, would it make sense to retry ftdi calls in case of some errors? I am a bit worried about missed messages.
The read call will be immediately retried. The write will not, but I think it's best to understand what errors are occurring so the device doesn't receive duplicates.
> The `ValueCAN 3 132830 Warning: An unknown internal error occurred` message happened 3 more times. What is the best way to debug that?
You're correct that this error occurs in a few places. However, there are only two places it occurs during continued operation of icsscand. The first is after `writeQueue.enqueue` in `communication/icommunication.cpp`, and the second is in the calls you added to `platform/posix/ftdi.cpp`.
For debugging, `std::cout << "FTDI returned " << int(ret) << std::endl;` will work just fine and end up in your log. If you can get us the failure code of the FTDI calls, we can look into it further. Looking at the FTDI source, there's only one error that can come from FTDI itself, everything else comes from libusb. To me, this again likely points to a momentary starvation of resources on your OrangePi, but this is certainly something we need to handle going forward.
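If you want readable names rather than raw codes, libusb can translate them. A sketch, assuming the negative code was passed through from libusb (which, per the above, all but one are):

```cpp
#include <libusb-1.0/libusb.h> // header path may vary with your install
#include <iostream>

void logFtdiResult(const char* operation, int ret) {
    if(ret >= 0)
        return; // success (byte count), nothing to log
    std::cout << "FTDI " << operation << " returned " << ret
              << " (" << libusb_error_name(ret) << ")" << std::endl;
}
```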
> I would say the error code needs to be printed to the log, or `ftdi_get_error_string` used, but I don't see a way to do that as there is no such API in libicsneo. Any suggestions?
You're correct, we only flag up from a list of predefined errors in libicsneo at the moment. Having arbitrary data attached to them is definitely something we can consider going forward. Like I said, `std::cout` will work just fine for debugging, though of course we don't include this in any releases.
> Sorry for spamming, but the reliability of icsscand is really critical for us.
It's understandable. In the coming weeks and months we will be getting ready to release Vehicle Spy X, and libicsneo is the hardware driver layer of that. You'll see this repo getting a lot more attention and stringent testing as we do more in that space.
> Unfortunately this would drastically affect latency. For some applications, round trip latency is extremely important, and the current system allows for latency in the hundreds of microseconds in some cases.
I understand. The original concern was about spinning; however, as far as I can see, `wait_dequeue_timed` also spins to block the thread. That's probably why we see increased CPU usage while icsscand is idle (no traffic on the CAN bus). As I understand it now, this is something unavoidable with lock-free queues. Waking up a thread also does not look like a good idea to me, as it will increase idle runs and cause unneeded context switching. Of course, I'll let you decide how to proceed; the only thing I worry about is the absence of deadlocks. JFYI, icsscand still runs well with my patch.
> For debugging, `std::cout << "FTDI returned " << int(ret) << std::endl;` will work just fine and end up in your log.
Thanks. I added that in each place where the unknown error might be reported while processing CAN data.
> This doesn't necessarily mean you're missing data, as it's likely one of the "heartbeat" status messages from the device that happen very often while connected.
Is it possible to prove that somehow internally in icsscand?
I got the first error message:
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: FTDI returned = -7
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 01
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 00
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 03
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 02
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 00
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: message repeated 7 times: [ Discarding byte 00]
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte af
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte b7
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte f3
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 64
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte c9
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 05
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 00
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 00
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 01
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: Discarding byte 00
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: message repeated 2 times: [ Discarding byte 00]
Feb 6 06:46:06 ptc_lab_collab3_node15 icsscand[2722]: ValueCAN 3 132830 Warning: An unknown internal error occurred.
As per the libusb error codes, it looks to be `LIBUSB_ERROR_TIMEOUT`. However, there is nothing suspicious in the `dmesg` output. I forgot to add a log message indicating whether it was a read or a write, but maybe this is already clear to you; I'm adding it now.
It looks to me as though libusb will return partial reads in case of a timeout; however, libftdi does not handle this properly and throws out that data.
Per the libusb docs:
> Also check `transferred` when dealing with a timeout error code. libusb may have to split your transfer into a number of chunks to satisfy underlying O/S requirements, meaning that the timeout may expire after the first few chunks have completed. libusb is careful not to lose any data that may have been transferred; do not assume that timeout conditions indicate a complete lack of I/O.
However, the latest libftdi handles this as:
ret = libusb_bulk_transfer (ftdi->usb_dev, ftdi->out_ep, ftdi->readbuffer, ftdi->readbuffer_chunksize, &actual_length, ftdi->usb_read_timeout);
if (ret < 0)
    ftdi_error_return(ret, "usb bulk read failed");
I will have to dig into this properly and make modifications to libftdi.
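Something along these lines, a sketch against that libftdi call site, untested:

```cpp
ret = libusb_bulk_transfer(ftdi->usb_dev, ftdi->out_ep, ftdi->readbuffer,
                           ftdi->readbuffer_chunksize, &actual_length,
                           ftdi->usb_read_timeout);
// A timeout with actual_length > 0 still delivered data; only treat the
// transfer as failed when nothing at all came back.
if (ret < 0 && !(ret == LIBUSB_ERROR_TIMEOUT && actual_length > 0))
    ftdi_error_return(ret, "usb bulk read failed");
```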
For the sake of your testing, we can change `platform/posix/ftdi.cpp:64` from `ftdi.setReadTimeout(100)` to something like `ftdi.setReadTimeout(1000)`. I believe the only negative effect that you'd see by doing this is that device closing times would be longer, and I'm sure this is not your greatest concern.
It has been running without issues for about a week now. The final patch used for testing is this one:
diff --git a/communication/icommunication.cpp b/communication/icommunication.cpp
index 9b1680a..4056d24 100644
--- a/communication/icommunication.cpp
+++ b/communication/icommunication.cpp
@@ -1,3 +1,4 @@
+#include <iostream>
#include "icsneo/communication/icommunication.h"
using namespace icsneo;
@@ -45,19 +46,20 @@ bool ICommunication::write(const std::vector<uint8_t>& bytes) {
}
if(writeBlocks) {
- std::unique_lock<std::mutex> lk(writeMutex);
- if(writeQueue.size_approx() > writeQueueSize)
- writeCV.wait(lk);
+ while (!writeQueue.try_enqueue(WriteOperation(bytes))) {}
} else {
if(writeQueue.size_approx() > writeQueueSize) {
report(APIEvent::Type::TransmitBufferFull, APIEvent::Severity::Error);
return false;
- }
+ } else {
+ bool ret = writeQueue.enqueue(WriteOperation(bytes));
+ if(!ret) {
+ std::cout << "writeQueue.enqueue() failed" << std::endl;
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
+ }
+ return ret;
+ }
}
- bool ret = writeQueue.enqueue(WriteOperation(bytes));
- if(!ret)
- report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
-
- return ret;
-}
\ No newline at end of file
+ return true;
+}
diff --git a/platform/posix/ftdi.cpp b/platform/posix/ftdi.cpp
index 5f29fd6..fbec27e 100644
--- a/platform/posix/ftdi.cpp
+++ b/platform/posix/ftdi.cpp
@@ -61,7 +61,7 @@ bool FTDI::open() {
return false;
}
- ftdi.setReadTimeout(100);
+ ftdi.setReadTimeout(1000);
ftdi.setWriteTimeout(1000);
ftdi.reset();
ftdi.setBaudrate(500000);
@@ -184,6 +184,10 @@ void FTDI::readTask() {
auto readBytes = ftdi.read(readbuf, READ_BUFFER_SIZE);
if(readBytes > 0)
readQueue.enqueue_bulk(readbuf, readBytes);
+ if(readBytes < 0) {
+ std::cout << "FTDI read returned = " << int(readBytes) << std::endl;
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
+ }
}
}
@@ -194,7 +198,10 @@ void FTDI::writeTask() {
if(!writeQueue.wait_dequeue_timed(writeOp, std::chrono::milliseconds(100)))
continue;
- ftdi.write(writeOp.bytes.data(), (int)writeOp.bytes.size());
- onWrite();
+ auto writeBytes = ftdi.write(writeOp.bytes.data(), (int)writeOp.bytes.size());
+ if (writeBytes < 0) {
+ std::cout << "FTDI write returned = " << int(writeBytes) << std::endl;
+ report(APIEvent::Type::Unknown, APIEvent::Severity::Error);
+ }
}
-}
\ No newline at end of file
+}
As well, the `concurrentqueue` library was updated to the latest version.
The last thing I am a bit concerned about is https://github.com/intrepidcs/icsscand/issues/3.
There are messages like the ones below that have started to appear in syslog continuously. There are already around ~25,000 messages like that, and their frequency is increasing. We are using recent versions of all CAN repositories.
The outcome is that CAN messages can't be received from the device.
CAN repos versions:
Hardware:
OrangePI One
Linux kernel: 4.13.15-sunxi #1 SMP Tue Nov 21 23:35:46 MSK 2017 armv7l armv7l armv7l GNU/Linux
Can you please suggest how to debug it further?