rand256 / valetudo

Valetudo RE - experimental vacuum software, cloud free
Apache License 2.0
671 stars 74 forks source link

Extend and improve logging #140

Open TheAssassin opened 4 years ago

TheAssassin commented 4 years ago

I think Valetudo should provide much more logging to allow for analyzing and debugging problems. Right now, Valetudo only really logs if some commands(?) time out (i.e., they're retried more than 25 times), and during startup.

There should be more messages over the entire code base. They could be debug messages, so they're not visible by default. Using e.g., an environment variable, this debug logging could then be turned on when analyzing problems like #139.

pidator commented 4 years ago

Keep in mind, that there's still the random-reset-issue that seems to be related with writes or changes of files on the system partition. Nobody has found the actual reason for this behavior, so everything is just guessing. For this reason, some users disable logging completley and don't want to get their robots messed up.

TheAssassin commented 4 years ago

@pidator I've read about it in this other issue where you reported it. Hence the proposal to use an environment variable to turn on the logging. I imagine the workflow as to ssh on the bot, then stop valetudo and then run it manually like env VALETUDO_DEBUG=1 /usr/local/bin/valetudo.

Another somewhat simple but powerful way to avoid writing to the system partition (and also to /run/shm, which, as a tmpfs "ramdisk", is limited in size as well) is to use an external service hosted on some other computer. syslog was built that way for instance, and despite it's super aged, it works fine for a lot of equipment.

Edit: I've had the bot for < 1 week now, and I've reflashed it quite often so far. I can't tell about any long term problems.

@pidator regarding your random-reset issue, how about running collectd on the bot collecting some stats on the disk and RAM usage? That could also provide valuable insights on why it resets.

pidator commented 4 years ago

... is to use an external service hosted on some other computer. syslog was built that way ...

I had configured syslog at the beginning to send *.* all data to my syslogserver, but most of the (kernel) messages were not very helpfull. It seems I forgot to enable syslog after a flashing at the beginning of february, so I can't tell anything about messages that are send just before my random-reset this week. I will enable syslog again and wait for the next reset.

@pidator regarding your random-reset issue, how about running collectd on the bot collecting some stats on the disk and RAM usage? That could also provide valuable insights on why it resets.

As noboy exactly knows whats triggering the random reset, this could indeed be helpfull to figure out more about the robot state before an reset. Perhaps this could led to new findings about the issue. Have you already got collectd installed at the bot? I would also help with monitoring my robot!

pidator commented 4 years ago

I will enable syslog again and wait for the next reset.

root@rockrobo:~# echo "*.* @192.168.xxx.xx:514" >> /etc/rsyslog.conf
root@rockrobo:~# service rsyslog restart
rsyslog stop/waiting
rsyslog start/running, process 29172

syslog enabled on the robot and on server side you'll begin to see messages like

Feb 28 22:58:04 rockrobo kernel: [71409.653749] RTL871X: SetHwReg8188E:(HW_VAR_CHECK_TXBUF)TXBUF Empty(1) in 0 ms
Feb 28 22:58:04 rockrobo kernel: [71409.653949] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_SLEEP] macid=0, org reg_0x48c=0x00000000
Feb 28 22:58:04 rockrobo kernel: [71410.020363] RTL871X: Drop duplicate management frame with seq_num = 1100.
Feb 28 22:58:04 rockrobo kernel: [71410.025367] RTL871X: Drop duplicate management frame with seq_num = 1100.
Feb 28 22:58:05 rockrobo kernel: [71411.102767] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_WAKEUP] macid=0, org reg_0x48c=0x00000001
Feb 28 22:58:05 rockrobo kernel: [71411.104989] RTL871X: survey done event(10) band:0 for wlan0
Feb 28 22:58:05 rockrobo kernel: [71411.105162] RTL871X: rtw_indicate_scan_done(wlan0)
Feb 28 22:58:34 rockrobo kernel: [71439.766207] RTL871X: cfg80211_rtw_scan(wlan0)
Feb 28 22:58:34 rockrobo kernel: [71439.766628] RTL871X: SetHwReg8188E:(HW_VAR_CHECK_TXBUF)TXBUF Empty(1) in 0 ms
Feb 28 22:58:34 rockrobo kernel: [71439.766828] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_SLEEP] macid=0, org reg_0x48c=0x00000000
Feb 28 22:58:34 rockrobo kernel: [71440.125588] RTL871X: Drop duplicate management frame with seq_num = 1125.
Feb 28 22:58:34 rockrobo kernel: [71440.135447] RTL871X: Drop duplicate management frame with seq_num = 1126.
Feb 28 22:58:35 rockrobo kernel: [71441.212578] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_WAKEUP] macid=0, org reg_0x48c=0x00000001
Feb 28 22:58:35 rockrobo kernel: [71441.214001] RTL871X: survey done event(11) band:0 for wlan0
Feb 28 22:58:35 rockrobo kernel: [71441.214166] RTL871X: rtw_indicate_scan_done(wlan0)
Feb 28 22:59:04 rockrobo kernel: [71469.880320] RTL871X: cfg80211_rtw_scan(wlan0)
Feb 28 22:59:04 rockrobo kernel: [71469.880736] RTL871X: SetHwReg8188E:(HW_VAR_CHECK_TXBUF)TXBUF Empty(1) in 0 ms
Feb 28 22:59:04 rockrobo kernel: [71469.880938] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_SLEEP] macid=0, org reg_0x48c=0x00000000
Feb 28 22:59:05 rockrobo kernel: [71471.332548] RTL871X: SetHwReg8188E(wlan0): [HW_VAR_MACID_WAKEUP] macid=0, org reg_0x48c=0x00000001
Feb 28 22:59:05 rockrobo kernel: [71471.334012] RTL871X: survey done event(13) band:0 for wlan0
Feb 28 22:59:05 rockrobo kernel: [71471.334175] RTL871X: rtw_indicate_scan_done(wlan0)
Feb 28 22:59:34 rockrobo kernel: [71499.992584] RTL871X: cfg80211_rtw_scan(wlan0)