spectraphilic / wasp_sketches

Waspmote sketches develoded by our group
1 stars 0 forks source link

motes freezing #61

Closed jdavid closed 5 years ago

jdavid commented 5 years ago

Simonf [5:17 PM] @jdavid @John Hulth, Sven recovered two stations, and copied in this zip file the content of two stations. He brought both indoor to access the SDs. For both stations, the program stopped completely. I had loglevel in warn, so there is very little info Zip | Sven finse_stations_sd.zip 978 kB Zip — Click to download This is quite puzzling. Hard to figure what has happened (edited) one stopped thursday, the other a lot earlier on March 7

Simonf [8:43 PM] for the two stations, the weather data do not show a common patterns that can explain the station to stop (for one it was above freezing, RH 100% and WS 10m/s; for the other Tair=-7, RH=95% and WS=5.6m/s). I would then favor a code loophole somewhere but what is strange is that a whole batch of them died at the same time on march 7 and that fw.003 and fw-010 also died the same day

jdavid commented 5 years ago

Simonf [9:51 AM] Hello David, did you get a chance to look at the files? from the two motes?

jdavid [9:51 AM] hi, not yet, going to look now

Simonf [9:51 AM] to see if you can get further clue than I do in what lead the motes to stop running ok, maybe we can discuss later about it see if we can narrow down the issue

jdavid [9:56 AM] so all stations died?

Simonf [9:56 AM] well, all the one that can send data directly to the meshpi yes sven is there and will relaunch two of them what is puzzling is the number of station that stopped working on march 8 march 7 all at once some of them got their connection cut off, but still many with direct connection stopped at once. So I though this could be weather. however, two stopped on friday, and the weather had nothing related to March 7 I opened an issue about reviewing what we consider debug, info and warn I feel we put a whole lot in debug and discarded the hierarchy anyhow, I am curious to hear your diagnostic, because I can hardly think of clear hypothesis as to why they stopped I have launched one here with high frequency sampling rate with the hope that if it is an algo loophole we'll fall into it ...

jdavid [10:16 AM] the loop was configured to run every 5 minutes?

Simonf [10:16 AM] yes network every 20 and gps 48h i think

jdavid [10:18 AM] looking at the logs I think gps is 24h 1550845270.830 DEBUG command run gps 24 0

Simonf [10:19 AM] that might be an earlier setting. because I have 48h from the journal (in which I copy paste the print menu)

jdavid [10:20 AM] ah yes, that was at the very beginning of the log file

Simonf [10:20 AM] do you think we could save the timestamp as human readable in the log file. I think it would be easier to read

jdavid [10:21 AM] we're so sort on memory right now that would add may be not too much, but probably we need it for other stuff

Simonf [10:22 AM] ok, that was just a secondary consideration first that GPS is not working. but the second one has no GPS. so it is should not be related to that

Simonf [10:39 AM] the time stamp had been updated at 8:18 UTC on the old_met one (fw-004) could it have been smthg to do with that? the last frame sent from that one was 8:00 the other has unfortunately updated its time before Sven got the SD out of it

jdavid [10:43 AM] I've pushed a script that takes a log file and outputs the same log file with the time converted to easier to read format log_readable.py

Simonf [10:57 AM] what is the order of loglevel? info is less than debug, right

jdavid [11:00 AM] https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/Coroutines/Coroutines.h#L119 debug is more detailed than info

Simonf [11:00 AM] I see I will change the measurement to trace then

jdavid [11:01 AM] I think we should always use debug level, until the motes are proven to run stable I think there's not any message with trace level, so not useful to run at that level I mean not different from debug well there're some, but commented git grep "trace("

jdavid [11:11 AM] which is the line in the old_met log file where the time has been updated? I don't see it

Simonf [11:11 AM] in the time.txt file

jdavid [11:11 AM] ah

Simonf [11:11 AM] last modified, and the timestamp itself as network is every 20min it should have send data at 8:20 but did not

jdavid [11:13 AM] good catch, the TMP file is modified at 08:15 and TIME.TXT is at 08:18 it means that last loop lasted 3 minutes

Simonf [11:14 AM] pretty weird

jdavid [11:14 AM] that's too much no? without network or gps

Simonf [11:14 AM] the watchdog?

jdavid [11:14 AM] it doesn't mean that the time changed the TIME.TXT file is saved at the end of every loop

Simonf [11:15 AM] i see then it was the DS2 and the lagopus as well as the MB in that loop

jdavid [11:17 AM] don't know how much it takes to get the measurements from that sensors, but it ~ 3min looks too much the time should not have changed, as it can only change from GPS or the network, but there's none in that loop

Simonf [11:21 AM] it takes 28s for those sensor to be read in one loop

jdavid [11:22 AM] from old_met line 1185 the mote starts at about 2019-02-27 15:11:44 and from TIME.TXT it ends at 2019-03-07 07:18:01 Untitled In [37]: ((datetime.datetime.utcfromtimestamp(1551943081) - datetime.datetime.utcfromtimestamp(1551280304.595)).total_seconds() / 60) Out[37]: 11046.273416666667 so it runs for ~ 11K minutes, if there's a loop every 5 minutes then that's 2209 loops so in theory it has not yet reached the 5.000 threshold where the mote automatically reboots 2 hypothesis : one is that after an automatic reboot there's somehow garbage in the uart and the mote enters the interactive command mode, and never leaves it that's in my todo, to use set the watchdog in the interactive mode so it doesn't stuck get forever but apparently the mote doesn't reboot there the other hypothesis is that the mote goes to sleep and never wakes up this may happen if there's a bug in the way the next wake up is calculated and the watchdog doesn't work as expected it looks quite unlikely unless the rtc malfunctions somehow but that's strange as well, since all motes stop at the same time, it looks something software

Simonf [11:29 AM] in the sleep time is there a safety net that is does go beyond 2 times the sampling interval?

jdavid [11:29 AM] the watchdog is the RTC as well, so if it malfunctions both features will be affected yes, let me check exactly

Simonf [11:30 AM] what is that interactive mode?

jdavid [11:31 AM] I mean the menu where you type commands

Simonf [11:31 AM] I see. this has never been an issue and we started them over and over

jdavid [11:31 AM] yes, but it's the only place where we don't have the watchdog running it's all in the https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/WaspUIO/WaspUIO.cpp#L214

Simonf [11:32 AM] the watchdog is 1min?

jdavid [11:32 AM] the 1st line is where we save the time to TIME.TXT, this was successful as we have seen the watchdog is in line https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/WaspUIO/WaspUIO.cpp#L247 it's the time expected for the next wakeup plus 2minutes if it goes to sleep at :18 then the watchdog should be 4 minutes after saving the time there's the auto reboot code, from the time it looks like it was far from been triggered, there's an info message, but we don't have it

Simonf [11:34 AM] but it hardly explain why all of them would get there at the same time

jdavid [11:35 AM] yes, that could only be explained because a bad bug in the way the next wake up is calculated we can test again whether the watchdog actually works the function that calculates when to trigger the next alarm is https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/WaspUIO/WaspUIO.cpp#L184 I already reviewed it the last time we had issues, since it's always the 1st suspect do you know the battery level when the mote stopped and when it was recovered?

Simonf [11:39 AM] full

jdavid [11:39 AM] no activity, so like it was sleeping forever

Simonf [11:39 AM] yes. though it makes no sense why woudl that happen to multiple of them the fw-003 (flux) is running again sending data

jdavid [12:04 PM] maybe we should reduce the number of loops to auto reboot the mote

Simonf [12:04 PM] like every 1000?

jdavid [12:04 PM] yes it won't help us to understand what's going on, but may work better, the idea being that the mote becomes unstable after a long time running

Simonf [12:07 PM] if you take the other mote, I find a bulk part of 7600 loops which means it had rebooted at least once

jdavid [12:07 PM] yes battery wise the motes seem to work fine, battery levels are usually high, no?

Simonf [12:10 PM] yes very good there

jdavid [12:12 PM] ok I'm going to remove some old code that was there to save battery, but which was deactivated long time ago, it should save a bit of program memory

Simonf [12:13 PM] ok the battery of the fw-003 is still full, so it did not drain it during the time it was off which we had in the past and all sensors are back working

jdavid [12:15 PM] when it died probably it was stopped, not using power, the battery probably charged

Simonf [12:15 PM] correct

jdavid [12:15 PM] the thing is which was the battery level just before it died

Simonf [12:15 PM] 100

jdavid [12:15 PM] ok

Simonf [12:21 PM] I am going to eat. let me know if you have any new ideas

jdavid [12:52 PM] Untitled $ ls -l --time-style=full-iso old_met/ total 96 drwxr-xr-x 2 jdavid jdavid 4096 2019-02-22 15:20:58.000000000 +0100 DATA -rw-r--r-- 1 jdavid jdavid 80780 2019-03-07 07:00:20.000000000 +0100 LOG.TXT -rw-r--r-- 1 jdavid jdavid 4 2019-03-07 08:00:10.000000000 +0100 QSTART.BIN -rw-r--r-- 1 jdavid jdavid 10 2019-03-07 08:18:00.000000000 +0100 TIME.TXT -rw-r--r-- 1 jdavid jdavid 136 2019-03-07 08:15:26.000000000 +0100 TMP.TXT Collapse the TMP file is written at 08:15:26 so it took 26s to read the sensors, that matches the 28s you said (edited) what the mote did for 2m34s, from 08:15:26 to 08:18:00 ?

Simonf [1:01 PM] LOG_uw-test.TXT 1999-12-31 00:00:10.448000 DEBUG command loglevel 5 1999-12-31 00:00:16.818000 DEBUG command board 1 1999-12-31 00:00:23.990000 DEBUG command name sw-012 1999-12-31 00:00:26.065000 DEBUG command print 1999-12-31 00:00:43.125000 DEBUG command time 19:03:18:10:56:25 2019-03-18 10:56:28.392000 DEBUG command exit 2019-03-18 10:56:28.419000 DEBUG No space left in frame 2019-03-18 10:56:28.534000 INFO Frame saved to SD (0 fields in 21 bytes) 2019-03-18 10:56:28.556000 INFO Id : 0000000000000000 Version=J Name=sw-012 2019-03-18 10:56:28.581000 INFO Battery : Lithium-ion (100 %) 2019-03-18 10:56:28.602000 INFO Hardware : board=lemming SD=1 GPS=1 2019-03-18 10:56:28.629000 INFO Frames : payload-size=0 frame-size=0 encryption=disabled 2019-03-18 10:56:28.652000 INFO Log : level=DEBUG output=USB, SD 2019-03-18 10:56:28.680000 INFO Actions : net=10h10 bat=10h10 gps=10h10 acc=10h10 ctd=10h10 ds2=10h10 ds1820=10h10 bme76=10h10 mb=10h10 ws100=10h10 as7263=10h 2019-03-18 10:56:28.711000 INFO Boot done, go to sleep 2019-03-18 10:56:28.764000 DEBUG Next alarm at 00:19:40:00 in 524 minutes 2000-01-03 23:22:58 WARN Wrong time detected 2019-03-18 10:56:28.003000 INFO Time loaded from TIME.TXT 2019-03-18 10:56:36.549000 DEBUG command run bat 0 5 2019-03-18 10:56:41.977000 DEBUG command run net 0 0 2019-03-18 10:56:45.363000 DEBUG command run acc 0 5 2019-03-18 10:56:48.146000 DEBUG command run gps 1 0 2019-03-18 10:56:51.573000 DEBUG command run ctd 0 0 2019-03-18 10:56:55.013000 DEBUG command run ws100 0 0 2019-03-18 10:57:00.111000 DEBUG command run atmos 0 0 2019-03-18 10:57:06.553000 DEBUG command run as7263 0 0 2019-03-18 10:57:13.110000 DEBUG command run as7265 0 0 2019-03-18 10:57:18.234000 DEBUG command run ds2 0 0 2019-03-18 10:57:23.473000 DEBUG command run ds1820 0 0 2019-03-18 10:57:26.519000 DEBUG command run bme 0 5 2019-03-18 10:57:33.123000 DEBUG command run vl 0 5 2019-03-18 10:57:37.083000 DEBUG command run mlx 0 5 2019-03-18 10:57:43.024000 DEBUG command run bme76 0 5 2019-03-18 10:57:45.813000 DEBUG command run mb 0 5 2019-03-18 10:57:47.401000 DEBUG command print 2019-03-18 10:57:51.623000 DEBUG command run tmp 0 5 2019-03-18 10:57:55.674000 DEBUG command print 2019-03-18 10:58:06.567000 DEBUG command i2c 2019-03-18 10:58:09.804000 DEBUG command exit 2019-03-18 10:58:09.831000 DEBUG No space left in frame 2019-03-18 10:58:09.916000 INFO Frame saved to SD (0 fields in 21 bytes) 2019-03-18 10:58:09.939000 INFO Id : 175E18FDC337DEF1 Version=J Name=sw-012 2019-03-18 10:58:09.964000 INFO Battery : Lithium-ion (100 %) 2019-03-18 10:58:09.991000 INFO Hardware : board=lemming SD=1 GPS=1 2019-03-18 10:58:10.014000 INFO Frames : payload-size=0 frame-size=0 encryption=disabled 2019-03-18 10:58:10.039000 INFO Log : level=DEBUG output=USB, SD 2019-03-18 10:58:10.065000 INFO Actions : bat=5m gps=1h acc=5m bme76=5m mb=5m bme=5m mlx=5m tmp=5m vl=5m 2019-03-18 10:58:10.090000 INFO Boot done, go to sleep 2019-03-18 10:58:10.159000 DEBUG Next alarm at 00:11:00:00 in 2 minutes 2019-03-18 11:00:00.001000 INFO Loop 1 battery=Lithium-ion (100 %) 2019-03-18 11:00:00.042000 DEBUG No space left in frame 2019-03-18 11:00:00.070000 DEBUG No space left in frame 2019-03-18 11:00:00.090000 DEBUG No space left in frame 2019-03-18 11:00:00.110000 DEBUG No space left in frame 2019-03-18 11:00:00.160000 INFO ACC x=77, y=-8, z=948 2019-03-18 11:00:00.181000 DEBUG No space left in frame 2019-03-18 11:00:00.201000 DEBUG No space left in frame 2019-03-18 11:00:00.221000 DEBUG No space left in frame 2019-03-18 11:00:00.298000 DEBUG BME280 Temperature: 23.15 Celsius Degrees 2019-03-18 11:00:00.321000 DEBUG BME280 Humidity : 18.82 %RH 2019-03-18 11:00:00.341000 DEBUG BME280 Pressure : 98835.80 Pa 2019-03-18 11:00:00.363000 DEBUG No space left in frame 2019-03-18 11:00:00.392000 DEBUG No space left in frame 2019-03-18 11:00:00.413000 DEBUG No space left in frame 2019-03-18 11:00:00.491000 DEBUG BME280 Temperature: 23.04 Celsius Degrees 2019-03-18 11:00:00.515000 DEBUG BME280 Humidity : 18.06 %RH 2019-03-18 11:00:00.537000 DEBUG BME280 Pressure : 98808.80 Pa 2019-03-18 11:00:00.559000 DEBUG No space left in frame 2019-03-18 11:00:00.580000 DEBUG No space left in frame 2019-03-18 11:00:00.600000 DEBUG No space left in frame 2019-03-18 11:00:00.622000 DEBUG MLX 90614 object=23.02 2019-03-18 11:00:00.643000 DEBUG MLX 90614 ambient=22.91 2019-03-18 11:00:00.665000 DEBUG No space left in frame 2019-03-18 11:00:00.692000 DEBUG No space left in frame 2019-03-18 11:00:00.713000 DEBUG No space left in frame 2019-03-18 11:00:00.738000 DEBUG TMP102 22.56 Celsius 2019-03-18 11:00:00.759000 DEBUG No space left in frame 2019-03-18 11:00:00.780000 DEBUG No space left in frame 2019-03-18 11:00:00.800000 DEBUG No space left in frame 2019-03-18 11:00:00.924000 DEBUG Distance 0 (mm): 0 2019-03-18 11:00:01.147000 DEBUG Distance 1 (mm): 0 2019-03-18 11:00:01.369000 DEBUG Distance 2 (mm): 0 2019-03-18 11:00:01.591000 DEBUG Distance 3 (mm): 0 2019-03-18 11:00:01.812000 DEBUG Distance 4 (mm): 0 2019-03-18 11:00:02.034000 DEBUG Distance 5 (mm): 0 2019-03-18 11:00:02.262000 DEBUG Distance 6 (mm): 0 2019-03-18 11:00:02.484000 DEBUG Distance 7 (mm): 0 2019-03-18 11:00:02.707000 DEBUG Distance 8 (mm): 0 2019-03-18 11:00:02.929000 DEBUG Distance 9 (mm): 0 2019-03-18 11:00:03.151000 DEBUG Distance 10 (mm): 0 2019-03-18 11:00:03.372000 DEBUG Distance 11 (mm): 0 2019-03-18 11:00:03.594000 DEBUG Distance 12 (mm): 0 2019-03-18 11:00:03.817000 DEBUG Distance 13 (mm): 0 2019-03-18 11:00:04.039000 DEBUG Distance 14 (mm): 0 2019-03-18 11:00:04.261000 DEBUG Distance 15 (mm): 0 2019-03-18 11:00:04.482000 DEBUG Distance 16 (mm): 0 2019-03-18 11:00:04.704000 DEBUG Distance 17 (mm): 0 2019-03-18 11:00:04.927000 DEBUG Distance 18 (mm): 0 2019-03-18 11:00:05.154000 DEBUG Distance 19 (mm): 0 2019-03-18 11:00:05.377000 DEBUG Distance 20 (mm): 0 2019-03-18 11:00:05.599000 DEBUG Distance 21 (mm): 0 2019-03-18 11:00:05.821000 DEBUG Distance 22 (mm): 0 2019-03-18 11:00:06.042000 DEBUG Distance 23 (mm): 0 2019-03-18 11:00:06.264000 DEBUG Distance 24 (mm): 0 2019-03-18 11:00:06.487000 DEBUG Distance 25 (mm): 0 2019-03-18 11:00:06.709000 DEBUG Distance 26 (mm): 0 2019-03-18 11:00:06.931000 DEBUG Distance 27 (mm): 0 2019-03-18 11:00:07.152000 DEBUG Distance 28 (mm): 0 2019-03-18 11:00:07.374000 DEBUG Distance 29 (mm): 0 2019-03-18 11:00:07.595000 DEBUG No space left in frame 2019-03-18 11:00:07.622000 DEBUG No space left in frame 2019-03-18 11:00:07.643000 DEBUG No space left in frame 2019-03-18 11:00:11.565000 WARN readMaxbotixSerial: fail to read MB7389 2019-03-18 11:00:11.589000 DEBUG GPS start 2019-03-18 11:02:48.860000 ERROR GPS Timeout 2019-03-18 11:02:48.887000 INFO Loop done in 168887 ms (CPU time 168887 ms). 2019-03-18 11:02:48.963000 DEBUG Next alarm at 00:11:05:00 in 3 minutes 2019-03-18 11:05:00.001000 INFO Loop 2 battery=Lithium-ion (100 %) 2019-03-18 11:05:00.024000 DEBUG No space left in frame 2019-03-18 11:05:00.047000 DEBUG No space left in frame 2019-03-18 11:05:00.068000 DEBUG No space left in frame 2019-03-18 11:05:00.097000 DEBUG No space left in frame 2019-03-18 11:05:00.148000 INFO ACC x=71, y=8, z=932 2019-03-18 11:05:00.171000 DEBUG No space left in frame 2019-03-18 11:05:00.192000 DEBUG No space left in frame 2019-03-18 11:05:00.213000 DEBUG No space left in frame 2019-03-18 11:05:00.291000 DEBUG BME280 Temperature: 23.61 Celsius Degrees 2019-03-18 11:05:00.315000 DEBUG BME280 Humidity : 19.21 %RH 2019-03-18 11:05:00.339000 DEBUG BME280 Pressure : 98851.32 Pa 2019-03-18 11:05:00.360000 DEBUG No space left in frame 2019-03-18 11:05:00.381000 DEBUG No space left in frame 2019-03-18 11:05:00.401000 DEBUG No space left in frame 2019-03-18 11:05:00.483000 DEBUG BME280 Temperature: 22.87 Celsius Degrees 2019-03-18 11:05:00.507000 DEBUG BME280 Humidity : 19.64 %RH 2019-03-18 11:05:00.530000 DEBUG BME280 Pressure : 98823.35 Pa 2019-03-18 11:05:00.550000 DEBUG No space left in frame 2019-03-18 11:05:00.570000 DEBUG No space left in frame 2019-03-18 11:05:00.590000 DEBUG No space left in frame 2019-03-18 11:05:00.612000 DEBUG MLX 90614 object=22.69 2019-03-18 11:05:00.633000 DEBUG MLX 90614 ambient=22.45 2019-03-18 11:05:00.654000 DEBUG No space left in frame 2019-03-18 11:05:00.675000 DEBUG No space left in frame 2019-03-18 11:05:00.697000 DEBUG No space left in frame 2019-03-18 11:05:00.727000 DEBUG TMP102 22.37 Celsius 2019-03-18 11:05:00.748000 DEBUG No space left in frame 2019-03-18 11:05:00.769000 DEBUG No space left in frame 2019-03-18 11:05:00.790000 DEBUG No space left in frame 2019-03-18 11:05:00.914000 DEBUG Distance 0 (mm): 0 2019-03-18 11:05:01.137000 DEBUG Distance 1 (mm): 0 2019-03-18 11:05:01.359000 DEBUG Distance 2 (mm): 0 2019-03-18 11:05:01.581000 DEBUG Distance 3 (mm): 0 2019-03-18 11:05:01.802000 DEBUG Distance 4 (mm): 0 2019-03-18 11:05:02.024000 DEBUG Distance 5 (mm): 0 2019-03-18 11:05:02.247000 DEBUG Distance 6 (mm): 0 2019-03-18 11:05:02.469000 DEBUG Distance 7 (mm): 0 2019-03-18 11:05:02.697000 DEBUG Distance 8 (mm): 0 2019-03-18 11:05:02.919000 DEBUG Distance 9 (mm): 0 2019-03-18 11:05:03.141000 DEBUG Distance 10 (mm): 0 2019-03-18 11:05:03.362000 DEBUG Distance 11 (mm): 0 2019-03-18 11:05:03.584000 DEBUG Distance 12 (mm): 0 2019-03-18 11:05:03.807000 DEBUG Distance 13 (mm): 0 2019-03-18 11:05:04.029000 DEBUG Distance 14 (mm): 0 2019-03-18 11:05:04.251000 DEBUG Distance 15 (mm): 0 2019-03-18 11:05:04.472000 DEBUG Distance 16 (mm): 0 2019-03-18 11:05:04.694000 DEBUG Distance 17 (mm): 0 2019-03-18 11:05:04.917000 DEBUG Distance 18 (mm): 0 2019-03-18 11:05:05.139000 DEBUG Distance 19 (mm): 0 2019-03-18 11:05:05.367000 DEBUG Distance 20 (mm): 0 2019-03-18 11:05:05.589000 DEBUG Distance 21 (mm): 0 2019-03-18 11:05:05.811000 DEBUG Distance 22 (mm): 0 2019-03-18 11:05:06.032000 DEBUG Distance 23 (mm): 0 2019-03-18 11:05:06.254000 DEBUG Distance 24 (mm): 0 2019-03-18 11:05:06.477000 DEBUG Distance 25 (mm): 0 2019-03-18 11:05:06.699000 DEBUG Distance 26 (mm): 0 2019-03-18 11:05:06.921000 DEBUG Distance 27 (mm): 0 2019-03-18 11:05:07.142000 DEBUG Distance 28 (mm): 0 2019-03-18 11:05:07.364000 DEBUG Distance 29 (mm): 0 2019-03-18 11:05:07.585000 DEBUG No space left in frame 2019-03-18 11:05:07.607000 DEBUG No space left in frame 2019-03-18 11:05:07.628000 DEBUG No space left in frame 2019-03-18 11:05:11.557000 WARN readMaxbotixSerial: fail to read MB7389 2019-03-18 11:05:11.580000 INFO Loop done in 11580 ms (CPU time 11580 ms). 2019-03-18 11:05:11.651000 DEBUG Next alarm at 00:11:10:00 in 5 minutes 2019-03-18 11:10:00.001000 INFO *** Loop 3 battery=Lithium-ion (100 %) 2019-03-18 11:10:00.035000 DEBUG No space left in frame 2019-03-18 11:10:00.057000 DEBUG No space left in frame 2019-03-18 11:10:00.078000 DEBUG No space left in frame 2019-03-18 11:10:00.101000 DEBUG No space left in frame 2019-03-18 11:10:00.152000 INFO ACC x=67, y=0, z=958 2019-03-18 11:10:00.174000 DEBUG No space left in frame 2019-03-18 11:10:00.201000 DEBUG No space left in frame 2019-03-18 11:10:00.222000 DEBUG No space left in frame 2019-03-18 11:10:00.300000 DEBUG BME280 Temperature: 23.38 Celsius Degrees 2019-03-18 11:10:00.324000 DEBUG BME280 Humidity : 19.48 %RH 2019-03-18 11:10:00.345000 DEBUG BME280 Pressure : 98868.42 Pa 2019-03-18 11:10:00.366000 DEBUG No space left in frame 2019-03-18 11:10:00.387000 DEBUG No space left in frame 2019-03-18 11:10:00.408000 DEBUG No space left in frame 2019-03-18 11:10:00.486000 DEBUG BME280 Temperature: 22.85 Celsius Degrees 2019-03-18 11:10:00.511000 DEBUG BME280 Humidity : 19.08 %RH 2019-03-18 11:10:00.537000 DEBUG BME280 Pressure : 98827.40 Pa 2019-03-18 11:10:00.560000 DEBUG No space left in frame 2019-03-18 11:10:00.582000 DEBUG No space left in frame 2019-03-18 11:10:00.603000 DEBUG No space left in frame 2019-03-18 11:10:00.625000 DEBUG MLX 90614 object=22.69 2019-03-18 11:10:00.646000 DEBUG MLX 90614 ambient=22.45 2019-03-18 11:10:00.667000 DEBUG No space left in frame 2019-03-18 11:10:00.688000 DEBUG No space left in frame 2019-03-18 11:10:00.710000 DEBUG No space left in frame 2019-03-18 11:10:00.734000 DEBUG TMP102 22.37 Celsius 2019-03-18 11:10:00.754000 DEBUG No space left in frame 2019-03-18 11:10:00.774000 DEBUG No space left in frame 2019-03-18 11:10:00.800000 DEBUG No space left in frame 2019-03-18 11:10:00.906000 DEBUG Distance 0 (mm): 0 2019-03-18 11:10:01.128000 DEBUG Distance 1 (mm): 0 2019-03-18 11:10:01.351000 DEBUG Distance 2 (mm): 0 2019-03-18 11:10:01.573000 DEBUG Distance 3 (mm): 0 2019-03-18 11:10:01.794000 DEBUG Distance 4 (mm): 0 2019-03-18 11:10:02.015000 DEBUG Distance 5 (mm): 0 2019-03-18 11:10:02.237000 DEBUG Distance 6 (mm): 0 2019-03-18 11:10:02.460000 DEBUG Distance 7 (mm): 0 2019-03-18 11:10:02.... Collapse This snippet was truncated for display; see it in full I launched this one this morning. and just recovered it it did 4 loops it is not the calculation of the alarm the prb and here I have compile with all sensor type but no network which equate to about 85% of memory usage

jdavid [1:10 PM] when you started the mote again after this crash the time was correct?

Simonf [1:10 PM] yes

jdavid [1:12 PM] the motes that have stopped, were all the J version of the firmware?

Simonf [1:13 PM] yes

jdavid [1:13 PM] and all the motes stopped? none survived?

Simonf [1:14 PM] yes well the one one hop away from the pi I cannot tell but all the one within the reach of the pi have stopped so you think it could be a firmware bug

jdavid [1:16 PM] maybe, we've had a couple of issues specific to the J already

Simonf [1:16 PM] that would be very frustrating

jdavid [1:16 PM] it's only an hypothesis

Simonf [1:16 PM] yes yes :slightly_smiling_face:

jdavid [1:17 PM] we go to sleep and don't awake, so it's like the RTC is not doing its job but when we turn it on again the RTC time is good

Simonf [1:17 PM] correct for instance, the one sven relaunched, he just push the on/off button as he had no USB cable and the timestamp is all fine just another instance with same symptoms

jdavid [1:23 PM] this is where we log the last message of the loop, https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/WaspUIO/WaspUIO.cpp#L228 we do a few things before going to sleep, but the fact that the motes were recovered with full battery, it seems the mote really went to sleep

Simonf [1:24 PM] yes

jdavid [1:24 PM] maybe you can test the watchdog really works with the J

Simonf [1:24 PM] how would I do that?

jdavid [1:25 PM] just before line 251, https://github.com/spectraphilic/wasp_sketches/blob/master/libraries/WaspUIO/WaspUIO.cpp#L251 actually, in line 251 you can replace alarmTime by a hard coded string

Simonf [1:25 PM] there two setWatchdog for the version above or equal to H line 243 and after

jdavid [1:26 PM] there's one before the mote sleeps, to force it to wake up if it doesn't wakes up by itself another one after it wakes up, in case the mote gets stuck while running in the loop, e.g. GPS taking too long replace PWR.deepSleep(alarmTime, RTC_ABSOLUTE, RTC_ALM1_MODE3, ALL_OFF); by ... PWR.deepSleep("00:23:59:00", RTC_ABSOLUTE, RTC_ALM1_MODE3, ALL_OFF); (edited) that means the mote will wake up this night at 23:59 so the watchdog should be triggered and the mote rebooted then you enter an infinite loop of reboots I'll do the same test here

Simonf [1:32 PM] and the other potential one is ```PWR.clearInterruptionPin(); (edited)

jdavid [1:33 PM] yes..

Simonf [1:34 PM] the watchdog worked

jdavid [1:35 PM] ok, here too

jdavid [1:40 PM] in http://www.libelium.com/downloads/documentation/waspmote-interruptions-programming_guide.pdf section 3.7 they say to call PWR.clearInterruptionPin(); before going to sleep there's an example below we do as they say, it's supposed to increase robustness, they say, have not looked deeper than that

Simonf [1:41 PM] it feels like the candidate for a bug though Untitled 4.4. Clear alarm flags It clears alarm flags ( A1F and A2F ) in the RTC. The alarm flag A2F is related to the Watchdog functionality. If these flags are not cleared after an interrupt is captured, no more interrupts will be captured. Example of use: {

} RTC.clearAlarmFlag(); Collapse p11 of the RTC programing guide I will read carefully the interruption guide

jdavid [1:44 PM] we don't call that one anywhere not directly at least, there're a number of places in the api where it's called it's called everytime the RTC is switched ON, so we should call it at least once https://github.com/spectraphilic/waspmoteapi-uio/blob/v036/waspmote-api/WaspRTC.cpp#L112 oh there's a new version of the api, v040, so soon after v039

Simonf [1:49 PM] i saw that but it is for some ftp support

jdavid [1:49 PM] ah ok

Simonf [1:50 PM] why wouldn't we clear the RTC flags before the sleep ? just to prevent any prb

jdavid [1:53 PM] actually I think we do, just not directly when we set the watchdog, or program the rtc to awake as later these all are calls to the RTC, which first must be switched on, and that calls to clearAlarmFlag unless the RTC was ON already, but I'm pretty sure it's not there's an option to debug the RTC, in waspmote-api/WaspRTC.h define DEBUG_RTC 1 will try it now this afternoon I'll have to work on another project, but still be available here, I've to split my time

Simonf [1:58 PM] ok

jdavid [1:59 PM] you succeeded to reproduce the error in only 4 loops, maybe it was only good luck, but it would help a lot if you found a configuration where the issue pops up quickly

Simonf [1:59 PM] there are 2 watchdog

jdavid [2:00 PM] yes, the other one is internal to the MCU

Simonf [2:00 PM] I think it was luck

jdavid [2:00 PM] only useful for short periods of time

Simonf [2:00 PM] ok

jdavid [2:01 PM] up to 8s it's defined in waspmote-api/WaspPWR.h

Simonf [2:02 PM] could it have anything to do with the ACC, as we have not used it in the past, and it has interuption capabilities

jdavid [2:02 PM] may be, and it uses I2C as well, just like the RTC

jdavid [2:09 PM] well DEBUG_RTC is useless, they don't use it will insert some prints

Simonf [2:09 PM] :grinning: p25, they speak about interference with the multiplexer but I suspect this is taken care of already

jdavid [2:14 PM] Untitled 1553605987.909 INFO Frame saved to SD (4 fields in 68 bytes) 1553605987.933 INFO Loop done in 7933 ms 1553605988.010 DEBUG Next alarm at 00:13:14:00 in 1 minutes [RTC] WaspRTC:ON adding a print in the WaspRTC::ON method it's called once before we go to sleep so the alarm flags should be cleared already have to go for lunch

Simonf [2:15 PM] sure enjoy

jdavid [7:04 PM] have you found something else?

Simonf [7:04 PM] no I have launched a mote with 4G I need to look at it as it does not send data (edited) I explained the prb to John who just came back from the field but earlier that same mote was behaving all fine

jdavid [7:07 PM] I'll open an issue and post the discussion here

Simonf [7:07 PM] thanks it looks like the pi is unreachable

ArcticSnow commented 5 years ago

Should we close that one? or at least put aside as we figured that there is a great chance this was due to shorts on the I2C line.

jdavid commented 5 years ago

Okay. Just a couple of questions. We have had a number of issues as well with other motes in Oslo:

Questions:

ArcticSnow commented 5 years ago

Concerning your questions:

jdavid commented 5 years ago

ok closing