opendata-stuttgart / sensors-software

sourcecode for reading sensor data

New beta (published April 2, 2024), please add problems here #1020

Open ricki-z opened 7 months ago

ricki-z commented 7 months ago

At the moment the SCD30 seems to stop after some time. This looks like an I2C issue; the bus speed might need to be decreased, but that may cause problems with other devices.

Phaze-III commented 7 months ago

Thanks for creating this issue! The current firmware (from the alpha directory and compiled from the current beta branch) is very unstable on my sensors (NodeMCU v2; one with an SDS011 and a BME280, the other without any sensor at the moment, just for testing firmware).

They frequently crash with a 'Software Watchdog' or 'Exception' error. The longest uptime is less than 3 hours. The nodes do recover but show lots of gaps in the measurements. This is most likely caused by the combination of parts of #1017 and #1019. I suspect that with the added SEN5X code, memory is sometimes exhausted.

Reverting #1019 results in stable firmware capable of OTA-upgrades again but no SEN5X.

Reverting the relevant parts of #1017 also gives a relatively stable firmware with SEN5X, but that re-introduces the OTA problem (#915): the node completely freezes when trying to download an update and can only be restored by re-flashing.

I've created a few branches to test the above mentioned reverts:

https://github.com/Phaze-III/sensors-software/tree/feature/revert-beta-sen5x-part-of-pr1019

https://github.com/Phaze-III/sensors-software/tree/feature/revert-beta-ota-workaround-part-of-pr1017

I also looked into isolating the SEN5X code to make it possible to (de)activate the code with a compile-time define. That code (created with some diff -D magic) is available at commit d2683781fa5f4d33feccb0699ef4f18d8440aed3 (branch https://github.com/Phaze-III/sensors-software/tree/feature/beta-ifdef-sen5x-enclosing).

We might do something like that for other sensors and perhaps provide pre-compiled binaries for different combinations of sensors.
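
For readers unfamiliar with the idea, here is a minimal sketch of what such a compile-time switch could look like. The macro name `ENABLE_SEN5X` and the functions `sen5xName()` and `enabledSensors()` are hypothetical and not taken from the branch above; the point is only that code inside the `#ifdef` is left out of the binary when the macro is not defined.

```cpp
#include <string>
#include <vector>

// Hypothetical build flag: define ENABLE_SEN5X (for example with the
// compiler's -D option) to include the SEN5X code in the build.
// #define ENABLE_SEN5X

#ifdef ENABLE_SEN5X
// Only compiled, and only costing flash and RAM, when the flag is set.
static std::string sen5xName() { return "SEN5X"; }
#endif

// Returns the names of the sensor drivers compiled into this build.
std::vector<std::string> enabledSensors() {
    std::vector<std::string> sensors;
    sensors.push_back("SDS011");
    sensors.push_back("BME280");
#ifdef ENABLE_SEN5X
    sensors.push_back(sen5xName());
#endif
    return sensors;
}
```

Pre-compiled binaries for different sensor combinations would then just be builds with different sets of such flags.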

ricki-z commented 7 months ago

I have a node that has been running since yesterday evening: https://api-rrd.madavi.de:3000/grafana/d/GUaL5aZMz/pm-sensors?var-chipID=esp8266-13597771&orgId=1 This node seems to run without the described problems...

Phaze-III commented 7 months ago

I do see some gaps which could also be upload errors. The last one around 18:06. What's the uptime on that node?

ricki-z commented 7 months ago

Okay, the uptime is 4 hours, 37 minutes (at 22:47 CEST). The reset reason was an exception. I think we should remove the SEN5x code for the moment. Priority should be a stable version for the new certs.

Phaze-III commented 7 months ago

That looks like a crash around the time of the gap in the graph. My guess would be that if you check the uptime regularly, it will be a few hours at most.

jmparatte commented 7 months ago

> At the moment the SCD30 seems to stop after some time. This looks like an I2C issue; the bus speed might need to be decreased, but that may cause problems with other devices.

My SCD30 stopped because a small insect laid eggs... After disassembly of the SCD30, cleaning and reassembly, everything is now OK.

Phaze-III commented 7 months ago

> I think we should remove the SEN5x code for the moment. Priority should be a stable version for the new certs.

Agreed. I created #1022 to do that. If needed I can also create a PR to sync the current beta-sen5x with beta so people can still use that branch to compile SEN5X firmware.

Phaze-III commented 7 months ago

I've also been looking at other potential stability issues. There are two situations where one might lose control of the sensor. Both happen when a sensor is selected in the configuration without that type of sensor actually being connected, and most likely also when there are wiring problems with the sensor.

  1. Select 'Tera Sensor Next PM' and save/restart. After restarting, the sensor goes into a really fast "Wait for Serial" loop, and getting into the web interface to fix it is no longer possible.

  2. Select 'Piera Systems IPS-7100' and save/restart. After restarting the firmware immediately crashes and goes into a reboot/crash loop.

Both situations can only be fixed by reflashing the FS with the config.

All other sensors appear to have some check that the sensor can be read or otherwise don't result in a lost sensor.
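
A minimal sketch of the kind of check meant here, not the firmware's actual code: the sensor is probed a bounded number of times, and if it never answers, the caller can disable it instead of looping forever. `sensorResponds` and the `probe` callback are hypothetical names that stand in for a real driver call.

```cpp
#include <cassert>
#include <functional>

// Calls `probe` up to `maxAttempts` times and reports whether the sensor answered.
// `probe` is a stand-in for a real driver call (hypothetical); it returns true
// when a valid reply was read from the sensor.
bool sensorResponds(const std::function<bool()>& probe, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (probe()) {
            return true;  // sensor answered: safe to keep polling it
        }
    }
    return false;  // no reply: caller can disable the sensor and carry on
}
```

On the ESP8266 a real version would presumably also need a short delay or yield between attempts so that the watchdog and the web server keep getting serviced.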

ricki-z commented 7 months ago

I will upload the current version with PR #1022 to the alpha folder. If this is working I will copy it to the beta tomorrow.

ricki-z commented 7 months ago

Regarding the newer sensor types Next PM and IPS-7100, I think the code needs some clean-up. I haven't had the time to check this thoroughly before merging ...

Phaze-III commented 7 months ago

> If this is working I will copy it to the beta tomorrow.

The firmware from the alpha folder is running fine on my nodes, no crashes. I also tested some other languages than the usual ones and they are OK too.

OTA upgrades from the alpha-firmware also work, they will fetch the currently published beta (or release update) and loader and restart with the downloaded firmware.

So no objections from me to copy them to the beta tree :-)

ricki-z commented 7 months ago

Okay. My test sensor has been running for 23 and a half hours now. The alpha will become the new beta in a few minutes.

Phaze-III commented 7 months ago

Good news, thanks! I just did a successful OTA upgrade from the published stable firmware (NRZ-2020-133/NL (Nov 29 2020)) to beta. So hopefully more people can now help in testing the beta.

ricki-z commented 7 months ago

My test device has been running for more than 43 hours now. Time to move the beta to stable? I will then make some cosmetic changes and increase the version numbers before publishing the stable version.

ricki-z commented 7 months ago

I've checked the size of the binaries. We may get a problem with updates in some languages where the total of the loader and the firmware is larger than the available flash memory (1 MB) ... I will try the language with the largest binary now. For future firmware releases we may need to 'optimize' the firmware again (e.g. by removing some of the ciphers again; do we really need ChaCha20?).

ricki-z commented 7 months ago

Okay, the Bulgarian firmware, as the largest, is working. So we should be ready to go live with the current beta and move it to the stable branch. (The stable version will be named NRZ-2024-135)

Phaze-III commented 7 months ago

Actually I tested an OTA update of all languages yesterday :) They all went fine. Note that the spiffs fs size where the binaries and config are stored is actually 3MB which leaves room for both the old and new firmware and the loader.

I'm still a bit worried about the RAM usage of the extra sensors (especially when new stuff is added in the future) but merging beta to stable should be OK (the few conflicts are easy to resolve).

I would suggest, however, first copying the NRZ-2024-135 builds to the 'alpha' directory for a first preview and a few test runs. Any build can still show worse performance, and compilation might need some tweaking. I don't expect that to be the case, but better safe than sorry before updating 10K+ sensors :-)

ricki-z commented 7 months ago

For the update process both the loader and the new firmware need to be copied to the 1 MB system flash. The memory usage during the update process is described here https://arduino-esp8266.readthedocs.io/en/latest/ota_updates/readme.html#update-process-memory-view

Phaze-III commented 7 months ago

When I was testing the OTA-problem last year I used some extra code to list the size and contents of the spiffs partition. That shows something like this after an OTA update:

airRohr: NRZ-2023-134-B5/EN
mounting FS...
opened config file...
parsed json...
File system info:
Total space:      2949250 bytes
Total space used: 950537 bytes
Block size:       8192 bytes
Page size:        2949250 bytes
Max open files:   5
Max path lenght:  32

Files found:
/loader.bin - 311632
/firmware.old - 626736
/config.json.old - 1510
/config.json - 1510

The next update will first store the new firmware binary there before flashing it to the 1MB program partition. So even with the larger binaries there will still be enough room.
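
Plugging the figures from the listing above into a quick check (a sketch only; `FsInfo`, `freeBytes` and `fitsInFreeSpace` are hypothetical names, and real file systems add some per-file overhead that is ignored here):

```cpp
#include <cassert>
#include <cstdint>

// Total and used size of the file system, as printed in the listing above.
struct FsInfo {
    std::uint32_t totalBytes;
    std::uint32_t usedBytes;
};

// Bytes still available for a staged download.
std::uint32_t freeBytes(const FsInfo& fs) {
    assert(fs.usedBytes <= fs.totalBytes);
    return fs.totalBytes - fs.usedBytes;
}

// True when a download of the given size fits into the remaining space.
bool fitsInFreeSpace(const FsInfo& fs, std::uint32_t downloadBytes) {
    return downloadBytes <= freeBytes(fs);
}
```

With 2949250 bytes total and 950537 bytes used, `freeBytes` gives 1998713 bytes, which is roughly three times the size of the 626736-byte `firmware.old` in the listing.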

Phaze-III commented 7 months ago

> For the update process both the loader and the new firmware need to be copied to the 1 MB system flash. The memory usage during the update process is described here https://arduino-esp8266.readthedocs.io/en/latest/ota_updates/readme.html#update-process-memory-view

Ah, okay. So we're close to exhausting the 1M space. Is the current code actually using the described OTA Update process?

ricki-z commented 7 months ago

This is the reason we use the two-step update. Otherwise we would have only 0.5 MB for the firmware. And we haven't found a way to 'resize' the file system without 'killing' most of the devices.
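
The size constraint can be illustrated with a small sketch. It is a simplified model based only on the comments in this thread, not on the actual updater code: in a one-step OTA the running firmware and the downloaded firmware have to share the program area, while in the two-step update it is the loader and the new firmware that have to fit together. All function names are hypothetical.

```cpp
#include <cstdint>

// True when two images that must coexist in the program area fit together.
bool fitTogether(std::uint32_t areaBytes, std::uint32_t imageA, std::uint32_t imageB) {
    // 64-bit sum so that two large sizes cannot overflow.
    return static_cast<std::uint64_t>(imageA) + imageB <= areaBytes;
}

// One-step OTA: old and new firmware share the program area,
// so each can use only about half of it.
bool fitsOneStep(std::uint32_t areaBytes, std::uint32_t oldFirmware, std::uint32_t newFirmware) {
    return fitTogether(areaBytes, oldFirmware, newFirmware);
}

// Two-step update as described in this thread: loader and new firmware share it.
bool fitsTwoStep(std::uint32_t areaBytes, std::uint32_t loader, std::uint32_t newFirmware) {
    return fitTogether(areaBytes, loader, newFirmware);
}
```

Using sizes reported in this thread purely as an illustration (a 1044464-byte program area, a 311632-byte loader, a 626736-byte old firmware and a 701191-byte new firmware), the one-step check fails, while the two-step check passes with about 31 KB to spare.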

ricki-z commented 7 months ago

The master build with the current version is uploaded to the alpha folder.

Phaze-III commented 7 months ago

Thanks. The firmware from the alpha folder has been running fine overnight on my test node with the same performance characteristics (sample rate, memory usage and heap fragmentation, web ui response time, upload time and working auto update) as the beta. So the 'alpha' builds are OK with one minor intl detail for the CZ version.

I used a local compile of the master branch on my 'production' node with the CZ version and noticed (that is, my script to collect the statistics did) that the definitions for INTL_NUMBER_OF_MEASUREMENTS and INTL_TIME_SENDING_MS are now identical:

intl_cz.h 
#define INTL_NUMBER_OF_MEASUREMENTS "Počet měření"
#define INTL_TIME_SENDING_MS "Počet měření"

Google Translate suggests #define INTL_TIME_SENDING_MS "Trvání odesílání dat". Perhaps you could change that before publishing.

All in all I think we're good to go :-)
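
A check for this kind of duplicate could be automated roughly as follows. This is a minimal sketch, not the statistics script mentioned above; `findDuplicateTexts` is a hypothetical helper that only understands simple one-line `#define NAME "text"` entries.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Parses lines of the form:  #define NAME "text"
// and returns, for every text used more than once, the macro names that share it.
std::map<std::string, std::vector<std::string>> findDuplicateTexts(const std::string& header) {
    std::map<std::string, std::vector<std::string>> byText;
    std::istringstream lines(header);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string directive, name;
        if (!(fields >> directive >> name) || directive != "#define") {
            continue;  // not a simple #define line
        }
        const std::string::size_type open = line.find('"');
        const std::string::size_type close = line.rfind('"');
        if (open == std::string::npos || close <= open) {
            continue;  // no quoted string value on this line
        }
        byText[line.substr(open + 1, close - open - 1)].push_back(name);
    }
    // Keep only texts that are shared by two or more macros.
    for (auto it = byText.begin(); it != byText.end();) {
        if (it->second.size() < 2) {
            it = byText.erase(it);
        } else {
            ++it;
        }
    }
    return byText;
}
```

Fed the two `intl_cz.h` lines quoted above, it would report one shared text with both macro names. Some translations may repeat a string on purpose, so the result is a list of candidates to review rather than a list of errors.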

Phaze-III commented 7 months ago

> For future firmware releases we may need to 'optimize' the firmware again (e.g. by removing some of the ciphers again; do we really need ChaCha20?).

ChaCha20 should be less CPU-intensive than AES-GCM based ciphers, although I'm not sure how that works out on an ESP board. Furthermore, it is a very common cipher on strict HTTPS servers. For example, someone on the forum runs his own API server with the following server-side setup:

PORT    STATE SERVICE
443/tcp open  https
| ssl-enum-ciphers: 
|   TLSv1.2: 
|     ciphers: 
|       TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (secp256r1) - A
|       TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (secp256r1) - A
|       TLS_DHE_RSA_WITH_AES_256_GCM_SHA384 (dh 2048) - A
|     compressors: 
|       NULL
|     cipher preference: server
|   TLSv1.3: 
|     ciphers: 
|       TLS_AKE_WITH_CHACHA20_POLY1305_SHA256 (ecdh_x25519) - A
|       TLS_AKE_WITH_AES_128_GCM_SHA256 (ecdh_x25519) - A
|       TLS_AKE_WITH_AES_256_GCM_SHA384 (ecdh_x25519) - A
|     cipher preference: server
|_  least strength: A

Uploading over HTTPS to the custom API on that server only worked with the beta. I would keep ChaCha20. Would disabling it save much?

ricki-z commented 7 months ago

I will go through the ciphers list.

A quick comparison:

Current cipher list in beta:
RAM: [==== ] 41.8% (used 34220 bytes from 81920 bytes)
Flash: [======= ] 67.1% (used 701191 bytes from 1044464 bytes)

BearSSL Basic (cipher list in the latest published stable):
RAM: [==== ] 41.8% (used 34220 bytes from 81920 bytes)
Flash: [======= ] 65.1% (used 680247 bytes from 1044464 bytes)
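
To put the two builds side by side, the difference can be worked out as below. This is a small sketch with hypothetical helper names; it compares the two complete cipher lists and says nothing about what ChaCha20 alone costs.

```cpp
#include <cstdint>

// Difference in flash usage between two builds, in bytes.
std::int64_t flashDeltaBytes(std::uint32_t withFullList, std::uint32_t withBasicList) {
    return static_cast<std::int64_t>(withFullList) - static_cast<std::int64_t>(withBasicList);
}

// Expresses a byte count as a percentage of the available program flash.
double percentOf(std::int64_t partBytes, std::uint32_t totalBytes) {
    return 100.0 * static_cast<double>(partBytes) / static_cast<double>(totalBytes);
}
```

With the numbers above, `flashDeltaBytes(701191, 680247)` is 20944 bytes, about 2.0% of the 1044464-byte program area, while RAM usage is the same in both builds.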

ricki-z commented 7 months ago

I've seen some parts where we may be able to save memory, e.g. the display part or the newly added sensors (NextPM, IPS-7100).

ricki-z commented 7 months ago

The new stable (master) version NRZ-2024-135 is now live.

Phaze-III commented 7 months ago

Thanks Rajko! I've already seen a bunch of sensors auto-updating and still running fine. Do you monitor the distribution of installed versions (e.g. based on uploads or firmware requests)?

Phaze-III commented 7 months ago

One of the nice things about this release is that now more people can try the Power Save feature. I've been running my sensor in power save mode most of the time for almost a year now. The web-interface feels a little bit more sluggish (response time ~200ms instead of ~30ms but still very acceptable) but otherwise no real issues.

In my case the power consumption is almost half of the normal power consumption (~450 mW instead of ~900 mW):

[Image: Screenshot 2024-04-07 at 22 18 07]

Now if all 10K+ sensors would run in Power Save mode that would save the equivalent of the power consumption of 15 average Dutch households :-)
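
The arithmetic behind that estimate, as a rough sketch: the sensor count and the power figures are the round numbers from this comment, and `yearlySavingKWh` is a hypothetical helper.

```cpp
// Energy saved per year, in kWh, if `sensors` devices each drop from
// `normalMilliwatts` to `powerSaveMilliwatts` and run around the clock.
double yearlySavingKWh(int sensors, double normalMilliwatts, double powerSaveMilliwatts) {
    const double savedWatts = sensors * (normalMilliwatts - powerSaveMilliwatts) / 1000.0;
    const double hoursPerYear = 24.0 * 365.0;
    return savedWatts * hoursPerYear / 1000.0;  // Wh to kWh
}
```

For 10,000 sensors going from 900 mW to 450 mW this gives 39,420 kWh per year. Divided over 15 households that is about 2,600 kWh each, which is the per-household consumption the comparison implies rather than a figure stated here.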

Petarkir2000 commented 7 months ago

Two stations have been working fine for 24 hours with the new 135 version. The problem occurred on one of them with the previous beta version: it didn't want to update to the new one. After a re-flash with the non-beta firmware, the OTA update went through.

Phaze-III commented 7 months ago

> The problem occurred on one of them with the previous beta version: it didn't want to update to the new one.

That is expected behaviour. The Auto-Update of the previous beta (NRZ-2021-134-B4) is broken. People running NRZ-2021-134-B4 can only update with a re-flash.

dokape commented 7 months ago

Hi, long time no see,

My sensor did an update as expected during the night. Checked now and have an issue.

I have both a DHT22 and a BME280. After the update, the values of the BME280 are not shown anymore, just those of the DHT22. Rebooting, and deactivating and reactivating the sensor, did not help.

jmparatte commented 7 months ago

Congratulations! Mine updated automatically about 4 hours ago; OpenSenseMap is now reactivated.

Phaze-III commented 7 months ago

@dokape I don't have a DHT22 sensor so can't check but did you check what happens if you only activate BME280 and let it run for a while after a save/reboot?

dokape commented 7 months ago

Yes, it was not recognized. The log is at maximum log level, with no DHT22, just the BME280.

[Attachment: airrohr, FW 134.txt] [Image: Screenshot 2024-04-08 140017]

I installed the airrohr after moving to another house 4 years ago and have had no problems, except that it stopped twice. A power reset restarted the sensor and it worked fine again. The last stable was really a very stable version!

Edit: Power Reset did not change the behavior. BME values are gone.

Phaze-III commented 7 months ago

Is it possible for you to capture the debug information just after the restart? It should show something like:

Read SDS...: 
Stopping SDS011...
Read BMx280...
Trying BMx280 sensor on 77 ... not found
Trying BMx280 sensor on 76 ... found
Send to :
sensor.community
Madavi.de

dokape commented 7 months ago

Ah, I think I've got something. It looks like it is related to the translations. I'm investigating.

dokape commented 7 months ago

Translation issue.

The German and English screens differ. Different values: German shows BMP, English shows BME, with a different number of values.

[Image: german-BME-missing]

[Image: english-morre values]

[Image: english-BME_works]

Phaze-III commented 7 months ago

I can't really explain that from the code. Somehow it looks like in your case the German version detects a BMP280 instead of the actual BME280. On my sensor the German version (and all other versions) correctly detects and shows a BME280. This is what I get when selecting both sensors (but as said I don't have a DHT22 connected):

[Image: Screenshot 2024-04-08 at 14 40 40]

Do you get both sensors again when also selecting DHT22 using the English version?

dokape commented 7 months ago

English: DHT and BME work fine together.

This sensor hardware is about 6 to 8 years old and has worked fine so far. Only the DHT22 has been reporting wrong humidity for some years. Wasn't there an issue about some weird IDs for the BME some years ago?

[Image: engl-BME-DHT_works]

Phaze-III commented 7 months ago

Did you or could you try switching back to the German version again? I wouldn't be surprised if it detects the BME280 properly again.

dokape commented 7 months ago

Switched back to German and …

You will surely not be surprised:

It works fine

I know, I am the one to find strange behaviours. As always.

I guess it would be no fun to reproduce this behaviour.

[Image: IMG_6596]

Phaze-III commented 7 months ago

Most likely some issue with registers not cleared correctly during the update. Might also have been solved with removing power for a few minutes and powering back up. Thanks for reporting!

dokape commented 7 months ago

The power reset did not work. thanks for reading!

ricki-z commented 7 months ago

@Phaze-III You've asked for some stats:

https://stats.sensor.community/scripts/active_sensors.php (sensors active in the last 5 minutes; there were around 11,650 devices yesterday before publishing the update)

https://stats.sensor.community/scripts/firmware_versions.php (installed firmware versions; we are mostly done ;-) )

Phaze-III commented 7 months ago

@ricki-z Nice statistics. Good to see that the update went/is going smoothly!

issteve commented 7 months ago

Just to let you know, the update killed my connection to the sensor for about an hour. I was no longer able to get on the web interface and didn't receive any data via API (madavi and my own in the local network). Before the update it had been running fine for a couple of years with a DHT22 and an SHT3X (plus an SDS011 that is physically connected but not activated in the menu). And while my other sensors did get back online within about 15 minutes, this took about an hour...

Is there any need for debugging the issue? Or is this behaviour acceptable as it solves itself within an hour?

Phaze-III commented 7 months ago

It happens sometimes. When a sensor detects new firmware on the server it starts downloading the firmware (the actual .bin file, the .md5 for checking for corruption, the loader and md5). Each of these downloads can fail and normally the sensor stops the update process, continues to operate normally and tries again after 24 hours.

It can however also crash during the downloads, after which it will start the update process again after reboot. It can happen that such a crash/restart/try-again cycle happens a few times in a row. Most of the time one of those cycles will succeed. I haven't been able to pinpoint a cause, but it tends to happen under less than optimal WiFi conditions (say below -80 dBm with a lot of noise) or with older and perhaps more worn out hardware.

Another scenario is that during the first start after the update the sensor crashes with a Software Watchdog or Exception, and it might do that a few times in a row. Most of the time it recovers after a while, but sometimes only a power reset (keeping the power off for a few minutes) helps.

I would say it is acceptable.
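
The normal, non-crashing path described above can be summarised in a small sketch. It is a simplified model, not the firmware's actual updater: the file names are hypothetical placeholders, and `download` stands in for whatever fetches one file and reports success.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class UpdateResult { Installed, RetryLater };

// Tries to fetch all four files; the first failed download aborts the update.
// In the firmware that means normal operation continues and the next attempt
// is made about 24 hours later.
UpdateResult tryUpdate(const std::function<bool(const std::string&)>& download) {
    // Hypothetical names for: firmware binary, its MD5, loader, its MD5.
    const std::vector<std::string> files = {
        "firmware.bin", "firmware.bin.md5", "loader.bin", "loader.bin.md5"};
    for (const std::string& file : files) {
        if (!download(file)) {
            return UpdateResult::RetryLater;
        }
    }
    return UpdateResult::Installed;
}
```

The crash/restart cycles mentioned above fall outside this model: after a reboot the sensor simply starts the whole sequence again.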

GoetzGoerisch commented 7 months ago

~~My sensor still did not get the OTA update (NRZ-2021-134-B4/DE). It is not listed in the two statistics published last week, but it is publishing up-to-date data on sensor.community.~~

Sorry for the noise, overlooked https://github.com/opendata-stuttgart/sensors-software/issues/1020#issuecomment-2042528127

GoetzGoerisch commented 7 months ago

@ricki-z Thank you for the new FW, existing sensor running fine after a manual update, now also shown in your stats above. (NRZ-2024-136-B1 (Beta))

To have a backup I wanted to set up a new device. Flashing works fine with the airrohr-flasher, although I cannot initially set a WiFi password, as the new screenshots indicate.

Secondly, the sensor comes up in AP mode as airRohr-<ID>, but I cannot access the WiFi, as it asks for a password which I never set.

Found it in some other issue: https://github.com/opendata-stuttgart/sensors-software/issues/916#issuecomment-904079624

Thank you.