helium / gateway-rs

The Helium Gateway
Apache License 2.0

Data transferred too high #101

Closed disk91 closed 2 years ago

disk91 commented 3 years ago

After upgrading to alpha-16, hourly data usage has jumped to over 100MB/h. This is impossible to sustain over an LTE connection at a reasonable cost.

madninja commented 3 years ago

This is odd as we’ve not even switched to using state channels yet. The new structure actually streams packets and responses back and forth but may receive an unused banner every time a new state channel is opened.

I don’t think the latter would cause that kind of traffic. I.e. for just packet flow this is no different than before except it actually works this time.

Are you able to characterize the device usage for those gateways? And can you confirm this data usage is not because of update retries?

disk91 commented 3 years ago

I've switched my gateway-rs on a backend server until this is solved. The Semtech traffic is 1.5MB from 0am to 7am, compared to 100MB per hour. I don't think this is related to device traffic.

madninja commented 3 years ago

  1. What do you mean by "I've switched my gateway-rs on a backend server"?
  2. Could you help diagnose this by checking the logs for many "installing update" calls? I'm wondering if that small root volume is causing disk space to run out for updates and that it keeps trying to download the same update over and over again.

disk91 commented 3 years ago

1) I have migrated my gateway-rs from the gateway itself to a hosted server in a data center and connected the Semtech protocol to that server.

2) Unfortunately I no longer have the logs from Sunday. I assume we could have such an issue, since I was not running the latest version when starting it up. If I remember correctly, though, the traffic also kept growing after I did it. The location of the opkg download also matters; using /etc would not be good.

madninja commented 3 years ago

I'm pretty sure this was related to repeated upgrade download attempts that failed, since there was otherwise no change to the packet routing logic. I believe that if you install alpha.17 on your gateway by hand you'd be back to the same traffic patterns you had before.

gateway-rs stores downloaded updates in /tmp before trying to install them. I don't know what your system has storage-wise in that location, but it would be good to know if that was the issue.

disk91 commented 3 years ago

I've set up a new hotspot running version alpha-17. I'll need to monitor this and tell you more about it. It's not registered on the blockchain yet. Will this impact the test results?

madninja commented 3 years ago

No, it should not affect the test results. Note I've also just tagged alpha.18 to attempt a fix for an occasional devaddr matching issue.

disk91 commented 3 years ago

At least I got this:

Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  ERRO failed to install update Collected errors:
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  * pkg_write_filelist: Failed to open //usr/lib/opkg/info/helium_gateway.list: No space left on device.
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  * opkg_install_pkg: Failed to extract data files for helium_gateway. Package debris may remain!
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  * opkg_install_cmd: Cannot install package helium_gateway.
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  * opkg_conf_write_status_files: Can't open status file //usr/lib/opkg/status: No space left on device.
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  * pkg_write_filelist: Failed to open //usr/lib/opkg/info/helium_gateway.list: No space left on device.
Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]: , module: updater
Tue Sep 21 20:47:32 2021 daemon.err helium_gateway[8475]: Error: IO(Custom { kind: Other, error: "Collected errors:\n * pkg_write_filelist: Failed to open //usr/lib/opkg/info/helium_gateway.list: No space left on device.\n * opkg_install_pkg: Failed to extract data files for helium_gateway. Package debris may remain!\n * opkg_install_cmd: Cannot install package helium_gateway.\n * opkg_conf_write_status_files: Can't open status file //usr/lib/opkg/status: No space left on device.\n * pkg_write_filelist: Fa
Tue Sep 21 20:47:33 2021 user.notice lora_pkt_fwd[10735]: 
Tue Sep 21 20:48:02 2021 daemon.info procd: Instance helium_gateway::instance1 s in a crash loop 6 crashes, 0 seconds since last crash

alpha.18 was downloaded automatically but not installed; it is still in the /tmp directory. /tmp is not full.
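
The earlier request to check the logs for repeated update attempts can be approximated offline. The following Python sketch counts `helium_gateway` syslog lines that contain the failure message seen in the excerpt above and reports how many distinct process IDs logged it. The function name, the regular expression, and the use of distinct PIDs as a rough proxy for restart cycles are assumptions made for illustration; none of this is part of gateway-rs.

```python
import re

# Syslog layout as seen in the excerpt above, e.g.
# "Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  ERRO failed ..."
LINE_RE = re.compile(
    r"^\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4} "  # timestamp
    r"\S+ "                                         # facility.level
    r"(?P<proc>[\w.-]+)\[(?P<pid>\d+)\]: "          # process[pid]:
    r"(?P<msg>.*)$"
)


def summarize_update_failures(log_text, needle="failed to install update"):
    """Return (failure_count, distinct_pid_count) for helium_gateway lines."""
    failures = 0
    pids = set()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        # Lines without a "process[pid]:" prefix (such as procd's) are skipped.
        if match is None or match.group("proc") != "helium_gateway":
            continue
        if needle in match.group("msg"):
            failures += 1
            pids.add(match.group("pid"))
    return failures, len(pids)


# Two lines copied from the excerpt above.
sample = "\n".join([
    "Tue Sep 21 20:47:32 2021 daemon.info helium_gateway[8475]:  ERRO failed to install update Collected errors:",
    "Tue Sep 21 20:48:02 2021 daemon.info procd: Instance helium_gateway::instance1 s in a crash loop 6 crashes, 0 seconds since last crash",
])
print(summarize_update_failures(sample))
```

A failure count that keeps climbing across many PIDs would be consistent with the install failing, the process restarting, and the update being attempted again.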

madninja commented 3 years ago

Ah ok, I see: your opkg install folder doesn't have enough space to extract and install gateway-rs updates, which will likely cause it to try updating again, which causes another download. (1) Do you have a volume that has more space, and (2) did you make space by deleting the /etc/helium_gateway/cache folder?

disk91 commented 3 years ago

This is a brand-new RAK, so there is no cache history.

Another issue: when upgrading, opkg removes the previous settings.toml, so the region is lost.

There is space available on /usr/lib/opkg/info:

root@RAK7240:/usr/lib/opkg/info# df -h .
Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.4M      1.7M      1.8M  49% /

madninja commented 3 years ago

That's not enough space for an upgrade; the binary for helium_gateway itself is already bigger than that.

jmarcelino commented 3 years ago

We're trialling a cut-down firmware (the standard RAK firmware has too many functions, like a built-in LNS, which are not necessary for a Helium setup), which should free up more space.

disk91 commented 3 years ago

The ipk has been downloaded, so that is not the problem: it is in /tmp, where 55.1M is still available. The problem is related to the automated installation.

When running the installation manually (opkg install ... downloaded ipk), I got:

opkg install helium-gateway-v1.0.0-alpha.18-ramips_24kec.ipk 
Upgrading helium_gateway on root from 1.0.0-alpha.17 to 1.0.0-alpha.18...
Configuring helium_gateway.
Collected errors:
 * pkg_get_installed_files: Failed to open //usr/lib/opkg/info/helium_gateway.list: No such file or directory.
 * pkg_get_installed_files: Failed to open //usr/lib/opkg/info/helium_gateway.list: No such file or directory.
 * pkg_get_installed_files: Failed to open //usr/lib/opkg/info/helium_gateway.list: No such file or directory.
 * pkg_get_installed_files: Failed to open //usr/lib/opkg/info/helium_gateway.list: No such file or directory.

settings.toml has been replaced, with the region setting missing.

Currently ... but it seems to be running on alpha-18 after this installation and these changes, whereas the automated process crashed.

disk91 commented 3 years ago

I can also add a consumption report for the last 24h: RX: 416MB, TX: 6.74MB
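
To put these figures in perspective, the following Python sketch converts the 24-hour RX report into an hourly rate and projects both that rate and the 100MB/h figure from the start of the thread over a month. The 30-day month and decimal units (1GB = 1000MB) are assumptions, and the function names are illustrative only.

```python
def mb_per_hour(total_mb, hours):
    """Average transfer rate in MB/h over the given window."""
    return total_mb / hours


def projected_gb(rate_mb_per_hour, days=30):
    """Project a constant hourly rate over `days` days (1 GB = 1000 MB)."""
    return rate_mb_per_hour * 24 * days / 1000


rx_rate = mb_per_hour(416, 24)  # RX figure from the 24h report
print(f"average RX rate: {rx_rate:.1f} MB/h")
print(f"30-day projection at that rate: {projected_gb(rx_rate):.2f} GB")
print(f"30-day projection at 100 MB/h: {projected_gb(100):.1f} GB")
```

Even the lower 24-hour average adds up to several gigabytes per month, which matters on a metered LTE plan.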

madninja commented 3 years ago

I know it's not /tmp.. the root filesystem is too small during opkg installation?

And don't remove the region in settings.toml? otherwise your region is incorrectly reported

madninja commented 3 years ago

And yes the data consumption is all about the updater continuously trying to upgrade

disk91 commented 3 years ago

"And don't remove the region in settings.toml? otherwise your region is incorrectly reported"

The updater is removing it.

"I know it's not /tmp.. the root filesystem is too small during opkg installation?"

/tmp is OK and has enough space to hold the file. I'm able to run the install manually, but auto-installation is not working correctly.

"And yes the data consumption is all about the updater continuously trying to upgrade"

I don't understand why it keeps downloading when the file is already on the file system. This could be improved.

madninja commented 3 years ago

And don't remove the region in settings.toml? otherwise your region is incorrectly reported Updater is removing it.

That's the first I've heard of this.. I'll have to try to reproduce it.. not quite the subject of this bug though

I know it's not /tmp.. the root filesystem is too small during opkg installation? /tmp is ok and has enough space for getting the file on it. I'm able to execute the install manually but auto-installation is not working correctly

That's good to know.. Is there any way you can get the installation logs after an auto-update attempt to see what went wrong?

And yes the data consumption is all about the updater continuously trying to upgrade I don't understand why it is downloading all the time as the file is already on the file-system. this could be improved.

Definitely! You're the only one that's reported this so far, so I'm not sure how I can reproduce this.

disk91 commented 3 years ago

Maybe I'm the only one running a RAK7240 but, so far, I've reproduced it on all the RAKs I've deployed (2 out of 2). This little bug cost me 20€ of data in half a day. I'm happy to have discovered it before receiving the invoice ;)

madninja commented 3 years ago

As am I.. thankfully you're responsible for your own actions :-)

disk91 commented 3 years ago

This is just to help you understand that if the upgrade mechanism has a bug that creates a download loop consuming about 100MB an hour, on a product made to run on remote hotspots over 4G connectivity, the impact for users is not just a bug we solve with a restart ... It is also a really huge invoice at the end of the month; here we are talking about thousands of euros (K€) with a standard B2B 4G plan.

So: NO, I'm not the only one with this problem. It is potentially related to the RAK7240, but any user with a RAK7240 is affected.

As a consequence, we need to solve this issue. So let me know which tests I can run and what technical questions you have.

madninja commented 3 years ago

Of course. No one said there wasn’t a bug in the alpha software. Just that during alpha there will be bugs and that you take a risk running alpha software. I know you are ok running a DIY setup with non-released software, but others may not be, and they should wait for production light hotspots with support from vendors.

You can help by identifying which partition does have space, and by finding the opkg logs for the auto-install failure.

If you can’t, maybe having an ssh-able unit available for us to get into would help us diagnose the issue.

disk91 commented 3 years ago

Available space in partitions:

Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.4M      2.0M      1.4M  59% /
/dev/root                10.8M     10.8M         0 100% /rom
tmpfs                    61.7M      4.3M     57.4M   7% /tmp
/dev/mtdblock7            3.4M      2.0M      1.4M  59% /overlay
overlayfs:/overlay        3.4M      2.0M      1.4M  59% /
tmpfs                   512.0K         0    512.0K   0% /dev
/dev/mmcblk0p1           14.8G      1.4M     14.8G   0% /mnt/mmcblk0p1

I'll dig for opkg logs on the next update; there is no history on this. Due to bugs I had to switch the primary hotspot to the hosted approach, and my second test unit is not remotely accessible.
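
For readers who want to answer the "which partition has space" question programmatically, here is a minimal Python sketch that parses the `df -h` table above and lists the mount points with at least a given amount of free space. The 4 MiB threshold is a placeholder, not the real package size (the thread only says the binary is larger than the free space on the root filesystem), and the helper names are hypothetical.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Table copied from the comment above.
DF_OUTPUT = """\
Filesystem                Size      Used Available Use% Mounted on
rootfs                    3.4M      2.0M      1.4M  59% /
/dev/root                10.8M     10.8M         0 100% /rom
tmpfs                    61.7M      4.3M     57.4M   7% /tmp
/dev/mtdblock7            3.4M      2.0M      1.4M  59% /overlay
overlayfs:/overlay        3.4M      2.0M      1.4M  59% /
tmpfs                   512.0K         0    512.0K   0% /dev
/dev/mmcblk0p1           14.8G      1.4M     14.8G   0% /mnt/mmcblk0p1
"""


def parse_size(text):
    """Convert a `df -h` size such as '3.4M', '512.0K' or '0' to bytes."""
    text = text.strip()
    if text and text[-1].upper() in UNITS:
        return int(float(text[:-1]) * UNITS[text[-1].upper()])
    return int(float(text))


def parse_df(output):
    """Return a list of (mount_point, available_bytes) from `df -h` output."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        rows.append((" ".join(fields[5:]), parse_size(fields[3])))
    return rows


def mounts_with_room(output, needed_bytes):
    """Mount points whose available space is at least `needed_bytes`."""
    return [mount for mount, avail in parse_df(output) if avail >= needed_bytes]


NEEDED = 4 * 1024**2  # placeholder threshold, NOT the actual package size
print(mounts_with_room(DF_OUTPUT, NEEDED))
```

Having room in /tmp or on the SD card does not by itself help, since the thread points to the root filesystem as the limiting factor during opkg installation.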

madninja commented 3 years ago

ok,

  1. unfortunately opkg uses rootfs/overlay to store its state and the executable is installed there too. 1.4M available is not enough for it to do its job. I'm hoping @jmarcelino can help with a new firmware release there

  2. I'll add some checking to ensure that the same update isn't downloaded over and over again to avoid at least using up a lot of data repeatedly downloading the same installer.

  3. As a workaround you can add an

[update] 
enabled=false

section to the end of settings.toml to turn off the self-updater.
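
The check described in point 2 could look roughly like the following. This is a Python sketch of the general idea only, not the gateway-rs implementation (which is written in Rust and may work differently): remember which release failed to install, and skip the download when that same release is offered again or when the package file is already on disk. The class name, the state file, the package file name, and the `1.0.0-alpha.19` version string are all hypothetical.

```python
import tempfile
from pathlib import Path


class UpdateGuard:
    """Remembers which release failed to install so it is not fetched again."""

    def __init__(self, state_file):
        self.state_file = Path(state_file)

    def failed_version(self):
        """Return the last release recorded as failed, or None."""
        try:
            return self.state_file.read_text().strip() or None
        except FileNotFoundError:
            return None

    def mark_failed(self, version):
        """Record that installing `version` failed."""
        self.state_file.write_text(version + "\n")

    def should_download(self, version, package_path):
        if self.failed_version() == version:
            return False  # same release already failed to install
        if Path(package_path).exists():
            return False  # package is already on disk, reuse it
        return True


# Usage with a throwaway directory standing in for /tmp.
with tempfile.TemporaryDirectory() as tmp:
    guard = UpdateGuard(Path(tmp) / "failed_update")
    pkg = Path(tmp) / "helium-gateway.ipk"  # hypothetical file name
    first = guard.should_download("1.0.0-alpha.18", pkg)
    guard.mark_failed("1.0.0-alpha.18")
    retry = guard.should_download("1.0.0-alpha.18", pkg)
    newer = guard.should_download("1.0.0-alpha.19", pkg)
    print(first, retry, newer)
```

The state file has to survive process restarts for this to break a crash loop, so where it is stored is a design decision in its own right.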

disk91 commented 3 years ago

What I don't understand is: when I'm running opkg manually from /tmp it works. Thank you for the workaround. Basically, it is also a safe way to manage devices deployed in production against unpredictable behavior during the alpha phase.

madninja commented 3 years ago

What I don't understand is : when I'm running opkg manually from /tmp it works.

Yeah I don't get that either. We need to get to your installation logs to see what is going on there. You could uninstall and install the previous version manually to see what happens during the auto update cycle

Thank you for the workaround. Basically, it is also a safe way to manage devices deployed in production against unpredictable behavior during the alpha phase.

Running the alpha channel gets you the update feature, yes.. using it in production will be risky. But even when we get to a production channel, you (or other makers) may still want to manage updates with docker images or specific bundled firmware releases.

disk91 commented 3 years ago

Yeah I don't get that either. We need to get to your installation logs to see what is going on there. You could uninstall and install the previous version manually to see what happens during the auto update cycle

I will do that yes (next week)

madninja commented 2 years ago

With off-chain packet accounting, hotspots will no longer go through the offer/purchase flow, so the data rate will be a lot lower. It will always be higher than straight UDP frames from the packet forwarder, since gateway-rs holds packets if it can't deliver them (which then adds hold_time as a field) and signs each packet to prove its source.