thinger-io / Arduino-Library

IOTMP Arduino Library for connecting devices to thinger.io #IoT
https://thinger.io
MIT License

"SOCKET timeouts" causing lockups of an entire device when communicating with backend. #48

Open chrisbloomfieldcollie opened 1 year ago

chrisbloomfieldcollie commented 1 year ago

I am currently using the thinger library on about 20 MKR NB devices that connect over LTE-M or NB-IoT, and that number is about to jump to 80. For that reason I desperately need a solution to the problem I am having.

Basically, there seem to be three scenarios where the thinger library causes my devices to lock up, and the only way to recover them is to use a watchdog timer that resets the device when a lockup is detected. This has been an OK solution until now, but it is happening so frequently (once per hour per device on average) that it affects the battery life of my devices, as they need to go through the startup sequence every time.

The three errors I get are "Writing bytes [FAIL]", "[_SOCKET] cannot read from socket!" and "[_SOCKET] Timeout!". All of them cause my device to lock up indefinitely. Screenshots are attached.

The SOCKET timeouts seem to happen more frequently on some afternoons. It seems that there are then more people in our office building, and potentially in the buildings around us (more devices connecting to the network?).

It has been a problem the whole time I have been using this library with this device but I solved it temporarily with a watchdog reset.

Someone else also seems to have had a similar issue when using the GSM version on the MKR https://community.thinger.io/t/mkr-gsm-1400-losing-connection-to-thinger-io/2991

Can someone help with this issue ASAP? It is causing us a lot of downstream problems with our product.

Thanks!

[Seven screenshots attached, dated 2023-04-06 to 2023-04-09]

colinvdspek commented 12 months ago

I have some of these NB boards too and have the same issue! Would love to know how to fix this.

alvarolb commented 12 months ago

Hi, do you need your devices to be permanently connected to the platform?

NB devices usually sleep most of the time, then wake up, connect to the internet, and transmit data, especially if they are battery powered.

Building reliable NB-IOT solutions requires some more engineering according to the specific use case, and probably the general-purpose Arduino library for thinger.io is not the best approach here.

chrisbloomfieldcollie commented 12 months ago

Hi @alvarolb

Thanks for your reply.

Yes, we do indeed need to be connected to the platform permanently. We realise the NB device is not a good system to use long term, but we chose it so that we could get our system up and running as fast as possible and iterate quickly from there. The reliability doesn't need to be perfect, but right now we depend on getting our current solution working as well as we can so that we can demo it for an investment round. For that reason we would love to find a viable workaround or solution.

The requirements are:

Currently the problem is that the device loses connection so regularly, and needs to be reset by the watchdog timer so many times, that it is impractical and burns extra battery. Could you point me in a rough direction to try to fix this lockup? For example, how could I get it to try again when I get this socket fail error?
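One rough direction for the "try again" question, as a sketch only: wrap the failing operation in a bounded retry loop with backoff, so that a socket failure leads to a limited number of reconnect attempts rather than an immediate watchdog reset. The names `retryWithBackoff`, `RetryResult`, `attempt`, and `sleepMs` are hypothetical and are not part of the thinger.io or MKRNB libraries. The sketch also assumes the failing call actually returns `false`; if the call blocks forever inside the modem library, a watchdog is still needed.

```cpp
#include <cstdint>
#include <functional>

// Outcome of a bounded retry loop (hypothetical helper type).
struct RetryResult {
    bool ok;                 // true if one attempt succeeded
    int attempts;            // number of attempts actually made
    std::uint32_t waitedMs;  // total time spent waiting between attempts
};

// Calls attempt() up to maxAttempts times. After each failure it waits,
// doubling the delay each time and capping it at maxDelayMs.
// attempt and sleepMs are supplied by the caller (assumption: attempt
// returns false on failure instead of blocking).
RetryResult retryWithBackoff(const std::function<bool()>& attempt,
                             const std::function<void(std::uint32_t)>& sleepMs,
                             int maxAttempts,
                             std::uint32_t initialDelayMs,
                             std::uint32_t maxDelayMs) {
    RetryResult result{false, 0, 0};
    std::uint32_t delay = initialDelayMs;
    for (int i = 0; i < maxAttempts; ++i) {
        ++result.attempts;
        if (attempt()) {
            result.ok = true;
            return result;
        }
        if (i + 1 < maxAttempts) {  // no point waiting after the last try
            sleepMs(delay);
            result.waitedMs += delay;
            // Double the delay without overflowing, capped at maxDelayMs.
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
    }
    return result;
}
```

In a sketch, `attempt` would wrap whatever reconnect step is being retried and `sleepMs` would wrap `delay()`; only when `ok` comes back `false` would the device fall back to a full reset.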

I haven't tried to monitor the connection with AT+CEREG yet but I will try that. Could I then easily trigger a reconnect if I detect it has been lost?
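As a starting point for monitoring registration with AT+CEREG, here is a minimal sketch of parsing the reply to the read command `AT+CEREG?`, which has the form `+CEREG: <n>,<stat>[,...]`. A `<stat>` of 1 means registered on the home network and 5 means registered while roaming. The helper names `parseCeregStat` and `isRegistered` are hypothetical, and how the response string is read from the modem is deliberately left out because it depends on the modem library in use.

```cpp
#include <cstddef>
#include <string>

// Returns the <stat> field of a "+CEREG: <n>,<stat>[,...]" read response,
// or -1 if the text does not contain a parsable +CEREG line.
// Note: this handles the read-command form only, not the unsolicited
// "+CEREG: <stat>" form, which has no <n> field.
int parseCeregStat(const std::string& response) {
    const std::string tag = "+CEREG:";
    std::size_t pos = response.find(tag);
    if (pos == std::string::npos) return -1;

    // <stat> is the number right after the first comma following the tag.
    std::size_t comma = response.find(',', pos + tag.size());
    if (comma == std::string::npos) return -1;

    std::size_t i = comma + 1;
    while (i < response.size() && response[i] == ' ') ++i;
    if (i >= response.size() || response[i] < '0' || response[i] > '9') return -1;

    int stat = 0;
    while (i < response.size() && response[i] >= '0' && response[i] <= '9') {
        stat = stat * 10 + (response[i] - '0');
        ++i;
    }
    return stat;
}

// 1 = registered (home network), 5 = registered (roaming).
bool isRegistered(int stat) {
    return stat == 1 || stat == 5;
}
```

If `isRegistered` returns `false` for several consecutive polls, that could be used as the trigger for a reconnect attempt before resorting to a reset.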

We tested the library without peripherals a while ago, but will try testing again in the same scenario where we are getting these issues.

I have a basic RTOS in place yes. There are not a crazy amount of tasks although the GPS task can take up to 100ms. What is the maximum time you would recommend between handle() runs?

All of our NB SARA chips get the latest firmware version (at least I think, it's L0.0.00.00.05.08,A.02.04) before we use them.

Thanks in advance!

alvarolb commented 12 months ago

Please review the firmware version, as I think the latest is 05.12.

I have read about many issues regarding the MKRNB1500's stability, especially the modem hanging. In the meantime, I have released a new Arduino Library, 2.26.0, to try to improve connection stability. It has not been tested properly, so try it and let me know if it improves anything.

> I have a basic RTOS in place yes. There are not a crazy amount of tasks although the GPS task can take up to 100ms. What is the maximum time you would recommend between handle() runs?

100ms will not be a problem. You can call it at practically any rate under a minute, but longer intervals make the device less responsive to API requests. For example, if you call it every 5 seconds, you can expect a delay of up to 5 seconds when calling a device function.
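The trade-off above (a longer interval between handle() calls means a proportionally longer worst-case response delay) can be sketched with a small non-blocking interval check. `isDue` is a hypothetical helper, not part of the thinger.io library; it assumes a 32-bit millisecond counter such as the one Arduino's `millis()` returns.

```cpp
#include <cstdint>

// Returns true when at least intervalMs have elapsed since lastRunMs.
// Unsigned subtraction keeps the comparison correct when the 32-bit
// millisecond counter wraps around (roughly every 49.7 days).
bool isDue(std::uint32_t nowMs, std::uint32_t lastRunMs, std::uint32_t intervalMs) {
    return static_cast<std::uint32_t>(nowMs - lastRunMs) >= intervalMs;
}
```

In a task loop this would be used as `if (isDue(millis(), lastRun, interval)) { lastRun = millis(); /* call handle() here */ }`, with `interval` chosen as the largest API response delay the application can tolerate.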

chrisbloomfieldcollie commented 11 months ago

Hi @alvarolb

Thanks for the info. We have tried upgrading the firmware to 05.12 (which was a mission), but it does not solve the issue.

The new version of which library exactly? How do I find it?

Thanks!

alvarolb commented 11 months ago

Hi, I released a new Arduino library for Thinger.io with version 2.26.0. Update it via Arduino IDE.

chrisbloomfieldcollie commented 11 months ago

Hi @alvarolb

Just to update you, we have updated to the latest library version and we are still getting the same errors. Is there anything you could point us to so that we could try to get to the bottom of this issue ourselves?

Thanks in advance.

Chris

alvarolb commented 11 months ago

Hi @chrisbloomfieldcollie,

I have an MKRNB1500 here and will test it today. Just curious, what is your network provider?

alvarolb commented 11 months ago

Just received an MKRNB1500 and have it connected with a basic sketch. I'll update you on its performance. Have you experimented with different SIM cards or antennas?

On another note, I've come across some issues related to the MKRNB1500, with numerous customers reporting errors, firmware problems, and hangs. It's concerning that Arduino doesn't seem to maintain or support this hardware, and there are no responses on their forums.

At thinger.io, we're using custom NB-IoT hardware based on the ESP32 and Quectel BC660K for two different projects. Is there a specific reason you need to use the MKRNB1500? Perhaps we could explore alternative options.

chrisbloomfieldcollie commented 11 months ago

Hi @alvarolb

Our network provider is KPN here in The Netherlands. We experimented with Tele2 but found KPN to be more reliable. We haven't experimented with antennas yet. Is there anything you would recommend?

We are aware of the issues with the MKRNB. We also have problems with the device locking up, and we have built circuitry into our device to perform a hard reset of the SARA module when we detect this, which seems to fix it. The problem outlined in this thread, though, I am reasonably certain is a software issue on the Arduino side (and I think in the thinger library), as it is fixed by just a software reset of the Arduino.

We chose the MKRNB systems for their speed of development for the particular prototypes we are building. We need these to work until the end of October so we can get investment, and then we will look for more reliable alternatives, so we would be happy to discuss your solution then.

Curious on the results from your testing with the MKRNB?

Chris

alvarolb commented 11 months ago

I don't think it is a problem with the Thinger.io Arduino library, but rather a bad implementation in the MKRNB libraries, which are responsible for talking to the modem via AT commands. You can run your own tests: just create a simple sketch with another protocol, like MQTT, and check how it behaves. Looking at the number of issues on the forums with the MKRNB1500 (from people who are not using thinger.io), I am fairly sure the library is stuck somewhere else, waiting for a response from the modem or something similar.

In my first attempt, the MKRNB1500 stayed connected for 8 hours, then disconnected. I will keep checking it over the next few days.