sparkfun / SparkFun_u-blox_GNSS_Arduino_Library

An Arduino library which allows you to communicate seamlessly with the full range of u-blox GNSS modules

Communication fails with multiple messages #199

Closed Jakub-Nagy closed 8 months ago

Jakub-Nagy commented 1 year ago

I have been dealing with this issue for over a month. We have a device whose firmware cannot be updated once it is deployed (a stratospheric probe), and the flight conditions are difficult to reproduce in the lab. But we know this issue is caused by firmware, because previously the same hardware setup worked without an issue. We are using a ZOE-M8Q receiver.

The last working build used:

On the application layer we have a system which diagnoses the GNSS state. With this setup we had a 95% fix rate with a 1% init failure rate.

Because we wanted to have additional information about satellites (number of satellites in view), we tried:

However we configure the device, and even when it works during ground tests, every one of these configurations produces around 50-90% init fails (when the receiver.begin() function fails repeatedly) and communication fails (when the init succeeds but the receiver then stops responding for 5 seconds or more).

BTW the PVT message is 92 bytes and the NAV-SAT message is 428 bytes at 35 satellites. That should be completely fine over UART at 9600 baud at a 1 Hz rate, shouldn't it?

Honestly at this point I'm quite desperate as to where this issue is rooted and I'm thinking the only solution might be reverting to the original firmware with fewer features, throwing away around 100 hours of work.

Do you have any idea as to what might cause this and/or how to approach fixing it? Thanks!

PaulZC commented 1 year ago

Hi Jakub (@Jakub-Nagy ),

I suspect you are overloading the ZOE's UART port, asking it to output more data than the baud rate can support. 9600 baud should allow you to transfer approximately 960 bytes per second. But that does not include any overheads from polling.

Could you also be overflowing your microcontroller's serial RX buffer? Please see issue #198

The RXM-RAWX messages can be very large. They are regularly over 2kBytes on the dual-band ZED-F9P. On the ZOE, I guess they could still be approximately 1kByte?

Please also check that you have disabled the NMEA messages correctly. If those are still enabled, you will be overloading the UART port.

Are you setting the dynamic model correctly? For stratospheric work you need to use one of the "Airborne" models (1g / 2g / 4g depending on your needs).

Please tell me exactly which messages you are enabling and at what rate (navigation rate and baud rate) and I will try to replicate your issue.

Best wishes, Paul

Jakub-Nagy commented 1 year ago

9600 baud should allow you to transfer approximately 960 bytes per second.

Shouldn't it be 9600 / 8 = 1200 bytes per second?

But that does not include any overheads from polling.

That's why I choose polling over autoreporting. If the GNSS reports let's say 4 messages per second at 9600 baud the communication simply fails and the MCU can't process anything. But if I try to poll at a rate of 4 messages per second I still successfully get 1 or 2 messages and the other 2 fail, which is highly preferable.

Could you also be overflowing your microcontroller's serial RX buffer? Please see issue https://github.com/sparkfun/SparkFun_u-blox_GNSS_Arduino_Library/issues/198

Good call. I'm using STM architecture and as far as I know there's no buffer size unless using FIFO. But I investigated and even found out there was a memory leak causing halting of the CPU, which is even more critical than the GNSS not functioning. Calling setPacketCfgPayloadSize() and setting it to UBX_NAV_SAT_MAX_LEN fixed the issue in lab conditions.

I'm still not sure why this could have happened, since the original setting of UBX_RXM_MEASX_MAX_LEN = 2252 bytes would still allow for 187 sat blocks ((2252 - 8) / 12). But in the library UBX_NAV_SAT_MAX_BLOCKS = 255. Do you know why? I couldn't find this in the documentation. It is a 72 channel receiver and in testing the code failed even if the actual (reported) numSvs = 18.
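For reference, here is a minimal sketch of the NAV-SAT size arithmetic. It assumes the UBX-NAV-SAT payload layout of an 8-byte header followed by one 12-byte block per satellite, which matches the 428 bytes at 35 satellites mentioned earlier. The numSvs field is a single unsigned byte, so 255 blocks is the largest message the protocol can describe; that is presumably why the library sizes for 255, although the thread does not confirm it. The function and constant names are made up for illustration and are not library identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Assumed UBX-NAV-SAT payload layout: fixed header, then one block per satellite.
const uint16_t kNavSatHeaderLen = 8;  // iTOW, version, numSvs, reserved
const uint16_t kNavSatBlockLen = 12;  // one repeated block per satellite

// Payload length in bytes for a given satellite count.
// numSvs is a single byte in the message, so it can never exceed 255.
inline uint16_t navSatPayloadLen(uint8_t numSvs)
{
  return static_cast<uint16_t>(kNavSatHeaderLen + kNavSatBlockLen * numSvs);
}

// True if a payload buffer of bufferLen bytes can hold a NAV-SAT
// message reporting numSvs satellites.
inline bool navSatFits(uint16_t bufferLen, uint8_t numSvs)
{
  return navSatPayloadLen(numSvs) <= bufferLen;
}
```

With these assumptions, 18 satellites need 224 bytes of payload, 35 need 428, and the 255-block worst case needs 3068.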

Please also check that you have disabled the NMEA messages correctly. If those are still enabled, you will be overloading the UART port. Are you setting the dynamic model correctly? For stratospheric work you need to use one of the "Airborne" models (1g / 2g / 4g depending on your needs).

Also good points but I had this set correctly.

Anyways it seems that it is working now in lab conditions (but it was many times before). I will reply again when I test it in the real (flight) conditions.

PaulZC commented 1 year ago

Hi Jakub,

To convert baud rate to bytes per second, you need to divide by 10: 8 data bits plus the start and stop bits. My rule-of-thumb is to try to never run the bus at more than ~80% capacity. The u-blox modules are very clever and will gracefully drop the message rate if it calculates that you are asking for more data than the interface can support.
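A rough sketch of that budget calculation, in case it is useful. It assumes 8N1 framing (10 bits on the wire per byte) and adds the 8 bytes of UBX framing (two sync characters, class, ID, two length bytes, two checksum bytes) to each payload. The helper names are hypothetical, and polling overhead and any NMEA output are not included.

```cpp
#include <cassert>
#include <cstdint>

// Each UBX frame wraps the payload in 8 extra bytes:
// 2 sync chars, class, ID, 2 length bytes, 2 checksum bytes.
const uint32_t kUbxFrameOverhead = 8;

// Bytes per second an 8N1 UART can carry: start bit + 8 data bits + stop bit.
inline uint32_t uartBytesPerSecond(uint32_t baud)
{
  return baud / 10;
}

// Bytes per second produced by one UBX message at a given output rate.
inline uint32_t ubxBytesPerSecond(uint32_t payloadLen, uint32_t rateHz)
{
  return (payloadLen + kUbxFrameOverhead) * rateHz;
}

// Bus load as an integer percentage, rounded down.
inline uint32_t busLoadPercent(uint32_t bytesPerSecond, uint32_t baud)
{
  assert(baud >= 10); // avoid dividing by zero capacity
  return (bytesPerSecond * 100) / uartBytesPerSecond(baud);
}
```

For the numbers in this thread, a 92-byte PVT plus a 428-byte NAV-SAT at 1 Hz comes to 100 + 436 = 536 bytes per second, or about 55% of the 960 bytes per second available at 9600 baud. The same pair at 2 Hz would already exceed the link.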

You might want to investigate UBX-MON-TXBUF. It contains information about how much data the transmit buffer contains, peak usage, etc. The SparkFun library doesn't support it directly, so you would need to poll it using a custom command, but it might help your diagnosis.
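As an illustration of what such a poll looks like on the wire, here is a small sketch that builds a zero-length UBX poll frame with the standard 8-bit Fletcher checksum. It assumes UBX-MON-TXBUF is class 0x0A, ID 0x08; please check that against the interface description for your module. The function names are hypothetical. As far as I know, the library normally adds the sync characters and checksum for you when you send a custom packet, so this is only meant to show the frame layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed message identifiers for UBX-MON-TXBUF (verify in the protocol spec).
const uint8_t kUbxClassMon = 0x0A;
const uint8_t kUbxMonTxbuf = 0x08;

// 8-bit Fletcher checksum used by UBX, computed over class, ID,
// length and payload (everything between the sync chars and the checksum).
inline void ubxChecksum(const uint8_t *data, size_t len, uint8_t &ckA, uint8_t &ckB)
{
  ckA = 0;
  ckB = 0;
  for (size_t i = 0; i < len; ++i)
  {
    ckA = static_cast<uint8_t>(ckA + data[i]);
    ckB = static_cast<uint8_t>(ckB + ckA);
  }
}

// Build a poll request: a frame with an empty payload asks the
// module to send the current contents of that message.
inline std::array<uint8_t, 8> buildUbxPoll(uint8_t cls, uint8_t id)
{
  std::array<uint8_t, 8> frame = {{0xB5, 0x62, cls, id, 0x00, 0x00, 0x00, 0x00}};
  ubxChecksum(&frame[2], 4, frame[6], frame[7]);
  return frame;
}
```

Calling buildUbxPoll(kUbxClassMon, kUbxMonTxbuf) gives the eight bytes to write to the port; the reply, if the IDs are right, is a MON-TXBUF message with the buffer usage figures.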

We've tried to make sure the library can support dual-band modules like the ZED-F9P as well as single-band modules like the ZOE. The ZED can generate a lot of NAV-SAT or RXM-RAWX data when tracking all four constellations on dual bands... But we've also tried to keep the library running on the original ATmega328P, which has a massive 2K of RAM available... It's a bit of a juggling act. If you have plenty of RAM available, my advice would be to allocate plenty of payload space, just in case.

I'll leave this issue open for now. Please just ping me if / when you need more help. I had a lot of fun in my past role, tracking high-altitude balloon payloads. Back then I was using the MAX-M8Q and transferring the data via Iridium Short Burst Data. Happy to help you if I can.

Best wishes, Paul

Jakub-Nagy commented 1 year ago

Hey Paul

To convert baud rate to bytes per second, you need to divide by 10: 8 data bits plus the start and stop bits. My rule-of-thumb is to try to never run the bus at more than ~80% capacity. The u-blox modules are very clever and will gracefully drop the message rate if it calculates that you are asking for more data than the interface can support.

Okay cool that makes sense.

I'll leave this issue open for now. Please just ping me if / when you need more help. I had a lot of fun in my past role, tracking high-altitude balloon payloads. Back then I was using the MAX-M8Q and transferring the data via Iridium Short Burst Data. Happy to help you if I can.

That sounds fun, thanks!

Everything worked in ground testing without an issue. Sadly we launched the balloon and the firmware halted again. This could be caused by almost anything but I still suspect the GNSS library and/or app layer of it. But I have no idea how to investigate this, especially when it didn't occur during testing. Anyway the GNSS worked fine during the day, but produced more and more INIT_FAIL states after sunset, when the temperature dropped:

Screenshot 2023-06-20 at 21 35 13

(when SIV is 0 there is no FIX)

So it would seem the low temperature causes the communication issues. But the GNSS still occasionally fixes, even at temperatures of -35C. What's more, during lab testing the GNSS produced no communication fails in a freezer at -80C. Of course in this case there is no satellite reception, so the GNSS receiver is not communicating much information.

This is the initialisation sequence:

Screenshot 2023-06-20 at 21 42 44

Note that setting the message format, navrate and dynamic model is done only after MCU reset. If any of the functions returns false, the system reports GNSS INIT_FAIL and shuts off. If there are 2 consecutive init fails, the MCU soft resets (this seems to help in ground testing, but it's more solving the symptom than the core issue).
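The failure handling described above can be sketched roughly as follows. This is not the actual firmware, only an illustration of the policy as stated: any failed init step powers the GNSS off and reports INIT_FAIL, and two init fails in a row trigger a soft reset of the MCU. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// What the application should do after an init attempt.
enum class InitAction
{
  Proceed,      // init succeeded, carry on to wait for a fix
  PowerOffGnss, // report INIT_FAIL and shut the receiver off
  SoftResetMcu  // too many fails in a row, reset the MCU
};

// Tracks consecutive init failures and decides when to reset.
class InitFailTracker
{
public:
  explicit InitFailTracker(uint8_t resetThreshold = 2)
      : threshold_(resetThreshold), consecutiveFails_(0)
  {
    assert(resetThreshold > 0);
  }

  // Feed in the combined result of all init steps (begin, message
  // format, nav rate, dynamic model): true only if every step passed.
  InitAction onInitResult(bool ok)
  {
    if (ok)
    {
      consecutiveFails_ = 0;
      return InitAction::Proceed;
    }
    ++consecutiveFails_;
    if (consecutiveFails_ >= threshold_)
    {
      consecutiveFails_ = 0; // the counter starts again after the reset
      return InitAction::SoftResetMcu;
    }
    return InitAction::PowerOffGnss;
  }

private:
  uint8_t threshold_;
  uint8_t consecutiveFails_;
};
```

Keeping the counter in one place makes it easy to log how often the reset path is taken, which may help separate receiver problems from MCU-side ones.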

At deinit, I always call receiver.end(); in order to avoid any leaks.

Do you see any issues with the init itself? Should I set all parameters each time? Shouldn't they be saved in the flash?

Thanks! Best regards, Jakub

PaulZC commented 1 year ago

Hi Jakub,

I'll take a better look at this in the morning.

Just a quick question about your batteries / power supply:

What batteries are you using? Could the voltage be collapsing at low temperature? How are you powering the ZOE backup power pin? If you are powering it from a small rechargeable cell - like we do on the SparkFun boards - could its voltage be collapsing at low temperature?

I've run into battery-temperature issues over the years. I do like Energizer Ultimate Lithium AA / AAA / PP3 cells. They work well when cold. But the AA / AAA voltage reduces from 1.5V per cell to about 1.1V per cell at -50C. If you are using three cells in series, your 3.3V rail may start to collapse overnight. See section 2.3.7 here: https://arxiv.org/abs/1904.04321

More tomorrow, Paul

Jakub-Nagy commented 1 year ago

Hey Paul, Thanks for the quick reply.

What batteries are you using?

It's a custom low temperature rechargeable Lithium based battery from Grepow. It's charged by a PMIC from solar cells.

Could the voltage be collapsing at low temperature?

I don't think so. The PMIC (AEM10330) has a DCDC buck-boost converter which is keeping the voltage at 3.3V through the whole range of the battery and source voltages. Of course there could be more introduced, but it's not like it should drop below 3V.

How are you powering the ZOE backup power pin?

It's connected directly to the output of the DCDC regulator, meaning it's almost constantly on. The GNSS VCC is gated by a load switch controlled by the MCU.

There is power monitoring included in the FW, meaning the GNSS doesn't startup at all when the battery SoC is below 25% (around 3.7V) and it cuts off the power immediately when battery voltage dips below 3.4V. So it's unlikely that power is the issue.

See section 2.3.7 here: https://arxiv.org/abs/1904.04321

Nice paper, thanks for sharing! We also experimented with various different power sources (primary batteries, supercaps, rechargeable batteries) and these seem to support the rest of the system at night as well, but the GNSS communication is a big issue now.

BTW we had 34 launches, you can see them here, although not all diagnostic data is visualised: https://app.picoballoon.org/

Thanks again, Jakub

Jakub-Nagy commented 1 year ago

I don't want to go too much out of scope here, since I know this is an Arduino library and I just slightly modified it, but the same principles should apply for Arduino code as well.

Anyway I thought sharing my UART config could help:

Screenshot 2023-06-21 at 10 16 08

Each byte is read individually by the library just as in the original Arduino implementation, aka there is no DMA or UART interrupts (perhaps there should be? but I don't know how to integrate that into the flow of this library).

I read that perhaps enabling one bit sampling could help? Or maybe using FIFO?

BTW, the MCU is clocked at 1MHz with an internal oscillator which is calibrated by PLL from an external 32.768kHz crystal with 20ppm tolerance, so timing shouldn't be an issue.

So maybe these details can help investigate the issue. Thanks.

Best, Jakub

PaulZC commented 1 year ago

Hi Jakub,

If your microcontroller crystal frequency drifts with temperature, and (perhaps) the ZOE baud rate drifts with temperature too, errors may start to appear in the serial data due to the different rates. It would be very interesting to monitor the TX from the microcontroller and the TX from the ZOE with an oscilloscope. At cold temperatures, do the baud rates still agree? Maybe you only get errors when the module is outputting longer messages?

If the micro is running at 1MHz internally, and you are oversampling the UART RX 16 times, you are sampling the data at 62500 Hz. That should be perfectly OK for 9600 baud. I don't know what "one bit sampling" is, but perhaps it is worth investigating. Also I don't know how the microcontroller treats timing errors. If it sees a "1" or a "0" which is 15 or 17 samples wide, does it treat that as an error? I'm just guessing, but perhaps reducing the number of oversamples may help?

If the FIFO is the same as a RX Buffer, then I would strongly recommend using that. If you miss or skip a byte part way through a message, you will get checksum errors and the message will be discarded.

I hope this helps, Paul

PaulZC commented 1 year ago

Unless the oversampling works the other way... If the oversampling depends on the baud rate and the baud rate is 9600 and the oversampling is 16, is it sampling at 153600Hz? Is that worryingly close to your 1MHz clock frequency? Something to investigate...
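To put numbers on that, here is a small sketch of the timing arithmetic. It assumes the UART derives its bit timing by dividing a 1 MHz kernel clock by a whole number of clock cycles per bit; a particular peripheral may use a prescaler or a finer divider, so treat this as an estimate. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Kernel clock cycles per bit, rounded to the nearest whole number.
inline uint32_t clocksPerBit(uint32_t kernelClockHz, uint32_t baud)
{
  assert(baud > 0);
  return (kernelClockHz + baud / 2) / baud;
}

// Baud rate actually produced once the divider has been rounded.
inline double actualBaud(uint32_t kernelClockHz, uint32_t baud)
{
  return static_cast<double>(kernelClockHz) / clocksPerBit(kernelClockHz, baud);
}

// Error of the produced baud rate relative to the requested one, in percent.
inline double baudErrorPercent(uint32_t kernelClockHz, uint32_t baud)
{
  return 100.0 * (actualBaud(kernelClockHz, baud) - baud) / baud;
}

// Sample clock needed if every bit is sampled 'oversampling' times.
inline uint32_t sampleClockHz(uint32_t baud, uint32_t oversampling)
{
  return baud * oversampling;
}
```

With a 1 MHz clock and 9600 baud this gives 104 clock cycles per bit, an actual rate of about 9615 baud, and an error of roughly 0.16%. Sixteen samples per bit would need a 153600 Hz sample clock, which is only about 6.5 kernel clock cycles per sample, so there is not much resolution left for the sampling itself.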

Jakub-Nagy commented 1 year ago

Hey Paul,

It would be very interesting to monitor the TX from the microcontroller and the TX from the ZOE with an oscilloscope. At cold temperatures, do the baud rates still agree? Maybe you only get errors when the module is outputting longer messages?

I could try, but as I said there are no problems in lab conditions or in cold temperatures. So I would have to keep the board cold, sample the UART lines with an oscilloscope and provide a satellite signal (either do everything outside or have an external antenna on a long cable).

I don't know what "one bit sampling" is, but perhaps it is worth investigating.

This is what I found:

Enabling UART_ONE_BIT_SAMPLE_ENABLE configures the UART peripheral to sample the received data at a single bit period. This means that the receiver samples the data closer to the edge of each bit, providing a more accurate measurement of the signal.

Unless the oversampling works the other way... If the oversampling depends on the baud rate and the baud rate is 9600 and the oversampling is 16, is it sampling at 153600Hz? Is that worryingly close to your 1MHz clock frequency? Something to investigate...

I think this is closer to how it works, which could cause a problem.

In the end I found out I just wasn't careful enough with the FW. I was debating the value of the OverSampling when in fact it shouldn't be set at all for the LPUART. Oversampling is not available for the low power UART on STM32. So that's my bad, it was added mistakenly during the development process. But I'm almost sure this is not the source of all the issues. Either way I'll test and report back.

I found a different issue. When the balloon is flying over the sea, it looks like this:

Screenshot 2023-06-21 at 17 56 37

What you can notice is when the balloon flies over land, the GNSS fixes precisely, with a time to fix under 5s, 10+ satellites in use and 20+ satellites in view. When it flies over sea, it struggles to fix at all, reporting fix timeout (timeout is at 55s) or when it does fix, there is a horizontal and vertical deviation of kilometres if not tens of kilometres.

Did you see anything like this before? Could this be noise? Reflectance of some signals from the water? Incorrect dynamic model (as far as I know that doesn't relate to land/water)? Spoofing?

This pattern at least looks a bit like spoofing (or I don't know how it could appear randomly):

Screenshot 2023-06-21 at 18 07 59

I'm kinda puzzled as to the cause of this, if you have any insight, let me know. Thanks!

Best, Jakub

PaulZC commented 1 year ago

Hi Jakub,

What type of antenna are you using? Is it a chip antenna and, if so, how is it oriented? Does it have a PCB ground plane?

I have not seen this before. But, if I had to guess, I would say it does look like a signal reflection issue.

I suggest asking about this on the u-blox support portal. It might be something someone has seen before.

Best wishes, Paul

Jakub-Nagy commented 1 year ago

Hi Paul,

What type of antenna are you using? Is it a chip antenna and, if so, how is it oriented? Does it have a PCB ground plane?

We have a 1/4 wire antenna angled 45 degrees to the sky in order to receive sats in the arc between the horizon and zenith. A ground loop in the solar panel serves as the ground plane but it's only about 8cm wide.

I have not seen this before. But, if I had to guess, I would say it does look like a signal reflection issue.

I thought so too. The only thing relating to multipath/reflectance I can find in the M8 receiver description is the UBX-CFG-NAVX5 message and the sigAttenCompMode config. I could also try to force 3D fix and increase the minimum required sats, but that could make the situation even worse.

I suggest asking about this on the u-blox support portal.

Will do.

Best, Jakub

PaulZC commented 8 months ago

Hi Jakub (@Jakub-Nagy ),

I think this issue is stale? Closing... Please reopen if you need more help with this.

Best wishes, Paul