sparkfun / SparkFun_u-blox_GNSS_Arduino_Library

An Arduino library which allows you to communicate seamlessly with the full range of u-blox GNSS modules

GNSS Logger Logging Failure #137

Closed derekpickell closed 2 years ago

derekpickell commented 2 years ago

Hi all. I've encountered an issue where my Artemis - ZED-F9P system will unexpectedly and permanently freeze/fail while logging RAWX/SFRBX messages. It has been incredibly hard to reproduce this failure (it occurred 16 hours into one logging session and 12 hours into a second), and I'm having a hard time narrowing down the root cause, so I figured I'd ask some of the experts here.

Steps to reproduce

Unfortunately, the first time this issue appeared I was not outputting any debug messages to serial. However, the behavior is qualitatively similar to this issue https://github.com/sparkfun/OpenLog_Artemis_GNSS_Logger/issues/8 in that the STAT LED was no longer flashing but the power LED was on. I also had all constellations enabled and NavigationFrequency = 1, whereas in the past I only ever used GPS L1/L2. The second time around I output "important GNSS messages", which read "checkUbloxI2C: I2C error: endTransmission returned 4" ad infinitum.

As I mentioned, it's been hard to debug because of the difficulty in reproducing the issue. There are quite a few variables I'd like to test (number of constellations, I2C clock speed, reverting to GNSS library version 2.0.15, eliminating load on the Artemis by disabling SPI/Serial streams)...

Thanks!

PaulZC commented 2 years ago

Hi Derek (@derekpickell ),

Thanks for reporting. I will try to replicate...

Which physical board are you using? The MicroMod Data Logging Carrier Board with Artemis Processor Board?

Which version of SdFat are you using? How is your SD card formatted (FAT32 or exFAT)?

Do you need to use 400kHz I2C? Could you make do with 100kHz?

The other thing to throw into the mix is the Apollo3 core version. I will give v2.2.1 a try first.

With Apollo3 v1.2.1, error 4 is "other error" (returned from am_hal_iom_blocking_transfer):

    case AM_HAL_STATUS_INVALID_OPERATION:
    case AM_HAL_STATUS_INVALID_ARG:
    case AM_HAL_STATUS_INVALID_HANDLE:
    default:
        return 4;

That could be tricky to diagnose, but I'll give it a go...
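For reference, here is a minimal sketch of how the endTransmission return value could be turned into a readable message. It assumes the core follows the return codes listed in the standard Arduino Wire reference (0 to 5); individual cores may differ, and `describeEndTransmission` is a hypothetical helper, not part of this library or the Apollo3 core.

```cpp
#include <string>

// Return codes of Wire.endTransmission() as listed in the Arduino Wire reference.
// describeEndTransmission is a hypothetical helper for illustration only.
std::string describeEndTransmission(int code) {
  switch (code) {
    case 0: return "success";
    case 1: return "data too long to fit in transmit buffer";
    case 2: return "received NACK on transmit of address";
    case 3: return "received NACK on transmit of data";
    case 4: return "other error";
    case 5: return "timeout";
    default: return "unknown code";
  }
}
```

Codes 2 and 3 point at the device not acknowledging, whereas code 4 is a catch-all, which is why it says little about the underlying cause.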

Best wishes, Paul

derekpickell commented 2 years ago

Hi @PaulZC,

Thanks for the fast reply. To answer your questions:

  1. I am using the Artemis Module with a ZED-F9 on a custom PCB, but similar architecture to the MMDLCB + Artemis combo.
  2. SdFat v2.1.0 with 32GB Sandisk extreme formatted to FAT32
  3. 100kHz should theoretically be OK given my 1Hz nav rate. It's on my list of "variables" to try to isolate and tweak to see if there is any improvement.

PaulZC commented 2 years ago

Hi Derek,

Thanks for this. I will try and get as close as I can with the hardware I have available.

I hate to say it, but this could be down to a hardware glitch…

I’ll let you know what I find.

All the best, Paul

PaulZC commented 2 years ago

Hi Derek,

I'm running a test using the attached code (see zip file below).

I'm using: v2.2.9 of this library; Apollo3 v2.2.1 (latest); SdFat 2.1.2 (latest); 16GB SanDisk "EDGE" card - FAT32 - freshly formatted using the SD Association formatter; MicroMod Data Logging Carrier Board; MicroMod Artemis Processor Board; ZED-F9P GPS-RTK2 connected via Qwiic (100kHz, no pull-ups); a dual-band antenna with a good but not perfect view of the sky. The ZED is running F9 HPG 1.30 (latest) - protocol version 27.30.

I'm logging around 2-3KBytes/sec:

image
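As a rough sanity check on whether 100kHz I2C can carry this data rate, the sketch below estimates an upper bound on bus throughput. It assumes 9 clock cycles per byte (8 data bits plus the ACK bit) and ignores start/stop conditions, address bytes, clock stretching and polling gaps, so real throughput will be lower. The function names are illustrative only.

```cpp
#include <cstdint>

// Upper bound on I2C payload throughput: each byte costs 8 data bits plus 1 ACK bit.
// Start/stop conditions, address bytes and clock stretching are ignored (assumption).
constexpr double maxI2CBytesPerSecond(uint32_t clockHz) {
  return static_cast<double>(clockHz) / 9.0;
}

// Fraction of that upper bound consumed by a given logging rate.
constexpr double busUtilisation(double bytesPerSecond, uint32_t clockHz) {
  return bytesPerSecond / maxI2CBytesPerSecond(clockHz);
}
```

Under these assumptions, 100kHz gives at most about 11 kBytes/sec, so 2-3 kBytes/sec would occupy roughly 18-27% of the ideal bus capacity, which leaves headroom but not a huge amount once protocol overhead is included.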

I'll let you know how it goes. I'll leave it for ~36 hours, unless I see it crash before then.

All the best, Paul

DataLoggingExample4_RXM_without_Callbacks_SdFat.zip

PaulZC commented 2 years ago

14 hours in and it is still chugging along nicely...

Note to self:

I'm not quite using a vanilla copy of Apollo3 v2.2.1. My copy includes paulvha's SPI end fix:

In libraries/SPI/src/SPI.cpp change end() to:

    void arduino::MbedSPI::end() {
      if (dev) {
        delete dev;
        dev = NULL;  // clear the pointer so a repeated end() does not delete it twice
      }
    }

(The dev = NULL is important.)
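To illustrate why clearing the pointer matters, here is a small self-contained sketch of the same pattern. `Device` and `Port` are hypothetical stand-ins, not the real MbedSPI classes, and the `begin()` logic (allocate only when the pointer is null) is an assumption about how such a class is typically written.

```cpp
// Hypothetical stand-in for the underlying SPI device; counts live instances.
struct Device {
  static int liveCount;  // number of Device objects currently allocated
  Device() { ++liveCount; }
  ~Device() { --liveCount; }
};
int Device::liveCount = 0;

// Hypothetical stand-in for the SPI wrapper class.
class Port {
public:
  void begin() {
    if (dev == nullptr) {  // only allocate when no device exists
      dev = new Device();
    }
  }
  void end() {
    if (dev) {
      delete dev;
      dev = nullptr;  // without this, a second end() would delete freed memory
    }
  }
  bool isOpen() const { return dev != nullptr; }

private:
  Device *dev = nullptr;
};
```

With the pointer cleared, calling `end()` twice is harmless and a later `begin()` allocates a fresh device. If the pointer were left dangling, the second `end()` would be a double delete and `begin()` would skip allocation and keep using freed memory.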

PaulZC commented 2 years ago

Hi Derek (@derekpickell ),

No signs of badness with this test...

image

image

I re-started the test once, after ~100kBytes, which is why the "bytes written" doesn't quite match. The logged data is completely clean.

I'll give 400kHz a try. My money's on that being the cause...

Best wishes, Paul

PaulZC commented 2 years ago

Hi Derek (@derekpickell ),

Sorry. No clues here. I left the 400kHz test running for almost 48 hours and the data is completely clean:

image

image

Just to summarize, I was using:

I have of course seen I2C badness in the past, especially on Artemis, especially at 400kHz, especially with pull-ups enabled. I'll try a quick test with the pull-ups enabled just to see if I can replicate your issue.

It seems more likely that your issue is caused by the Apollo3 core, not this library. But I'm happy to try to help you debug the issue - if I can.

All the best, Paul

PaulZC commented 2 years ago

OK. With the Artemis pull-ups enabled, the ZED board pull-ups disabled, I see bus errors at 400kHz:

image

At 100kHz, there are fewer checksum errors, but they are still present:

image
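For context on what a checksum error means here: UBX frames end with a two-byte 8-bit Fletcher checksum computed over the class, ID, length and payload bytes (the 0xB5 0x62 sync bytes are excluded). The sketch below shows that check in isolation. The function and struct names are hypothetical and this is not the library's own implementation. It also does not cross-check the length field against the frame size.

```cpp
#include <cstddef>
#include <cstdint>

struct UbxChecksum {
  uint8_t ckA;
  uint8_t ckB;
};

// 8-bit Fletcher checksum over class, ID, length and payload bytes.
UbxChecksum ubxChecksum(const uint8_t *data, size_t len) {
  UbxChecksum ck{0, 0};
  for (size_t i = 0; i < len; ++i) {
    ck.ckA = static_cast<uint8_t>(ck.ckA + data[i]);
    ck.ckB = static_cast<uint8_t>(ck.ckB + ck.ckA);
  }
  return ck;
}

// frame points at the first sync byte; frameLen is the full frame length,
// including the two sync bytes and the two checksum bytes.
bool ubxFrameIsValid(const uint8_t *frame, size_t frameLen) {
  if (frameLen < 8 || frame[0] != 0xB5 || frame[1] != 0x62) {
    return false;
  }
  const UbxChecksum ck = ubxChecksum(frame + 2, frameLen - 4);
  return ck.ckA == frame[frameLen - 2] && ck.ckB == frame[frameLen - 1];
}
```

For example, the UBX-MON-VER poll frame B5 62 0A 04 00 00 0E 34 passes this check, while the same frame with any single bit flipped in the covered bytes fails it, which is how a corrupted I2C transfer shows up as a checksum error.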

PaulZC commented 2 years ago

I've downgraded the ZED to HPG 1.13. I'll do a test using that, just to see if I can replicate your I2C endTransmission error.

PaulZC commented 2 years ago

Ah ha! We might have a winner!!

image

image

For this test, I was using:

The endTransmission errors appeared only an hour or so into the test. I didn't see exactly when it happened.

I don't know if this is the smoking gun we're looking for, but it is certainly mighty suspicious! Logging SFRBX and RAWX at 1Hz with HPG 1.30 at 400kHz I2C for almost 48 hours produced no errors. After switching to logging RAWX only at 2Hz with HPG 1.13 at 400kHz, I got a failure within approximately an hour... The HPG version appears critical.

Can you please confirm which version of HPG your ZED is running? This tutorial may help.

Best wishes, Paul

derekpickell commented 2 years ago

Wow, great find!! That certainly seems quite suspicious. Looking at my SparkFun ZED (SMA) boards, I see HPG 1.12, while my custom modules have HPG 1.13. I'll update all to HPG 1.30 and run a trial as soon as I wrap up my current test, which has been running at 100kHz for 24+ hrs now without issues. So something seems up with the 400kHz + HPG 1.13 combo... -Derek

derekpickell commented 2 years ago

My combo of 100kHz I2C and HPG 1.30 has been running on 2 boards for ~48 hours now, and both are happily blinking away. I feel comfortable saying that the firmware update fixed the issue, although the u-blox release notes are very vague about what I2C improvements were made under the hood, and it is strange that HPG 1.12 doesn't present any problems either. Thanks for the help!

PaulZC commented 2 years ago

No problem Derek - glad that's working for you!

Please close this issue once you're happy.

Very best wishes, Paul