sparkfun / SparkFun_RTK_Firmware

Centimeter precision GPS/GNSS using L1/L2 signals broadcast over Bluetooth SPP (using the ESP32) in an easy to use enclosure.
https://docs.sparkfun.com/SparkFun_RTK_Firmware/
Other
81 stars 46 forks source link

Logging issue : a lot of files and logs drops #469

Closed Stefal closed 1 year ago

Stefal commented 1 year ago

Subject of the issue

We are using some RTK Express to log data when we walk or drive. Instead of one file for each trip, we have a lot of files, drops inside the files, and too much float or single fix.

Your workbench

Steps to reproduce

Expected behavior

Actual behavior

Stefal commented 1 year ago

new test with the firmware v3.0: The test was short, but I didn't have multiple ubx files. However there are some holes in the log image

pobsteta commented 1 year ago

Idem on RTK Facet with firmware 3.2

nseidle commented 1 year ago

@Stefal - We found a fixed a particularly bad bug on May 18th that was causing issues with corrupt log file names and unexpected resets. This bug was introduce on Feb 28 so v3.1 should have been good, while v3.2 and v3.3 have this bug.

Can you share some of the sections of the log where the RTK Fix is lost? Can you tell if the break is due to a system reset or does the system simply lose RTK Fix and then re-gain it?

Eric-FR commented 1 year ago

Same with a RTK Express with 3.2 firmware. Tracks are divided in numerous different files. Sometime, I can observe device reboot. I lost several tracks: while RTK was active and used for Mapillary, I can't find any file on the SD card.

image Foot walk with RTK Express recording NMEA+UBX at 1 Hz.

Stefal commented 1 year ago

Can you share some of the sections of the log where the RTK Fix is lost? Can you tell if the break is due to a system reset or does the system simply lose RTK Fix and then re-gain it?

I will wait for a new release with the various fixes you have mentionned, and redo some tests. I think that parts of my problems are related to the 5hz gnss frequency that the bluetooth stream can't handle. I forgot that log to file and bluetooth settings are the same, contrary to my old logger in which the stream are independant (1 or 2hz for bluetooth, and 5hz or more with many ubx messages for log to SD card).

Stefal commented 1 year ago

Test with the 3.4RC (RTK Firmware: v3.4-May 25 2023) :

Result:

image

I will share the logs with you in private (email)

nseidle commented 1 year ago

I've replicated the issue. An overnight run produced ~30 logs. The unit is closing old logs and opening new ones because no new data is being received for more than 5 seconds:

image

I haven't yet figured out why the ESP32 thinks the ZED is not outputting data. TBD!

nseidle commented 1 year ago

Serial bytes are all being read correctly even when the file size is not increasing.

image

This would seem to indicate the validation parser is not passing data on to the SD logging routine. I would expect the parser to throw a few sentences out across multiple hours. The fact it's throwing out multiple seconds worth of data is wrong.

nseidle commented 1 year ago

tldr; The release candidate should have some major improvements. Please load v3.5RC when possible and report back.

As usual, thanks for reporting this and sorry for the data corruptions!

To replicate this issue:

image

image

Above: This is how the system works. The ESP32 takes in NMEA from the ZED transmits that NMEA to the phone over BT. Everything works fine, there are no NMEA or UBX sentence corruptions (ie, no data is dropped).

image

Above: If we start transmitting RTCM/NTRIP corrections from the phone, back to the ESP32, and back to the ZED, we begin to see bytes dropped (from the ZED-ESP NMEA communication).

image

Above: Interestingly, if we suppress the transmission of serial from the ESP32 to the ZED (commenting out lines in the Task.ino btReadTask code), there is no data corruption. It's therefore presumed that the BT stack is not interfering with the ESP32 UART2 (it's not an interrupt issue).

image

Above: Additionally, I verified the ZED was correctly outputting serial at all times. It is. The ESP32 is dropping characters out of the UART.

Things we’ve tried:

Nothing above decreased the number of corrupted sentences coming from the ZED when the ESP32 is writing data.

What is especially telling is the reduction of the UART speed. If we are slowing down the UART speed, all interrupts and tasks have a lot more time to operate before data is lost, yet we still lose data.

I believe there is a very low level issue with the esp32-idf that causes the UART FIFO to misbehave when we are simultaneously reading and writing to it. I am not able to figure it out at this time. However, there is a simple fix:

image

Above: We can push the RTCM over I2C rather than UART, thus avoiding a serial2.write() while we are doing a serial2.read().

Originally I was seeing corruptions every ~10s or so. With the fix in place, I have not seen a corrupted message at 10Hz, with 10 messages turned on, in 15 minutes.

There is still work to do:

Stefal commented 1 year ago

Thanks Nathan,

This RC is uploaded on my express. I will do a quick test tomorrow evening.

Stefal commented 1 year ago

Tests results with the 3.5RC, during about 1,5 hour :

At least 90% of what you see in the screenshot should have been green (RTK Fix). (5hz and 4Hz, NMEA + NAV-PVT and NAV-HPPOSLLH) image

Next tests : default settings with "automotive" dynamic model. At least 90% of what you see should have been green. Blue is dgps, orange is 3D and yellow is for Float. image

The conclusion : it's a failure.

Good points :

It looks like the F9P doesn't always received the rtcm data stream.

pyrog commented 1 year ago

Could you swap RTCM and UBX/NMEA ?

Eric-FR commented 1 year ago

Following your investigations, I made new tests.

RTK Express Firmware 3.0

Dynamic : automotive UBX+NMEA at 4 Hz

Rover static on balcony Android smartphone connected to caster with wifi+FTTH. Start up 1 hour recording Shut down 4 files I first checked integrity with UBX_Integrity_Checker (related to https://github.com/sparkfun/SparkFun_RTK_Firmware/issues/345):

-No connection : Integrity OK. -BT connection with SW_Maps, NMEA reading, no RCTM sent: Integrity OK. -BT connection with Lefebure, NMEA + RCTM : 15 resyncs -BT connection with SW_Maps, NMEA + RCTM : 24 resyncs

For the last two files, I also checked the age of differentials:

image

It is still unstable, like in: https://github.com/sparkfun/SparkFun_RTK_Firmware/issues/359

nseidle commented 1 year ago

I was able to replicate Stefal's report: RTK lock would be lost after a brief amount of time (less than 10 minutes).

@PaulZC pointed out the flaw in my solution. We can't push single bytes to the ZED over I2C. Additionally, we were not servicing the Bluetooth buffer before characters were dropped leading to 'RX Full! Discarding X bytes' errors (BT RX buffer is 512 bytes, I2C comm was blocking). A buffer has been implemented. I've had a solid RTK lock for the last hour.

Sorry for the hassle! Please give v3.5 RC June 19th a try.

Stefal commented 1 year ago

Tests with the v3.5 RC June 19th:

I don't see any improvement.

image

image

Stefal commented 1 year ago

@nseidle As you have good results with the latest RC, I had some concerns about my hardware. So I powered up back my old logger. I drove on the same road, with the same antenna and using the same CORS.

Here is the result, more than 96% of RTK Fix

image

nseidle commented 1 year ago

How did you feed the RTCM back to the ZED without a Bluetooth connection?

Stefal commented 1 year ago

How did you feed the RTCM back to the ZED without a Bluetooth connection?

I don't understand. I always had a bluetooth connection.

Edit : Ho ! Got it !

My logger is a small fork of the one from @PaulZC. I have added a bluetooth module connected on the ZED UART2

nseidle commented 1 year ago

Ah, ok, thanks! I was just looking at the photos and was confused.

Can you walk me through the steps you use to process the ubx files through rtklib?

Stefal commented 1 year ago

Rtklib (only rtkplot) is only used to display the trace and get % stats for single/float/fix:

I can open the same pos file in josm, the desktop app to edit OpenStreetMap. I like this way because I use it everyday for Osm contribution, and for all the aerial image sources available : image

nseidle commented 1 year ago

Excellent! Thank you. I captured a route in my car but did not have the extra msgs turned on. I'll capture another tomorrow.

Stefal commented 1 year ago

This morning I spent some times with U-Center on other ubx messages, like MON-MSGPP, MON-COMMS, RXM-COR, RXM-RTCM. I noticed one thing : With MON-MSGPP, if I compare the rtcm message count during 1 minute on my old logger and on the RTK Express, I got:

I'm too busy to go deeper, but it looks like an interesting point.

nseidle commented 1 year ago

Thank you Stefal! This is very helpful.

nseidle commented 1 year ago

Below is a ~10 minute drive with a baseline of 11.5km (7 miles) using an RTK Express, running v3.5 (release version), and a u-blox magnetic mount antenna. Seems fine to me.

image

Additionally, your comments about RTCM msg/min had me go looking.

image

Above is a 5 minute capture via u-center RXM-RTCM. While the ZED should have received 336 messages, a few of the sentence types (1124 in particular) are getting corrupted via I2C (~10% worst case). I would not expect a loss of 10% of a particular message to cause the ZED to drop to RTK Float. I've seen the ZED maintain RTK fix for nearly 30 seconds without RTCM so a lost message, very periodically should not be causing your issue. The mystery continues!

I cannot yet explain why you're dropping so much RTCM via Bluetooth but we'll keep working on it. I think there's a few more spots I can improve.

Stefal commented 1 year ago

The base station you've used seems to not send many msm messages. It could explain why we got so different results.

Here are the results with my own base station and its default settings, after a 5 minutes capture. RTCM messages: 1004,1005(10),1006,1008(10),1012,1019,1020,1033(10),1042,1045,1046,1077,1087,1097,1107,1127,1230

1129 vs 2710 messages, that's a big difference !

If you want to test, you can use my own base (mount point EMONIE on caster.centipede.fr:2101)

nseidle commented 1 year ago

Ah, well that is a lot! That would explain it. The Bluetooth connection is getting overwhelmed.

I think I've got a solution for you. I just pushed v3.6RC. Please install and then use the debug menu to increase the SPP RX buffer to 2048 (shown below):

image

I was able to get ~5 minutes of RTCM from your mount point without loosing anything. image

As a bit of history (and notes to myself): Prior to Aug 17, 2022 the firmware had a larger SPP RX buffer of 2048 bytes. So since version v2.4, the firmware has been using a smaller 512 byte Bluetooth SPP RX buffer. If you've been using this mount point for the last year, I suspect this Bluetooth overrun issue has been causing problems for you for some time. My apologies. I did not know there were mount points out there with this much data being transmitted.

Eric-FR commented 1 year ago

I made new tests. On balcony with centipede base 10 km away. Dynamic : vehicle. 4 Hz. 30 mn acquisition. Firmware 3.2 Integrity check showed 6 resync. image

Firmware 3.6RC Integrity check: no error. image

So, the two problems that I mentioned earlier are gone.

Stefal commented 1 year ago

I think I've got a solution for you. I just pushed v3.6RC. Please install and then use the debug menu to increase the SPP RX buffer to 2048 (shown below):

This my result with a 48km log. It is much better ! Thanks ! image

I'll do some other tests later, maybe next week.

Stefal commented 1 year ago

New test with a 100 + 100km trip. Everything seem right: image

nseidle commented 1 year ago

Excellent! Thanks for testing. I'll close this once we do a full release.

nseidle commented 1 year ago

Thanks for working through this issue with us. I'm glad we got it fixed! This has been added to the latest v3.6 release.