Logging issue : a lot of files and logs drops

Stefal commented 1 year ago

Subject of the issue

We are using some RTK Express to log data when we walk or drive. Instead of one file for each trip, we have a lot of files, drops inside the files, and too much float or single fix.

Your workbench

What version of RTK firmware are you running? Tested with 3.1, 3.2, 3.3
4 differents RTK Express
3 differents antenna
various SD card
2 to 5 Hz
messages enabled: NMEA + nav-pvt + nav-hpposllh + nav-sat + nav-gpstime + nav-utctime

Steps to reproduce

Power on, connect to a ntrip caster with lefebure ntrip client. Bluetooth link between the smartphone and the RTK Express.

Expected behavior

Only one file, and much more RTK fix

Actual behavior

2 hours log give up to 10 files with a lot of cuts. some example:

Stefal commented 1 year ago

new test with the firmware v3.0: The test was short, but I didn't have multiple ubx files. However there are some holes in the log

pobsteta commented 1 year ago

Idem on RTK Facet with firmware 3.2

nseidle commented 1 year ago

@Stefal - We found a fixed a particularly bad bug on May 18th that was causing issues with corrupt log file names and unexpected resets. This bug was introduce on Feb 28 so v3.1 should have been good, while v3.2 and v3.3 have this bug.

Can you share some of the sections of the log where the RTK Fix is lost? Can you tell if the break is due to a system reset or does the system simply lose RTK Fix and then re-gain it?

Eric-FR commented 1 year ago

Same with a RTK Express with 3.2 firmware. Tracks are divided in numerous different files. Sometime, I can observe device reboot. I lost several tracks: while RTK was active and used for Mapillary, I can't find any file on the SD card.

Foot walk with RTK Express recording NMEA+UBX at 1 Hz.

Stefal commented 1 year ago

Can you share some of the sections of the log where the RTK Fix is lost? Can you tell if the break is due to a system reset or does the system simply lose RTK Fix and then re-gain it?

I will wait for a new release with the various fixes you have mentionned, and redo some tests. I think that parts of my problems are related to the 5hz gnss frequency that the bluetooth stream can't handle. I forgot that log to file and bluetooth settings are the same, contrary to my old logger in which the stream are independant (1 or 2hz for bluetooth, and 5hz or more with many ubx messages for log to SD card).

Stefal commented 1 year ago

Test with the 3.4RC (RTK Firmware: v3.4-May 25 2023) :

About 90mn log
5hz, with all nmea message rate set to 5 or more
nav-pvt, nav-hpposllh, nav-sat, nav-utctime, nav-gpstime
Connected to my own gnss base, with Lefebure Ntrip Client.

Result:

Multiple files (9) and $GNTXT,01,01,00,ESP_RST_SW*46 messages
Some drops (about 5 seconds without data)

I will share the logs with you in private (email)

nseidle commented 1 year ago

I've replicated the issue. An overnight run produced ~30 logs. The unit is closing old logs and opening new ones because no new data is being received for more than 5 seconds:

I haven't yet figured out why the ESP32 thinks the ZED is not outputting data. TBD!

nseidle commented 1 year ago

Serial bytes are all being read correctly even when the file size is not increasing.

This would seem to indicate the validation parser is not passing data on to the SD logging routine. I would expect the parser to throw a few sentences out across multiple hours. The fact it's throwing out multiple seconds worth of data is wrong.

nseidle commented 1 year ago

tldr; The release candidate should have some major improvements. Please load v3.5RC when possible and report back.

As usual, thanks for reporting this and sorry for the data corruptions!

To replicate this issue:

Remove any SD card
Reset device to defaults
Enable CRC printing - option '20' on debug menu
Connect to device over BT
Connect NTRIP and push corrections to RTK Device over BT
View output. You will see bad checksums that look like this:

Above: This is how the system works. The ESP32 takes in NMEA from the ZED transmits that NMEA to the phone over BT. Everything works fine, there are no NMEA or UBX sentence corruptions (ie, no data is dropped).

Above: If we start transmitting RTCM/NTRIP corrections from the phone, back to the ESP32, and back to the ZED, we begin to see bytes dropped (from the ZED-ESP NMEA communication).

Above: Interestingly, if we suppress the transmission of serial from the ESP32 to the ZED (commenting out lines in the Task.ino btReadTask code), there is no data corruption. It's therefore presumed that the BT stack is not interfering with the ESP32 UART2 (it's not an interrupt issue).

Above: Additionally, I verified the ZED was correctly outputting serial at all times. It is. The ESP32 is dropping characters out of the UART.

Things we’ve tried:

Added a non-blocking TX ring buffer to esp_hal_uart.c
Increased the UART interrupt to level 3 (max that it can be) in esp_hal_uart.c
Decreased the UART baud to 57600. This should slow data rate to 5760kBytes/s. The buffer is 127 bytes so we have ~22ms.
Decreasing task overhead (ie, removing Bluetooth escape character scanning, removing Bluetooth writing, remove SD card, etc)
Decreasing serialGNSSRxFullThreshold (this should allow gnssReadTask to harvest the data faster)
Moved the UART interrupts and tasks to core 0 (from core 1).
Moved the Bluetooth tasks to core 0 (from core 1).
Increasing the priority of the gnssReadTask, decreasing the priority of the btReadTask
Increasing BT SPP TX/RX buffers sizes (including SPP_TX_MAX)
Increase uartReceiveBufferSize (this should allow the gnssReadTask more time to service the buffer)

Nothing above decreased the number of corrupted sentences coming from the ZED when the ESP32 is writing data.

What is especially telling is the reduction of the UART speed. If we are slowing down the UART speed, all interrupts and tasks have a lot more time to operate before data is lost, yet we still lose data.

I believe there is a very low level issue with the esp32-idf that causes the UART FIFO to misbehave when we are simultaneously reading and writing to it. I am not able to figure it out at this time. However, there is a simple fix:

Above: We can push the RTCM over I2C rather than UART, thus avoiding a serial2.write() while we are doing a serial2.read().

Originally I was seeing corruptions every ~10s or so. With the fix in place, I have not seen a corrupted message at 10Hz, with 10 messages turned on, in 15 minutes.

There is still work to do:

This fix increases the I2C traffic by about 530 bytes per second (~6% of max throughput of the bus at 100kHz). This is not a lot considering all the other I2C traffic (display and general ZED messages) but it is an increase.
Control what data is sent back over Bluetooth. If we can avoid sending UBX to the phone, that would be a good thing.
We still close log/open new log if the log file size is not increasing. We may want to change this in the future.

Stefal commented 1 year ago

Thanks Nathan,

This RC is uploaded on my express. I will do a quick test tomorrow evening.

Stefal commented 1 year ago

Tests results with the 3.5RC, during about 1,5 hour :

At least 90% of what you see in the screenshot should have been green (RTK Fix). (5hz and 4Hz, NMEA + NAV-PVT and NAV-HPPOSLLH)

Next tests : default settings with "automotive" dynamic model. At least 90% of what you see should have been green. Blue is dgps, orange is 3D and yellow is for Float.

The conclusion : it's a failure.

Good points :

No multiple files
No errors detected with ubx integrity checker.

It looks like the F9P doesn't always received the rtcm data stream.

pyrog commented 1 year ago

Could you swap RTCM and UBX/NMEA ?

UBX/NMEA on I2C
RTCM on UART1

Eric-FR commented 1 year ago

Following your investigations, I made new tests.

RTK Express Firmware 3.0

Dynamic : automotive UBX+NMEA at 4 Hz

Rover static on balcony Android smartphone connected to caster with wifi+FTTH. Start up 1 hour recording Shut down 4 files I first checked integrity with UBX_Integrity_Checker (related to https://github.com/sparkfun/SparkFun_RTK_Firmware/issues/345):

-No connection : Integrity OK. -BT connection with SW_Maps, NMEA reading, no RCTM sent: Integrity OK. -BT connection with Lefebure, NMEA + RCTM : 15 resyncs -BT connection with SW_Maps, NMEA + RCTM : 24 resyncs

For the last two files, I also checked the age of differentials:

It is still unstable, like in: https://github.com/sparkfun/SparkFun_RTK_Firmware/issues/359

nseidle commented 1 year ago

I was able to replicate Stefal's report: RTK lock would be lost after a brief amount of time (less than 10 minutes).

@PaulZC pointed out the flaw in my solution. We can't push single bytes to the ZED over I2C. Additionally, we were not servicing the Bluetooth buffer before characters were dropped leading to 'RX Full! Discarding X bytes' errors (BT RX buffer is 512 bytes, I2C comm was blocking). A buffer has been implemented. I've had a solid RTK lock for the last hour.

Sorry for the hassle! Please give v3.5 RC June 19th a try.

Stefal commented 1 year ago

Tests with the v3.5 RC June 19th:

I don't see any improvement.

static test, about 40% in single mode (gps/dgps) :

moving test (car), about 66% in single mode :

Stefal commented 1 year ago

@nseidle As you have good results with the latest RC, I had some concerns about my hardware. So I powered up back my old logger. I drove on the same road, with the same antenna and using the same CORS.

Here is the result, more than 96% of RTK Fix

nseidle commented 1 year ago

How did you feed the RTCM back to the ZED without a Bluetooth connection?

Stefal commented 1 year ago

How did you feed the RTCM back to the ZED without a Bluetooth connection?

I don't understand. I always had a bluetooth connection.

Edit : Ho ! Got it !

My logger is a small fork of the one from @PaulZC. I have added a bluetooth module connected on the ZED UART2

nseidle commented 1 year ago

Ah, ok, thanks! I was just looking at the photos and was confused.

Can you walk me through the steps you use to process the ubx files through rtklib?

Stefal commented 1 year ago

Rtklib (only rtkplot) is only used to display the trace and get % stats for single/float/fix:

In the RTK Surveyor/Express/... I enable NAV-HPPOSLLH, NAV-PVT, NAV-TIMEUTC messages.
I convert the ubx file to a .pos file with my old and dirty python script : https://github.com/Stefal/ubx-to-gpx (please use a venv)
I open the .pos file in rtkplot

I can open the same pos file in josm, the desktop app to edit OpenStreetMap. I like this way because I use it everyday for Osm contribution, and for all the aerial image sources available :

nseidle commented 1 year ago

Excellent! Thank you. I captured a route in my car but did not have the extra msgs turned on. I'll capture another tomorrow.

Stefal commented 1 year ago

This morning I spent some times with U-Center on other ubx messages, like MON-MSGPP, MON-COMMS, RXM-COR, RXM-RTCM. I noticed one thing : With MON-MSGPP, if I compare the rtcm message count during 1 minute on my old logger and on the RTK Express, I got:

Old logger (uart2) : about 510 rtcm msg/mn
RTK Express (i2c) : about 250 rtcm msg/mn

I'm too busy to go deeper, but it looks like an interesting point.

nseidle commented 1 year ago

Thank you Stefal! This is very helpful.

nseidle commented 1 year ago

Below is a ~10 minute drive with a baseline of 11.5km (7 miles) using an RTK Express, running v3.5 (release version), and a u-blox magnetic mount antenna. Seems fine to me.

Additionally, your comments about RTCM msg/min had me go looking.

Above is a 5 minute capture via u-center RXM-RTCM. While the ZED should have received 336 messages, a few of the sentence types (1124 in particular) are getting corrupted via I2C (~10% worst case). I would not expect a loss of 10% of a particular message to cause the ZED to drop to RTK Float. I've seen the ZED maintain RTK fix for nearly 30 seconds without RTCM so a lost message, very periodically should not be causing your issue. The mystery continues!

I cannot yet explain why you're dropping so much RTCM via Bluetooth but we'll keep working on it. I think there's a few more spots I can improve.

Stefal commented 1 year ago

The base station you've used seems to not send many msm messages. It could explain why we got so different results.

Here are the results with my own base station and its default settings, after a 5 minutes capture. RTCM messages: 1004,1005(10),1006,1008(10),1012,1019,1020,1033(10),1042,1045,1046,1077,1087,1097,1107,1127,1230

RTK Express with 3.5 release
My old logger

1129 vs 2710 messages, that's a big difference !

If you want to test, you can use my own base (mount point EMONIE on caster.centipede.fr:2101)

nseidle commented 1 year ago

Ah, well that is a lot! That would explain it. The Bluetooth connection is getting overwhelmed.

I think I've got a solution for you. I just pushed v3.6RC. Please install and then use the debug menu to increase the SPP RX buffer to 2048 (shown below):

I was able to get ~5 minutes of RTCM from your mount point without loosing anything.

As a bit of history (and notes to myself): Prior to Aug 17, 2022 the firmware had a larger SPP RX buffer of 2048 bytes. So since version v2.4, the firmware has been using a smaller 512 byte Bluetooth SPP RX buffer. If you've been using this mount point for the last year, I suspect this Bluetooth overrun issue has been causing problems for you for some time. My apologies. I did not know there were mount points out there with this much data being transmitted.

Eric-FR commented 1 year ago

I made new tests. On balcony with centipede base 10 km away. Dynamic : vehicle. 4 Hz. 30 mn acquisition. Firmware 3.2 Integrity check showed 6 resync.

Firmware 3.6RC Integrity check: no error.

So, the two problems that I mentioned earlier are gone.

Stefal commented 1 year ago

I think I've got a solution for you. I just pushed v3.6RC. Please install and then use the debug menu to increase the SPP RX buffer to 2048 (shown below):

This my result with a 48km log. It is much better ! Thanks !

I'll do some other tests later, maybe next week.

Stefal commented 1 year ago

New test with a 100 + 100km trip. Everything seem right:

nseidle commented 1 year ago

Excellent! Thanks for testing. I'll close this once we do a full release.

nseidle commented 1 year ago

Thanks for working through this issue with us. I'm glad we got it fixed! This has been added to the latest v3.6 release.

sparkfun / SparkFun_RTK_Firmware