Issue during httplib calls for fetching assisted data when CMUX is active - needed for assistnow

arnoutdekimo commented 8 months ago

Hi,

Restarting from scatch from issue https://github.com/u-blox/ubxlib/issues/162 :

The situation:

I am ultimately trying to get assisted data onto my SARA-R422M10S.
Since this device embeds the GNSS into one, I have modified the assistNow example to set the httpDevHandlehandle to the gnssDevHandle. I also only open a single device with uDeviceOpen, but then use uNetworkInterfaceUp for both the GNSS network type and then the U_NETWORK_TYPE_CELL network type.
This "works", but after doing the GNSS bringup, all AT communication is muxed (CMUX), complicating things
Eventually, the actual HTTP transfer for the assisted data 1) sometimes works 2) sometimes gives incorrect data, crashing MGA 3) sometimes makes the application hang for 10+s before concluding it did not receive enough data
To further debug this, I threw out the GNSS part, and simply reverted to the cellular-only config, which simply downloads the file in question:
I can run the httplib on big files (e.g. the offline assisted data at offline-live1.services.u-blox.com/GetOfflineData .. (around 2kB) per file. And I can do so every time succesfully. I added a debugprint in uAtClientReadBytes to show what it was requesting and what was returend for a bit of a hint.

This looks a bit like this:

0:00:09:095 - OK 0:00:09:095 - HTTP GET request will be "/GetOfflineData.ashx?token=HCJx9y-_RKepara6bQJ3pw;gnss=gps;days=1;resolution=1". 0:00:09:095 - AT+UHTTPC=0,1,"/GetOfflineData.ashx?token=HCJx9y-_RKepara6bQJ3pw;gnss=gps;days=1;resolution=1","ubxlibhttp_0" 0:00:09:209 - 0:00:09:249 - 0:00:09:249 - 0:00:09:250 - OK 0:00:09:250 - 0:00:10:632 - +UUHTTPCR: 0,1,1 0:00:10:633 - ATclientreadbytes-READrequest 1 bytes 0:00:10:787 - ATclientreadbytes-result 1 bytes 0:00:10:788 - ATclientreadbytes-READrequest 64 bytes 0:00:10:788 - ATclientreadbytes-result 64 bytes 0:00:10:922 - ATclientreadbytes-READrequest 1 bytes 0:00:11:065 - ATclientreadbytes-result 1 bytes 0:00:11:066 - ATclientreadbytes-READrequest 1024 bytes 0:00:11:066 - ATclientreadbytes-result 1024 bytes 0:00:11:301 - ATclientreadbytes-READrequest 1 bytes 0:00:11:443 - ATclientreadbytes-result 1 bytes 0:00:11:444 - ATclientreadbytes-READrequest 1024 bytes 0:00:11:444 - ATclientreadbytes-result 1024 bytes 0:00:11:674 - ATclientreadbytes-READrequest 1 bytes 0:00:11:816 - ATclientreadbytes-result 1 bytes 0:00:11:817 - ATclientreadbytes-READrequest 1024 bytes 0:00:11:817 - ATclientreadbytes-result 1024 bytes 0:00:12:047 - ATclientreadbytes-READrequest 1 bytes 0:00:12:189 - ATclientreadbytes-result 1 bytes 0:00:12:190 - ATclientreadbytes-READrequest 556 bytes 0:00:12:190 - ATclientreadbytes-result 556 bytes 0:00:12:374 - ATclientreadbytes-READrequest 1 bytes 0:00:12:517 - ATclientreadbytes-result 1 bytes 0:00:12:518 - ATclientreadbytes-READrequest 0 bytes 0:00:12:518 - ATclientreadbytes-result 0 bytes 0:00:12:518 - HTTP GET response received, 2604 byte(s). 0:00:12:518 - AT+CCLK? 0:00:12:619 - 0:00:12:659 - +CCLK: "23/11/09,16:18:38+04" 0:00:12:660 - 0:00:12:661 - OK 0:00:12:661 - U_CELL_INFO: time is 23/11/09,16:18:38+04. 0:00:12:661 - U_CELL_INFO: local time is 1699546718, timezone offset 3600 seconds, hence UTC time is 1699543118.

On the UART bus, the data looks like this (took 5 parts, where first 64bytes were requested, then 1024:

As mentioned, this works fine every time, in the sense that a "big" file is retrieved, and the code never hangs.

I've tested this now 20+ times, and this .. just works. (regardless of whether I internally use DMA or single-char interupt, which gives a totally different input rate of characters)

But then, if I modify my code to then first call the uNetworkInterfaceUp with GNSS first, after opening the device, and then calling uNetworkInterfaceUp for the cellular and then doing the HTTPlib lib fetching, things start to go wrong.

The same output changes. Now, upon my first try, the library hung while doing the http client call.

0:00:13:401 - +UUHTTPCR: 0,1,1 0:00:13:402 - ATclientreadbytes-READrequest 1 bytes 0:00:13:567 - ATclientreadbytes-result 1 bytes 0:00:13:568 - ATclientreadbytes-READrequest 64 bytes 0:00:13:568 - ATclientreadbytes-result 64 bytes 0:00:13:704 - ATclientreadbytes-READrequest 1 bytes 0:00:13:859 - ATclientreadbytes-result 1 bytes 0:00:13:860 - ATclientreadbytes-READrequest 1024 bytes 0:00:13:860 - HTTP GET request failed (-9)! 0:01:11:573 -

(Note - the timestamp of the message is actually appended after it, prepending the previous message. So here, we got a timeout of about 1 minute until we get the request failed)

On the UART bus, you can see that the data is now ecapsulated:

The bus becomes idle at this point:

Note that during the 1kB transfer, the MCU still sends other stuff:

These are the latest commands on the bus

I've been trying to step through the library and make out where things go wrong, but I just get lost. All I can say is that when the GNSS + cellular get activated for the same device, corruption seems to happen during long transfers (which is exactly the case with assistnow)

So my questions:

Is what I am doing correct? (should I use this single device an bring up its two network interfaces this way?)
Should the CMUX correclty work?
How can we further debug this issue / find the fix for this? (aside from disabling the GNSS network while doing HTTP transfers)?

Thanks for your feedback

arnoutdekimo commented 8 months ago

By the way, you can manually inspect the UART data if you want by looking at this file mcu-hangs-after-1kbfetch.zip You can open it with the free software of https://www.saleae.com/downloads/

RobMeades commented 8 months ago

Excellent information, let me digest...

[We use Saleae all the time so am familiar with that].

arnoutdekimo commented 8 months ago

(And here is a dump with the good situation) workingdump-nognssactive.zip

RobMeades commented 8 months ago

Quick comment: it is possible for the cellular module to do the download of AssistNow stuff to the internal GNSS chip automonously by itself, without you having to get involved, see https://github.com/u-blox/ubxlib/blob/e3d31cd71a2ed5f972995f7c7ddbbc79be14dad2/cell/api/u_cell_loc.h#L365 Is this something you've looked at and rejected?

arnoutdekimo commented 8 months ago

And here is the modified assist_now example that I also have the isuse with. assist_now_main.zip

arnoutdekimo commented 8 months ago

Quick comment: it is possible for the cellular module to do the download of AssistNow stuff to the internal GNSS chip automonously by itself, without you having to get involved, see

https://github.com/u-blox/ubxlib/blob/e3d31cd71a2ed5f972995f7c7ddbbc79be14dad2/cell/api/u_cell_loc.h#L365

Is this something you've looked at and rejected?

Aha, interesting.

No, tbh I just looked at the examples, found the assist_now_main, and worked further with that.

RobMeades commented 8 months ago

To be clear, I'm not saying you should change tack, just thought I'd mention the possibility in the sense of getting you working as quickly as possible. There is plainly something to debug here with long transfers.

As burbled about here, CMUX is always a compromise since flow control isn't "real" any more, there's an underlying UART transport carrying three separate channels: a control channel (0), an AT channel (1) and a GNSS channel (3 I think). If (0) ever gets flow controlled off by the UART HW because there's lots of stuff landing on (1) or (3) then you're screwed, frankly; (0) has to somehow always get through, 'cos it manages the flow control for all other channels, yet that can neve be guaranteed 'cos the UART HW sees no difference between them.

Anyway, not saying this is your issue. Question: what platform is this on (sorry if you mentioned and I forgot)? I would like to build and run your code on hopefully the same thing, or at least close to it.

FYI, I as we speak I am in the process of bringing up PPP over CMUX, an alignment in the stars :-).

arnoutdekimo commented 8 months ago

:) Great. My platform is an STM32 variant. I started off with the stmcube port, and then .. ported that port to the STMU5 series I am now using. I initially did test it with the STM32F4, which was the officially supported STM32 part.

The flow control notes you mention are indeed very interesting:)

RobMeades commented 8 months ago

I have written a small Python script that parses the CSV-exported output of the Saleae probe (set hex display format for both analyzers, rename them to something sensible (e.g. "Tx" and "Rx") and export to CSV with ISO8601 timestamps) and prints out the CMUX frames; the script is attached in case it is of use.

Running it on your mcu-hangs-after-1kbfetch.zip .sal file I get the attached output. The interesting bit at the end I have broken down below:

We have read the content length, then gone on to read the first full 1024 bytes of HTTP response, all of which appears to have been sent by the module. The bits to focus on, though, are the bits in blue.

The first bit in blue is ubxlib sending a CMUX flow-control frame on the control channel telling the module to stop please. It is sent after 3 x 128 bytes have arrived at the MCU, 37 ms after the stuff started arriving, but doesn't have any immediate effect as another 5 x 128 bytes land afterwards, Flow control off is sent in the mux code when the receive buffer is 60% full, so pretty early, and the protocol is such that we must wait for an acknowledgement.

The second blue bit you can see is that acknowledgement arriving back from the module: this is after the entire AT transaction has completed, 61 ms after the stop request was sent. Basically, flow control is reeeaaaallllyyy sloooooow over CMUX, and I suspect that data may be being thrown away because of that.

However the remainder of the trace looks suspicious: the ubxlib CMUX code appears not to like the flow control acknowledgement from the module and resends the flow control off request and this continues, with module acknowledgements, to the end. We do run test of 5 kbyte HTTP downloads and these should be run on two platforms with CMUX active but I'm wondering now if some how it is not.

EDIT: not suspicious in itself, see subsequent additional analysis below.

This I will reproduce tomorrow.

RobMeades commented 8 months ago

To reassure you that there isn't a fundamental problem here, find attached a log of our test system running the httpClient() test on an STM32F407 with a SARA-R422 connected (happens to be one of our test configurations). This test does 5 kbyte HTTP downloads so should be usefully representative.

I have done this with CMUX in operation and also, just to show that it is running, I have #defined U_CELL_MUX_ENABLE_DEBUG and U_CELL_MUX_ENABLE_USER_TX_DEBUG for the build, which prints out a shed load of debug during the test, making the log pretty much unreadable but, at least, proving the CMUX is doing stuff.

I'll see if I can do this with your code now.

RobMeades commented 8 months ago

I'm not having much luck here. I have taken your code and modified it only so that I can easily run it on the test system, should be no changes of substance (code is attached). I have run this with the same two debug #defines as indicated above on the same test system HW as I ran the httpClient() test: STM32F407 connected to a SARA-R422M8S (I don't have a SARA-R422M10S but I don't believe the GNSS chip version should be a factor in this).

This seems to be able to download 2604 bytes over HTTP and send it to the GNSS device etc., messy log with all the CMUX debug splattered throughout it attached.

When this failed for you was it with more HTTP data than that? As you can see in the log, if you look at the detail, there is a flow control stall in there that it recovers from without apparent loss of data (the GNSS device would NACK any data that was corrupted). Might there be other differences, elsewhere in your build, that I need to adopt?

Maybe, if you get a moment, you could re-run the simple case you had the Saleae log for above but also with U_CELL_MUX_ENABLE_DEBUG and U_CELL_MUX_ENABLE_USER_TX_DEBUG defined for the build, and save the log file so that I can take a look at it (also the .sal file if you like and I can then match the two).

Also, checking those flow control requests at the end of the image I pasted above in more detail, they do actually make sense:

There is one more "STOP" request towards the module, which is acked, but then the four which follow are "START" requests, i.e. telling the module to send data. These will be issued when a read is performed from the CMUX while flow control is in "STOP" state (i.e. meaning that ubxlib had sent a "STOP" request and has not yet received an ack for that "STOP" request). So every 20ish milliseconds the module is being prodded with a "please send me data" message, and it eventually acknowledges all of those. This suggests that the CMUX behaviour is correct, what I don't understand is why there isn't another AT+URDBLOCK=, i.e. at HTTP level, to actually ask for more data from the file following that.

EDIT: ...unless there was data loss. If the end got lopped off the first AT+URDBLOCK the AT parser is likely sitting around for 10 seconds waiting for the data it would have been told is coming to turn up.

RobMeades commented 8 months ago

One more thing you might try if you wish: the buffer size that is used within the CMUX for the AT-command data flow is U_CELL_MUX_PRIVATE_VIRTUAL_SERIAL_BUFFER_LENGTH_BYTES. This is set to four times the maximum length of the payload of any one CMUX I-frame on the basis that the module must listen to our flow control request within that time.

In your case, from the evidence of the .sal file, it seem that it is not, so you could #define for your build U_CELL_MUX_PRIVATE_VIRTUAL_SERIAL_BUFFER_LENGTH_BYTES to, say 1024 (you can define U_CELL_MUX_PRIVATE_VIRTUAL_SERIAL_BUFFER_LENGTH_BYTES at build-level and it will override the default) and see if that helps.

arnoutdekimo commented 7 months ago

By the way, before trying your last recommendation, I ran into something else, more easy to analyze I think.

gnss_add_failure.zip

Just at the point directly after we send "AT+CMUX=0,0,,128", and get OK, I managed to take traces of two different situations. In the first situation, we send this:

in the second scenario, we send the exact same thing. but that does not give a reply.

I'll double check my signals on the scope..

arnoutdekimo commented 7 months ago

Scope looks just fine - measured at the EVK pin

This was taken as this exact command, upon which no reply arrived.

RobMeades commented 7 months ago

Interesting: as you might guess, F9 03 3f 01 1c F9 is the CMUX frame (SABM) sent from ubxlib that opens the CMUX control channel, channel 0, which should cause the module to enter CMUX mode. Just to confirm, your 'scope image is of that transmitted frame going to the TXD pin of the module (DCE naming convention) which, as you say, looks clean enough, and you see nothing coming back from the module on its RXD pin?

There is currently no retry on the SABM: I wonder if I should add one?

arnoutdekimo commented 7 months ago

I measured it at the TXD pin - pin 16 of J104 of the ADP-R422M10S yes.

So indeed, like in the captures, sometimes it does properly respond, and sometimes there is -silence-.

RobMeades commented 7 months ago

How interesting; a delay of U_CELL_MUX_PRIVATE_ENABLE_DISABLE_DELAY_MS (100 ms) is inserted after getting the OK back to the command that enables CMUX, before sending the SABM. Maybe this is not long enough under some circumstances?

If this is repeatable, would you be able to try hacking in a retry just here, so the code becomes something like this:

                            errorCode = (int32_t) U_ERROR_COMMON_DEVICE;
                            for (size_t x = 0; (x < 3) && (errorCode != 0); x++) {
                                errorCode = openChannel(pContext, U_CELL_MUX_PRIVATE_CHANNEL_ID_CONTROL, 0);
                                if (errorCode != 0) {
                                    uPortTaskBlock(U_CELL_MUX_PRIVATE_ENABLE_DISABLE_DELAY_MS);
                                }
                            }

See if that elicits a response?

arnoutdekimo commented 7 months ago

Tricky to reproduce - most of the time it works just fine I -think- it has something to do with if the module is -just- taken out of reset/poweroff (pwron pin), and then the commands are sent, that -then- it -sometimes- goes wrong.

I merged your code, and put a breakpoint on the taskblock. It eventually gets called, and retransmits, but without any reply

logfail-withretry2.zip

Note that after a while, it gives up, sends ATI, which then works, and the booting continues without CMUX being active.

BUT, I will admit that I noticed that these tests were done with U_AT_CLIENT_DEFAULT_DELAY_MS, U_AT_CLIENT_URC_TIMEOUT_MS and U_AT_CLIENT_STREAM_READ_RETRY_DELAY_MS set to 1ms. As a result, the boot sequence might be just a bit faster, illiciting this issue for some reason. Perhaps the module isn't ready to do any cmuxing until .. n ms after boot or whatever.

When reverting those defines, for now, it seems harder to reproduce, although, while typing this .. it just did. (Attaching that log to it as well => This was was taken with the stock "slow" defines)

logfail-withretry3.zip

In that second log, the retry -does- seem to produce a response? Although from the capture, I don't see the rest of the traffic being muxed afterwards either?

(Will post another reply regarding the other cmux things shortly)

arnoutdekimo commented 7 months ago

Regarding the original CMUX issues, I think I have a pretty important lead to what is also a factor.

Lately, I had not been having these issues anymore during the assistnow offline data downloads. That was strange, as the only thing I added was an explicit pwr-on-signal (3s) toggle at the start of the code. I think this is the clue though.

What I did previously was after a test (e.g. the assistnow.c), restart my application. The application then called uDeviceOpen , which tries to check a few things. I only have the pwr pin connected, so it just starts sending commands, and if it fails, it tries to power the module up. But if it was previously in CMUX state, that didn't work, so I'd manually hold the PWRON pin for a few seconds, restart the application, and then after the powerup sequence of the library everything started to work again. Or at least "somewhat".

This workflow seems to be at the root of lots of these issues, like failing http transfers, also MGA crashes, etc. It's still very weird that it -does- cause the crashes/misbehaving of the module in the first place, but it's very useful to know how to avoid it. (and in my real usecase, the module will always just come out of full powerdown, so its not relevant)

For reference, a log where I ran the assistnow example, first succesfully, and then using this "procedure", again, but the second time the http transfer returned really strange results.

fastrestart-leadingtoissues.zip fastrestart-with-jibberish.txt

In all honesty, I am personally very much inclined to ditch that old procedure, just power cycle the device before every test, and not further debug "why" the module has issues with this poweron behavior. That avoids me losing more time.

Note that the other issue, with the initial CMUX switch failing, now occured a second time (WITH the slow timings WITH me powercycling the device before a test), so I think that's the only real issue still remaining that is really blocking me sometimes

RobMeades commented 7 months ago

Ref. your logfail-withretry3.zip case, I think I probably haven't suggested the right code change but it is still quite interesting, the period of interest being the crimson bar:

The first SABM, for the control channel (F9 03 3f 01 1c F9), is sent to the module at the far left of the crimson bar, at 34.8 seconds and nothing comes back. With the code change I was expecting this SABM to be repeated 5 seconds later but what ends up getting sent, and only at about 45.6 seconds, 12ish seconds later, is the SABM for the AT channel (1) rather than the control channel (0): F9 07 3f 01 1c F9. There is no response to this, and indeed there shouldn't be 'cos the control channel must be opened first. Another CMUX frame gets sent to the module 5 seconds later (F9 07 53 01 3f F9), which is the ubxlib code saying "close the goldarned AT channel (1) then", as a clean-up 'cos nothing appears to be working; the weird thing is that this gets a proper CMUX acknowledgement from the module: F9 07 73 01 15 F9. The net result is that CMUX is not enabled, of course, hence the stuff that follows is not muxed.

So the module seems to be in CMUX mode, 'cos it has responded correctly to a CMUX message with an acknowledgement, but the ack to tells us that it is in CMUX mode was not sent.

Let me think again, and test locally, the correct code change to repeat the SABM for the control channel.

On your second post above, having a module "stuck" in CMUX mode can certainly happen: in case it is useful I'll point out uCellMuxModuleAbort(), which should send a magic "stop yer' CMUXing Mr Module" CMUX frame, but power cycling is in any case much better, nice and clean. I can't explain why your procedure of manually holding the PWR_ON pin low and restarting the application should result in the module being in a weird state: the module should still have been properly reset (I believe the PWR_ON pin is connected directly to the power management chip internally), and a reset is, or should be, a reset. We should probably watch this as it may be a pointer to some other issue we've not quite sussed yet.

Will get back to you on the first thing.

arnoutdekimo commented 7 months ago

By the way, even with properly toggling the power, I still see the following issue, that (in case cmux is active), the http download of the GNSS sometimes simply returns 0, while the modem clearly sends it

failure-to-see-http-data.txt failure-to-see-data.zip

Somehow the lib keeps waiting for something and eventually times out

(Note that in the meantime, I had increased U_CELL_MUX_PRIVATE_VIRTUAL_SERIAL_BUFFER_LENGTH_BYTES to 1024)

RobMeades commented 7 months ago

Very odd:

0:00:24:935 - +UUHTTPCR: 0,1,1
0:00:24:936 - HTTP GET response received, 0 byte(s).

+UUHTTPCR: 0,1,1 indicates to the code that a HTTP response has been received, which has clearly been seen 'cos it's been printed out, but, almost immediately, we get the "0 bytes" print, which I guess is the print here.

Between those two prints all of the action is happening in cellCallback(). Looking at the .sal file as text attached, the function seems to have read the HTTP status code from the file, then read the content length from the file, then asked to read the first 1024 bytes from the file and all of that seems to have succeeded.

I will analyze the CMUX formatting next but I have something else I must complete today so it may not be until tomorrow now.

RobMeades commented 7 months ago

If you get a chance, it would be interesting if you could reproduce the above problem with both U_CELL_MUX_ENABLE_DEBUG and U_CELL_MUX_ENABLE_USER_TX_DEBUG defined for the build. This will produce a load of prints splattered throughout the ubxlib log which give its view of what is arriving over the UART.

arnoutdekimo commented 7 months ago

I tried to reproduce it many times (with the extra debug), but did not have much luck. Until I unpowered the module, powered it again, and things started to misbehave. (I also logged issue https://github.com/u-blox/ubxlib/issues/167)

Anyway, in addition to this, I now get the 0-bytes received error again - with trace:

0bytesreceid-withdump.txt zerobytesreceived.zip

RobMeades commented 7 months ago

Great, I will get to that as quick as I can.

RobMeades commented 7 months ago

Looking at the log, I can see the initial 64 bytes of the response file being read to get the status code, which, from the .sal file, arrives in three CMUX frames:

12:50:50.554168 "Tx": [f9][07][ef][19]|AT+URDBLOCK=|[25][f9]
12:50:50.556614 "Tx": [f9][07][ef][03]|"|[d4][f9]
12:50:50.557753 "Tx": [f9][07][ef][19]|ubxlibhttp_0|[25][f9]
12:50:50.560175 "Tx": [f9][07][ef][03]|"|[d4][f9]
12:50:50.561336 "Tx": [f9][07][ef][03]|,|[d4][f9]
12:50:50.562487 "Tx": [f9][07][ef][03]|0|[d4][f9]
12:50:50.563644 "Tx": [f9][07][ef][03]|,|[d4][f9]
12:50:50.564805 "Tx": [f9][07][ef][05]|64|[30][f9]
12:50:50.566079 "Tx": [f9][07][ef][03]|[0d]|[d4][f9]
12:50:50.569423 "Rx": [f9][05][ef][3f]|[0a]+URDBLOCK: "ubxlibhttp_0",64,"|[98][f9]
12:50:50.711275 "Rx": [f9][05][ef][83]|HTTP/1.1 200 OK[0d][0a]Content-Type: application/ubx[0d][0a]Content-Length: "|[55][f9]
12:50:50.717653 "Rx": [f9][05][ef][0d]|[0d][0a]OK[0d][0a]|[5f][f9]

This is all good and will move offset along to after the \r\n on the end of the result code line, which I make to be 17 characters in, so what follows is a read from offset 17 for 1024 bytes to obtain the content length, from the .sal file point of view:

12:50:50.845749 "Tx": [f9][07][ef][19]|AT+URDBLOCK=|[25][f9]
12:50:50.848194 "Tx": [f9][07][ef][03]|"|[d4][f9]
12:50:50.849333 "Tx": [f9][07][ef][19]|ubxlibhttp_0|[25][f9]
12:50:50.851753 "Tx": [f9][07][ef][03]|"|[d4][f9]
12:50:50.852919 "Tx": [f9][07][ef][03]|,|[d4][f9]
12:50:50.854080 "Tx": [f9][07][ef][05]|17|[30][f9]
12:50:50.855343 "Tx": [f9][07][ef][03]|,|[d4][f9]
12:50:50.856495 "Tx": [f9][07][ef][09]|1024|[39][f9]
12:50:50.858004 "Tx": [f9][07][ef][03]|[0d]|[d4][f9]
12:50:50.861308 "Rx": [f9][05][ef][43]|[0a]+URDBLOCK: "ubxlibhttp_0",1024,"|[c5][f9]
12:50:51.003277 "Rx": [f9][05][ef][00][01]|Content-Type: application/ubx[0d][0a]Content-Length: 2604[0d][0a]Connection: close[0d][0a]Server: awselb/2.0[0d][0a]Date: Fri, 17 Nov 2023 12:50:49 GMT[0d]|[44][f9]
12:50:51.015187 "Rx": [f9][05][ef][00][01]|[0a]Cache-Control: max-age=7200[0d][0a]Content-Disposition: attachment; filename=mgaoffline.ubx[0d][0a]X-Cache: Miss from cloudfront[0d][0a]Via: 1.1 |[44][f9]
12:50:51.027048 "Rx": [f9][05][ef][00][01]|70ee39257364131aeb08a57b30a5dfb4.cloudfront.net (CloudFront)[0d][0a]X-Amz-Cf-Pop: BRU50-C1[0d][0a]X-Amz-Cf-Id: XXqhCQcm94NpDszC_tdO4Rdf-DOHy|[44][f9]
12:50:51.038908 "Rx": [f9][05][ef][00][01]|LMNTSDG8tpB594aAWzYSMzHXA==[0d][0a][0d][0a][b5]b[13] L[00][00][00][02][00][17][0b][11][0c][f0][08][10]~[c0][60][c0]?[60][01][00][e4]⌂j[ff][f2]Z[0e][e4]#X[ef][83][01][15]k[08][09]rZ[dd][00][1f][18][97][0f]\E[c0][e0];[1f]U[93][c3][e1]&[ec][b0][b1][bf][88][f2][04]S[fc][e0][1e]N"[cd][ef][fe]1[00][00][00][00]T*[b5]b[13] L[00][00][00][03][00][17][0b][11]|[44][f9]
12:50:51.050769 "Rx": [f9][05][ef][00][01]|[0c][f0][08][10]~[c0][60][c0]?[fd][06][00][e4]^Y[c1][07]Z[0e][e4][e3]X[ef][e3]|[ee][d7][08][09][1a]+[ff][15][1f][18][97][0f]\E[c0][e0][fd]M[e4]j[03][a4][af][ef][aa][db][82][ac][c2][c9][cd][0f][a2][9c][16]9[e2]R<[0b][00][00][00][00][ed]K[b5]b[13] L[00][00][00][04][00][17][0b][11][0c][f0][08][10][02][c0][60][c0]?[de][02][00][e4][01][99][fd][00]Z[0e][e4][e3]X[ef]#mg[f6][05][09][c2][fd][0d][19][1f][18][97][0f]\E[c0][e0][9a][1a]<|[44][f9]
12:50:51.062628 "Rx": [f9][05][ef][00][01]|[b6]*[ac]H[15]'[f3]h[95][bd][ec][8b][fd]Jx-.[ad],[ba][d8][00][00][00][00]c[19][b5]b[13] L[00][00][00][05][00][17][0b][11][0c][f0][08][10][fe][c0][60][c0][bf][9a][ff][0f][e4][fe]([c1][f9]Z[0e][e4]cX[ef][e3]f_[1c][08][09][8f]#[f6][06][1f][18][97][0f]\E[c0][e0][8b][a5]u[aa][a4][14][a1]m[a6]j[f2][db][91]{[e2][05][a5][d1][b6][d0][e2]ft[da][00][00][00][00][e2]~[b5]b[13] L[00][00][00][06][00][17][0b][11][0c][f0][08][10]|[44][f9]
12:50:51.074486 "Rx": [f9][05][ef][00][01]|[82][c0][60][c0][bf]?[fd][0f][e4]d'[00][0a]Z[0e][e4][e3]X[ef][83]c1[ce][08][09][13][1e]![16][1f][18][97][0f]\E[c0][e0]"[9a][dc]z=@[e3][f3]A[da][c6][1a]0p[d7]9[fe][17]J[fc][8e][15]F[ee][00][00][00][00][e5][9a][b5]b[13] L[00][00][00][07][00][17][0b][11][0c][f0][08][10][82][c0][60][c0]?F[fd][0f][e4][ce][cd][82][f9]Z[0e][e4]cX[ef][c3]hW[11][08][09]1[d8][00][09][1f][18][97][0f]\E[c0][e0][f4][01]K[fd].-{|[44][f9]
12:50:51.086348 "Rx": [f9][05][ef][00][01]|Xh([fc][0e][⌂\1[c8][04][dd][17][ea][be]w[e7][00][00][00][00][b6][f3][b5]b[13] L[00][00][00][08][00][17][0b][11][0c][f0][08][10][fe][c0][60][c0]?[80][ff][0f][e4][fe][83][80][0b]Z[0e][e4]cX[ef][03]}/[ca][08][09]=[06][f6][09][1f][18][97][0f]\E[c0][e0][d1][80][81][93][f3]x[83],[dd]j[b2]-[c8][c1][92][ed][b3]Mx5[ce]6[93][16][00][00][00][00]P[c6][b5]b[13] L[00][00][00][09][00][17][0b][11][0c][f0][08][10]~[c0][60][c0]|[44][f9]
12:50:51.098209 "Rx": [f9][05][ef][03]|"|[b5][f9]
12:50:51.098824 "Rx": [f9][05][ef][0d]|[0d][0a]OK[0d][0a]|[5f][f9]

This all looks good, from the point of view of the stuff travelling across the UART. You can see that, in the Rx direction, we have 1 CMUX frame with payload length 33 (+URDBLOCK: "ubxlibhttp_0",1024,"), 8 CMUX frames each of length 128 bytes, consisting of the HTTP response data, 1 CMUX frame with payload length 1 (the closing quote on the +URDBLOCK response) and then a final CMUX frame with payload length 6 bytes (\r\nOK\r\n).

When I look at what the debug prints say the code receives, it matches except that there are only 6 CMUX frames of length 128 bytes, rather than 8. The suspicious bit is at the start:

0:00:25:331 - U_CELL_CMUX: rx 4 byte(s) (ctrl 9/100, ring 4/405). <<< [f9][05][ef][43]
0:00:25:334 - U_CELL_CMUX: rx 0 byte(s) (ctrl 5/100, ring 4/405).
0:00:25:334 - U_CELL_CMUX: rx 35 byte(s) (ctrl 40/100, ring 39/405). <<< [0a]+URDBLOCK: "ubxlibhttp_0",1024,"[c5][f9]
0:00:25:337 - U_CELL_CMUX_1: rx frame type 0xef.
0:00:25:338 - U_CELL_CMUX_1: writing 33 byte(s) of decode I-field, buffer 0/1024.
0:00:25:338 - U_CELL_CMUX_1: app read 33 byte(s).

// At this point the application has likely read \r+URDBLOCK: "ubxlibhttp_0",1024,"
// the ring buffer is empty and we're looking for the start of a new CMUX frame

0:00:25:351 - U_CELL_CMUX: rx 29 byte(s) (ctrl 34/100, ring 29/405). <<< [f9][05][ef][00][01]Content-Type: applicatio
0:00:25:478 - U_CELL_CMUX: rx 0 byte(s) (ctrl 5/100, ring 29/405).
0:00:25:479 - U_CELL_CMUX: rx 95 byte(s) (ctrl 100/100, ring 124/405). <<< n/ubx\r\nContent-Length: 2604\r\nConnection: close\r\nServer: awselb/2.0\r\nDate: Fri, 17 Nov 2023 12:
0:00:25:490 - U_CELL_CMUX: rx 33 byte(s) (ctrl 38/100, ring 157/405). <<< 50:49 GMT\r[44][f9][f9][05][ef][00][01]\nCache-Control: 

// If all of the data had arrived correctly from the UART then by this point a new
// CMUX frame should have emerged, however it hasn't.

0:00:25:491 - U_CELL_CMUX: rx 0 byte(s) (ctrl 5/100, ring 5/405).
0:00:25:492 - U_CELL_CMUX: rx 95 byte(s) (ctrl 100/100, ring 100/405). <<< max-age=7200\r\nContent-Disposition: attachment; filename=mgaoffline.ubx\r\nX-Cache: Miss from clo
0:00:25:501 - U_CELL_CMUX: rx 33 byte(s) (ctrl 38/100, ring 38/405). <<< udfront\r\nVia: 1.1 [44][f9]

// Another CMUX frame should, by now, have arrived, but the code thinks not

0:00:25:503 - U_CELL_CMUX: rx 0 byte(s) (ctrl 5/100, ring 16/405).
0:00:25:503 - U_CELL_CMUX: rx 95 byte(s) (ctrl 100/100, ring 111/405). <<< [f9][05][ef][00][01]|70ee39257364131aeb08a57b30a5dfb4.cloudfront.net (CloudFront)\r\nX-Amz-Cf
0:00:25:512 - U_CELL_CMUX: rx 33 byte(s) (ctrl 38/100, ring 144/405). <<< -Pop: BRU50-C1[0d][0a]X-Amz-Cf-Id: XXqh
0:00:25:514 - U_CELL_CMUX_1: rx frame type 0xef.

// The code signals that a new CMUX frame has arrived: from what we know of
// the UART data it should not have, however since this is a valid 128 byte frame
// we can infer that the ones in-between did not pass FCS checking

0:00:25:514 - U_CELL_CMUX_1: writing 128 byte(s) of decode I-field, buffer 0/1024.

0:00:25:515 - U_CELL_CMUX: rx 0 byte(s) (ctrl 5/100, ring 8/405).
0:00:25:515 - U_CELL_CMUX_1: app read 128 byte(s).
...

In fact, the suspicious bit is at the start of the start, the reception of 29 bytes. All the others fall into 95+33 = 128 byte patterns, so it seems like data was lost there, which would likely have caused the loss of two CMUX frames: a length indication pointing somewhere forward, a bit going missing, the length then landing us in the middle of the next CMUX frame, which of course fails FCS checking, until it can re-synch on an 0xF9 framing character.

And of course, the content length, the most important thing, would likely have been in those blocks.

Next question is, why is data being lost? According to the .sal file, it takes 237 ms for the whole "Rx:" lot above to travel across the UART (50.861308 to 51.098824). In the same period the timestamp of the CMUX debug traces go from 25:331 to 25:573 (242 ms), so it doesn't seem as though the MCU processing is lagging. The buffers were all empty not very long before the data loss occurred, so buffer size doesn't seem to be a problem, at least not of the visible buffers, unless there is a UART buffer below this which might present a limitation.

Can you obtain any diagnostics from the UART level below, e.g. framing error or the like?

arnoutdekimo commented 7 months ago

So,

I spent quite some more time trying to get to the bottom of this. Pretty elusive issue, but, as mentioned in https://github.com/u-blox/ubxlib/issues/163#issuecomment-1814033494, there are clear triggers to it.

I in fact reported https://github.com/u-blox/ubxlib/issues/169, and noticed that that issue also went away in similar ways than this one. It was easier to debug (simpler scenario), so I focussed on that.

What I did:

Started from a simple socket scenario, trying to fetch a file. (cmux was enabled)
Hosted my own file, with a known test pattern of about 3 kB on my server
In my code, opened a socket, and did an HTTP transfer (no httplib) over the socket directly, trying to fetch the data

=> Sometimes this would hang for 10s, sometimes data would miss out, etc. => So I set out to enable more debugging, checking what the module sent back etc, checking if perhaps there was an issue in my UART handling ettc.

I made a branch with code, where I can very easily reproduce the issue.

And then, things got interesting

Because CMUX handling .. is complex to follow in the code, I disabled it (by not callinguNetworkInterfaceUp of U_NETWORK_TYPE_GNSS)
=> This had the sideeffect of creating another issue, namely that the module would shut down in an unexpected state.
=> That then led to the actual root cause of why the weirdness in combination with the cmux is happening

The root cause, seems to be a glitch (caused by me) on the PWRON pin in my code, and the effect on the module in combination with the cmux setting.

The glitch was caused by my init code first initing the pwron pin to low, then toggling it for a few seconds to try to reboot it, and then handling control to ubxlib. Ubxlib then tries some AT commands, and eventually toggles it one more time to get it booted:

For some unknown reason, this causes the module to behave differently, and causes parsing errors. Perhaps the data itself that the module sends is wrong, perhaps it changes timing. I did not further investigate. But in any case, after doing this glitch reset thing, all sort of weird issues arise.

When we do NOT switch the module initally to CMUX, the behavior is oddly different, and we run into another issue:

The issue being that, during the ubxlib init, suddenly the module's outbound UART line goes low at a random point. This behavior is clearly the module doing something weird, and actually led me to discover that it is caused by the glitch in the first place.

If I don't do the glitch, everything just works. I can't reproduce any of the issues (hanging cmux, crashes due to invalid data, ..)

It was pretty unfortunate that this "glitch" caused mal behavior of the module, in such a sutble way when it was previously in CMUX mode. Instead of just "not working". it sort of worked. This led me to believe the ubxlib parsing was not working reliably etc.

If we leave a ticket open, it could be for reporting this weird behavior when glitching the pin I guess..

But for the rest, I will just only properly control it, and test from there.

Thanks anyway for all the help so far!

RobMeades commented 7 months ago

Well investigated! I have no explanation for the odd module behaviour when toggling the PWR_ON line like that: as [maybe?] indicated, that pin is connected directly to the power control chip within the module, pure HW stuff, but I couldn't tell you what the chip might do under those circumstances, or how it could possibly have an effect so much later.

There is definitely an "unsafe" time between the module powering up, internally, and it being sensible to talk to it (we choose a value for this in the ubxlib code when we add support for a new module, and ubxlib always waits this long before engaging in conversation with it): in this time the module's responsiveness to AT commands may be "spotty", or responses may be delayed/queued for so long that ubxlib gets confused as to what OK is responding to what command (that it had probably already given up on), or ERROR responses may be emitted because the module is not quite ready for the whatever-it-is AT command.

Anyway, let's leave this open for a while, as you say. I'll close it eventually when there is clearly nothing more to be contributed.

RobMeades commented 6 months ago

Hi Arnout: I'm going to close this now, please feel free to re-open it (or open another issue) if there is more to discuss.

u-blox / ubxlib

Issue during httplib calls for fetching assisted data when CMUX is active - needed for assistnow #163