u-blox / ubxlib

Portable C libraries which provide APIs to build applications with u-blox products and services. Delivered as add-on to existing microcontroller and RTOS SDKs.
Apache License 2.0
287 stars 82 forks source link

SARA-R4 module closes socket unexpectedly #165

Open arnoutdekimo opened 7 months ago

arnoutdekimo commented 7 months ago

Hi,

I ran into an unexpected issue while trying to fetch a large file over HTTP, using the socket lib. (This relates a bit to https://github.com/u-blox/ubxlib/issues/161, where also the reference to some example code was put).

Scenario:

The issue:

Sometimes this works.. But sometimes, it doesn't, and in this case, I see unexpected things:

In the ubxlib log, I see:

0:01:24:783 - AT+USORD=0,1024 0:01:24:898 - 0:01:24:938 - 0:01:24:938 - +USORD: 0,1024,"o GET file "%s"; status code was %d![00][00][00]WARNING: APPLYING HARDCODED DEFAULTS![00][00][00][00][00][00]web.be[00][00]%s[00][00]chat.layers.be:8080[00]HCJx9y-_RKepara6bQJ3pw[00][00]afe23915adf799a51e5792ba3ea27c7a9385d76b[00][00][00][00]%u: %s Built %s %s[00][00]15:21:56[00][00][00][00]Nov 13 2023[00]internet[00][00][00][00]m2m.business[00][00][00][00]cmnet[00][00][00]cmATclientreadbytes-READrequest 1 bytes 0:01:24:948 - ATclientreadbytes-result 1 bytes 0:01:24:948 - ATclientreadbytes-READrequest 1024 bytes 0:01:24:948 - wap[00][00][00][00]3gnet[00][00][00]uninet[00]uninet[00]uninet[00][00]internet.t-mobile[00]t-mobile[00]tm[00][00]ibox.tim.it[00][00][00][00]web.omnitel.it[00][00][00][00]internet.wind.biz[00][00][00][00]open.softbank.ne.jp[00]opensoftbank[00]ebMNuX1FIHg9d3DA[00]smile.world[00]dna1trop[00]so2t3k3m2a[00][00]bmobilewap[00][00][00]mpr2.bizho.net[00]Mopera U[00][00]bmobile.ne.jp[00]bmobile@wifi2[00]bmobile[00][00]public4.m2minternet.com[00][00][00][00]internet.simobil.si[00][00][00][00]internet.tusmobil.si[00][00][00][00]online.telia.se[00][00][00][00]services.telenor.se[00][00][00][00]mobileinternet.tele2.se[00][00][00][00]gprs.swisscom.ch[00][00][00][00]internet[00][00][00]click[00][00][00][00]mobile.o2.co.uk[00]faster[00]web[00]mobile.o2.co.uk[00]bypass[00]web[00]payandgo.o2.co.uk[00]payandgo[00]payandgo[00][00]internet[00]web[00]web[00]pp.vodafone.co.uk[00]wap[00]wap[00][00]three.co.uk[00][00][00][00]jtm2m[00][00][00][00]epc.tmobile.com[00][00][00]fast.tmobile.com[00][00][00][00]phone[00][00][00]wap.cingular[00]WAP@CINGULARGPRS.COM[00]CINGULAR1[00]isp.cingular[00]ISP@CINGULARGPRS.COM[00]CINGULAR1[00][00]n" 0:01:25:033 - 0:01:25:033 - OK 0:01:25:033 - 0:01:25:033 - +UUSORD: 0,13448 0:01:25:034 - 0:01:25:034 - +UUSOCL: 0 0:01:25:034 - ATclientreadbytes-result 1024 bytes 0:01:25:035 - AT+USORD=0,1024 0:01:25:149 - 0:01:25:189 - +CME ERROR: Operation not allowed

Note the +UUSOCL: 0 URC here. This seems to indicate that the remote closed the session. However, since I run my own HTTP server, I took some wireshark captures of the actual data. This does not align..

image

6.9s after the first packet was captured, the big transfer starts from the LTE modem. In my logs, this corresponds to the +- 34 second mark:

0:00:33:594 - +USOCR: 0 0:00:33:594 - 0:00:33:594 - OK 0:00:33:595 - U_SOCK: socket created, descriptor 0, network handle 0x20019848, socket handle 0. 0:00:33:595 - U_SOCK: option set command 4102:0x0fff called on descriptor 0 with value [4 int32s] 10 (0x0000000a) 0 (0x00000000) 500000 (0x0007a120) 0 (0x00000000) 0:00:33:596 - U_SOCK: timeout for socket descriptor 0 set to 10500 ms. 0:00:33:596 - U_SOCK: connecting socket to "84.199.6.178:8080"... 0:00:33:596 - AT+USOCO=0,"84.199.6.178",8080 0:00:33:700 - 0:00:33:780 - OK 0:00:33:780 - U_SOCK: socket with descriptor 0, network handle 0x20019848, socket handle 0, is connected to address "84.199.6.178:8080". 0:00:33:780 - AT+USOWR=0,128 0:00:33:883 - 0:00:33:913 - @GET /getOTA.html 0:00:33:974 - 0:00:33:975 - [00][9a][03][08]'[00] [a0]'[00] [00][00][00][00][09][00][00][00]<)[00] [f0]'[00] [a3][99][03][08]Cd[03][08][a3][99][03][08]'[00] [f0]'[00] [1d][9a][03][08]-[a9][03][08][f0]'[00] [03][00][00][00]<)[00] [f4][eb][03][08][01][00][00][00]0[a9][03][08][c7][9a][03][08]\'[00] [94]([00] [00][00][00][00]p([00] [84][00][00][00][ff][ff][ff][ff] 0:00:34:019 - +USOWR: 0,128 0:00:34:019 - 0:00:34:020 - OK 0:00:34:020 - AT+USORD=0,0 0:00:34:122 - 0:00:34:162 - 0:00:34:162 - 0:00:34:162 - 0:00:34:162 - +USORD: 0,0 0:00:34:163 - OK 0:00:34:163 - 0:00:34:238 - +UUSORD: 0,1376

image

I then see, that the module sends a FIN, 35 seconds after the fetching of data!

This is totally unexpected. In fact, on AT, the transfer just keeps going on and on (it's just slow), until the 1min 25s mark, when it finally reports a remote-initiated closure.

So, I see two main things here:

If I'd have to guess, this "could" be caused by something like the following:

But this is unexpected no? Am I doing anything wrong here?

Thanks for your feedback, Arnout

arnoutdekimo commented 7 months ago

[ socket-error-faileddownload.zip In attachment a salae capture of this

arnoutdekimo commented 7 months ago

Pff, and now I am getting even different behavior:

The logs look like this now:

0:01:02:244 - U_CELL_SOCK: found it at "84.199.6.178". 0:01:02:244 - AT+USOCR=6 0:01:02:345 - 0:01:02:425 - +USOCR: 0 0:01:02:425 - 0:01:02:426 - OK 0:01:02:426 - U_SOCK: socket created, descriptor 0, network handle 0x20019890, socket handle 0. 0:01:02:426 - U_SOCK: option set command 4102:0x0fff called on descriptor 0 with value [4 int32s] 10 (0x0000000a) 0 (0x00000000) 500000 (0x0007a120) 0 (0x00000000) 0:01:02:427 - U_SOCK: timeout for socket descriptor 0 set to 10500 ms. 0:01:02:427 - U_SOCK: connecting socket to "84.199.6.178:8080"... 0:01:02:427 - AT+USOCO=0,"84.199.6.178",8080 0:01:02:531 - 0:01:03:121 - OK 0:01:03:121 - U_SOCK: socket with descriptor 0, network handle 0x20019890, socket handle 0, is connected to address "84.199.6.178:8080". 0:01:03:122 - AT+USOWR=0,128 0:01:03:224 - 0:01:03:254 - @GET /getOTA.html 0:01:03:316 - 0:01:03:316 - [00][9a][03][08]'[00] [a0]'[00] [00][00][00][00][07][00][00][00]<)[00] [f0]'[00] [a3][99][03][08]Cd[03][08][a3][99][03][08]'[00] [f0]'[00] [1d][9a][03][08]-[a9][03][08][f0]'[00] [03][00][00][00]<)[00] [f4][eb][03][08][01][00][00][00]0[a9][03][08][c7][9a][03][08]\'[00] [94]([00] [00][00][00][00]p([00] [84][00][00][00][ff][ff][ff][ff] 0:01:03:330 - +USOWR: 0,128 0:01:03:350 - 0:01:03:350 - OK 0:01:03:351 - AT+USORD=0,0 0:01:03:453 - 0:01:03:493 - 0:01:03:493 - 0:01:03:493 - 0:01:03:493 - +USORD: 0,0 0:01:03:494 - OK 0:01:03:494 - AT+USORD=0,0 0:01:03:596 - 0:01:03:636 - 0:01:03:636 - 0:01:03:636 - 0:01:03:636 - +USORD: 0,0 0:01:03:637 - OK 0:01:03:637 - AT+USORD=0,0 0:01:03:739 -

...

And then every so ofton data does get through for a moment

0:01:09:356 - 0:01:09:356 - 0:01:09:356 - 0:01:09:356 - +USORD: 0,0 0:01:09:357 - OK 0:01:09:357 - 0:01:09:452 - +UUSORD: 0,1376 0:01:09:453 - AT+USORD=0,1024 0:01:09:564 - 0:01:09:604 - 0:01:09:604 - +USORD: 0,1024,"[00][00][04] [95][05][01][08]I[c4][00][08]i[c4][00][08][89][c4][00][08][a9][c4][00][08][c9][c4][00][08][e5][05][01][08][00][00][00][00][00][00][00][00][00][00][00][00][a1][02][02][08][e9][c4][00][08][00][00][00][00]A[02][02][08][d5][c6][01][08][e5][05][01][08][e5][05][01][08][f9][c4][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08] 0:01:09:611 - [c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]![c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]5[c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05]ATclientreadbytes-READrequest 1 bytes 0:01:09:616 - ATclientreadbytes-result 1 bytes 0:01:09:616 - ATclientreadbytes-READrequest 1024 bytes 0:01:09:616 - [01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]I[c5][00][08]][c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08]q[c5][00][08][00][00][00][00][85][c5][00][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][00][00][00][00][00][00][00][00][00][00][00][00][e5][05][01][08][e5][05][01][08][00][00][00][00][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01][08][e5][05][01]+[11][f8][01];[01]([bf][9a]B[f7][d0][d0][1a]pG[03]F[13][f8][01]+[00][fb][d1][18][1a][01]8pG[81][f0][00]A[02][e0][00][bf][83][f0][00]C0[b5]O[ea]A[04]O[ea]C[05][94][ea][05][0f][08][bf][90][ea][02][0f][1f][bf]T[ea][00][0c]U[ea][02][0c][7f][ea]d[7f][ea]e[00][f0][e2][80]O[ea]TT[d4][eb]UU[b8][bf]mB[0c][dd],D[80][ea][02][02][81][ea][03][03][82][ea][00][00][83][ea][01][01][80][ea][02][02][81][ea][03][03]6-[88][bf]0[bd][11][f0][00]OO[ea][01]1O[f4][80][1c]L[ea][11]1[02][d0]@Ba[eb]A[01][13][f0][00]OO[ea][03]3L[ea][13]3[02][d0]RBc[eb]C[03][94][ea][05][0f][00][f0][a7][80][a4][f1][01][04][d5][f1] [0e] 0:01:09:680 - [db][02][fa][0e][fc]"[fa][05][f2][80][18]A[f1][00][01][03][fa][0e][f2][80][18]C[fa][05][f3]YA[0e][e0][a5][f1] [05][0e][f1] [0e][01]*[03][fa][0e][fc]([bf]L[f0][02][0c]C[fa][05][f3][c0][18]Q[eb][e3]q[01][f0][00]E[07][d5]O[f0][00][0e][dc][f1][00][0c]~[eb][00][00]n[eb][01][01][b1][f5][80][1f][1b][d3][b1][f5][00][1f][0c][d3]I[08][ea1][01]<([bf][b1][f5][80][1f][e9][d2][91][f0][00][0f][04][bf][01]F[00] [b1][fa][81][f3][08][bf] 3[a3][f1][0b][03][b3][f1]" 0:01:09:688 - ATclientreadbytes-result 1024 bytes 0:01:09:688 - 0:01:09:708 - OK 0:01:09:708 - 0:01:09:708 - +UUSORD: 0,352 0:01:09:709 - AT+USORD=0,1024 0:01:09:823 - 0:01:09:863 - 0:01:09:863 - +USORD: 0,352," [02][0c][da][0c]2[08][dd][02][f1][14][0c][c2][f1][0c][02][01][fa][0c][f0]![fa][02][f1][0c][e0][02][f1][14][02][d8][bf][c2][f1] [0c][01][fa][02][f1] [fa][0c][fc][dc][bf]A[ea][0c][01][90]@[e4][1a][a2][bf][01][eb][04]Q)C0[bd]o[ea][04][04][1f]<[1c][da][0c]4[0e][dc][04][f1][14][04][c4][f1] [02] [fa][04][f0][01][fa][02][f3]@[ea][03][00]![fa][04][f3]E[ea][03][01]0[bd][c4][f1][0c][04][c4][f1] [02] [fa][02][f0][01][fa][04][f3]@[ea][03][00])F0[bd]![fa][04][f0])F0[bd][94][f0][00][0f][83][f4][80][13][06][bf][81][f4][80][11][01]4[01]=N[e7][7f][ea]d[18][bf][7f][ea]e)[d0][94][ea][05][0f][08][bf][90][ea][02][0f][05][d0]T[ea][00][0c][04][bf][19]F[10]F0[bd][91][ea][03][0f][1e][bf][00]![00] 0[bd][ea]T[05][d1]ATclientreadbytes-READrequest 1 bytes 0:01:09:872 - ATclientreadbytes-result 1 bytes 0:01:09:872 - ATclientreadbytes-READrequest 352 bytes 0:01:09:872 - @[00]IA([bf]A[f0][00]A0[bd][14][f5][80][04]<[bf][01][f5][80][11]0[bd][01][f0][00]EE[f0][fe]AA[f4]p[01]O[f0][00][00]0[bd][7f][ea]d[1a][bf][19]F[10]F[7f][ea]e[1c][bf][0b]F[02]FP[ea][01]4[06][bf]R[ea][03]5[91][ea][03][0f]A[f4][00]!0[bd][00][bf][90][f0][00][0f][04][bf][00]!pG0[b5]O[f4][80]d[04][f1]2[04]O[f0][00][05]O[f0][00][01]P[e7][00][bf][90][f0][00][0f][04][bf][00]!pG0[b5]O[f4][80]d[04][f1]2[04][10][f0][00]E" 0:01:09:898 - 0:01:09:898 - OK 0:01:09:899 - ATclientreadbytes-result 352 bytes 0:01:09:899 - AT+USORD=0,0 0:01:10:002 - 0:01:10:042 -

0:01:10:042 - 0:01:10:042 - 0:01:10:042 - +USORD: 0,0 0:01:10:043 - OK 0:01:10:043 - AT+USORD=0,0 0:01:10:145 - 0:01:10:185 - 0:01:10:185 - 0:01:10:185 - 0:01:10:185 - +USORD: 0,0 0:01:10:186 - OK 0:01:10:186 - AT+USORD=0,0 0:01:10:288 -

The frustrating part is that all seems to go well on the internet side of things, but that on the packetization on the AT interface stuff goes really slow, and I don't understand why?

RobMeades commented 7 months ago

Hi there: first thing to say is thanks for the very detailed report, second thing to say is that, since this cellular, there is the cellular network's firewall to consider in the equation; whether it is anything to do with this or not I don't know, just thought I'd mention it. And of course there is a radio path in the way.

Comments/questions:

(1) the +UUSORD: 0,xxxx you can see is the module indicating how many bytes it already has buffered, internally within the module, received from the far end. In your first log there is a lot of data buffered at the point that the +UUSOCL is received:

0:01:25:033 - +UUSORD: 0,13448
0:01:25:034 -
0:01:25:034 - +UUSOCL: 0

This aligns with your suggestion that the process of reading data from the module is lagging the IP traffic significantly. Of course, the IP behaviour within the module will have a different timing to the activity logged at that AT interface; the "wireshark equivalent" is inside the module somewhere.

I don't know how large the module's internal IP buffers are for received data but I'd guess that, at some point, it must stop receiving IP traffic from the server and wait for the MCU to read the stuff it has buffered out, and it may be that your 10 second socket timeout kicks-in. I don't have my copy of Stevens et al to hand, but I would guess that the socket option to do a receive timeout is set locally, at "your" end of the IP link, so if that link does nothing for 10 seconds then the module should terminate the IP link (not sure what else it would do)?

(2) concerning the variability of throughput, it would be worth you putting in a few calls to check what the radio environment is like, in case that is a factor. Just call uCellInfoRefreshRadioParameters() every so often, maybe between packet receives; no need to do more than that, with logging on as you have it, radio stuff will be printed in the log.

arnoutdekimo commented 7 months ago

Hi,

I don't know how large the module's internal IP buffers are for received data but I'd guess that, at some point, it must stop receiving IP traffic from the server and wait for the MCU to read the stuff it has buffered out, and it may be that your 10 second socket timeout kicks-in. I don't have my copy of Stevens et al to hand, but I would guess that the socket option to do a receive timeout is set locally, at "your" end of the IP link, so if that link does nothing for 10 seconds then the module should terminate the IP link (not sure what else it would do)?

So coming back to this, what the module -should- normally do is the following:

At no point in time, should the module decide to -close- a connection, without the application explicitly saying to do so, unless malformed TCP packets or whatever were to arrive.

Coming back to your question regarding the socket timeout - This is normally used for a the blocking read call. The read call will normally block forever, until some data is available, some errror on the socket happens (e.g. remote closes the connection), OR a timeout happens. But again, this timeout should never cause a connection to be actively closed by the module/OS.

arnoutdekimo commented 7 months ago

FYI, this is the log file with after every succesful read, doing the ucell parameter refresh successfulltransfer.zip

Note that this time it succeeded. I'll run it a couple of more times and add a failure as well

arnoutdekimo commented 7 months ago

This is a failure:

failure-withcelldbg.zip socketerrorlog.zip

RobMeades commented 7 months ago

Ah, yes, I should have checked the code, the U_SOCK_OPT_LEVEL_SOCK U_SOCK_OPT_RCVTIMEO is purely a local thing, nothing to do with the module at all.

So we are back to what reason there might be for an +UUSOCL. The signal strength and SNR from your radio parameters calls are all good, and stable as well, so no problem there. I've plotted the amount of data that +UUSORD reports in your socketerrorlog.zip and it is pretty constant, no obvious problem there:

image

There are 192 of them, so likely around 192 kbytes downloaded when the +UUSOCL occurs. I am not aware of a way to query the module for a socket error code once a socket has been closed but then again I wouldn't: I have posted a question to the relevant person internally.

arnoutdekimo commented 7 months ago

Thanks!

RobMeades commented 7 months ago

OK, so the exact person I need, who is specifically expert in the IP stack implementation of SARA-R422, is out of the office until Monday but, generically, I have confirmed that, to get a cause out of the module, will require a debug trace from the debug trace port (USB) of the module. For you to be able to get that, in the SARA-422 case, assuming you have access to the USB port, would unfortunately also need an NDA.

Alternatively, I can try doing exactly the same thing with a SARA-R422 from my desk here and hope the problem occurs for me also: I guess I can just bounce off your server and pull that same file?

arnoutdekimo commented 7 months ago

Hi,

NDA is no problem - feel free to get in touch directly for that. You are free to test with my server (as long as it is up - this is just for testing). Note that because of the speedups from https://github.com/u-blox/ubxlib/issues/159 I now don't see the issue anymore (for now) (I suspect because the transfer happens in some +- 40s instead of 1.5 minutes).

Kind regards, Arnout

RobMeades commented 7 months ago

The +UUSOCL issue being duration-related is kinda weird, not what I expected. I guess it could be timing related and that has moved the goal posts. Anyway, please drop me a line, rob.meades@u-blox.com, and we can sort the NDA in any case; might be useful for you to have module logging running.