raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

TCP connections stall, then kernel hangs on reboot request #2709

Closed scottmayo closed 4 years ago

scottmayo commented 6 years ago

Raspberry Pi 3B+, Linux lachesis 4.14.50-v7+ #1122 SMP Tue Jun 19 12:26:26 BST 2018 armv7l

This node is part of a cluster of processors (not all Pis) that open TCP connections to each other, some transiently and some long term (potentially months). There are perhaps a dozen or fewer connections at any one time. Traffic over them is light - writes amount to a dozen bytes to a couple hundred at most, generally every few seconds. wget is sometimes used to pull in a few hundred thousand bytes, rate limited and at slow intervals. One socket streams music on occasion and sends much more traffic. Bottom line, there's nothing challenging going on, and load averages are usually below 0.10 even when streaming.

Randomly, after some number of days, one (maybe more) TCP connection freezes - no data transferred. From that point on, other sockets may or may not work; I generally can't get an ssh session into the pi, but I can sometimes request a reboot over an existing one. The application that handles the reboot request definitely receives it because it shuts itself down, but the pi has to be power cycled to be recovered.

There's no obvious rhyme or reason to which socket(s) fail; for example, it doesn't seem to happen more often when streaming music. I suspect a race condition in the TCP stack but I have no evidence. I can go several weeks without an issue.

I'm experienced enough with TCP to know I'm not doing anything wonky with the sockets - other than turning Nagle off on some, this is very vanilla code. I have not seen similar behaviour on my other Pis (which are not 3B+) so I wonder if there's a multiprocessor issue.

This is a major issue for me; people indirectly using these Pis don't know how to reboot them and have no direct interaction with them except "things in the house stop working."
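
The socket setup mentioned above (Nagle turned off on some connections) can be sketched as follows, with TCP keepalive added as one possible way for an application to notice a long-lived idle connection whose peer has gone silent. This is a minimal sketch, assuming Python and Linux; the helper name `configure_socket` and the timing values are hypothetical and not taken from the reporter's code. Keepalive would not fix a driver or kernel fault; at best it turns a silent stall on an idle connection into a socket error the application can act on.

```python
import socket


def configure_socket(sock, idle_s=60, interval_s=10, probes=5):
    """Disable Nagle and enable TCP keepalive on a TCP socket.

    The timing defaults are illustrative assumptions, not recommendations.
    """
    # Turn Nagle's algorithm off so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket keepalive tuning; these constants exist on Linux but
    # not on every platform, so guard the calls.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

With these values, an idle connection to an unresponsive peer would be reported as failed after roughly `idle_s + interval_s * probes` seconds, at which point a read or write on the socket raises an error instead of blocking indefinitely.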

pelwell commented 6 years ago

There have been several improvements to the LAN7800 driver since the kernel version you are running. Please run sudo rpi-update to get the latest version and see if the connections are more reliable.

JamesH65 commented 5 years ago

@scottmayo Did updating to the latest driver fix this issue?

This issue will be closed within 30 days unless further interactions are posted. If you wish this issue to remain open, please add a comment. A closed issue may be reopened if requested.

scottmayo commented 5 years ago

I don't know. rpi-update puts up a terrifying warning about pre-releases of linux and firmware. This pi runs important infrastructure and if it doesn't work I will annoy a number of people. Better to have to reboot it every few weeks because 1 service isn't working, than to brick it and lose quite a few services for a long time.

What I am waiting for is "oh, we found a race condition in the TCP stack and fixed it" as opposed to "could you be a good guinea pig and try out these still-experimental fixes?"

I will try the standard upgrade and update. I'd assume, after all this time, that I'll get the new drivers that way?

JamesH65 commented 5 years ago

You should never update a critical system without testing on an offline one first, which is what I would suggest in this case. Testing with the current release will be sufficient; there is no need to use rpi-update.
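
After a standard upgrade on the test system, one quick sanity check is whether the running kernel has actually moved past the 4.14.50 release quoted in the original report. The sketch below is an illustration in Python; the helper names are hypothetical, and the thread does not say which release first contains the driver improvements, so the check only confirms that the kernel is newer than the reported one.

```python
import re

# Kernel version quoted in the original report (4.14.50-v7+).
REPORTED = (4, 14, 50)


def parse_kernel_release(release):
    """Return (major, minor, patch) from a string such as '4.14.50-v7+'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_newer_than_reported(release, baseline=REPORTED):
    """True if the given release is strictly newer than the baseline."""
    return parse_kernel_release(release) > baseline
```

On the Pi itself, the string to pass in is what `uname -r` prints, which Python also exposes as `platform.release()`.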

JamesH65 commented 5 years ago

@scottmayo Any test results?

This issue will be closed within 30 days unless further interactions are posted. If you wish this issue to remain open, please add a comment. A closed issue may be reopened if requested.

JamesH65 commented 4 years ago

Closing due to lack of activity. Please request to be reopened if you feel this issue is still relevant.