Hello @tillmueller -- thanks again for reporting this issue! Please let me know if we got the issue toasted with this commit!
Hello,
Yes, thank you for looking into it! I've just done some preliminary testing based on b0ebad5, and I'm still seeing similar error messages on about half the clients when running 20 at the same time.
stderr:
(SelfDown Probe) Probe 1288 at 2023-05-22 22:08:33.549066297 +0200 CEST m=+40.838867455 with info {0xc001d57c00 false false 0s} did not properly reuse a connection.
panic: true
goroutine 5102 [running]:
github.com/network-quality/goresponsiveness/probe.Probe({0x81caa0, 0xc000062230}, 0xc0000a8770?, 0xc00022c090, {0xc0000b60f0, 0x21}, {0xc0020d2fd0?, 0x6d32a6?}, 0x1, 0xc00009a0c8, ...)
/root/goresponsiveness/probe/probe.go:146 +0xa97
created by github.com/network-quality/goresponsiveness/rpm.CombinedProber.func1
/root/goresponsiveness/rpm/rpm.go:168 +0x7c5
stdout: stdout.log
Let me know if you need anything else to investigate this, and thanks again!
Till
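For context on the panic above: the probe asserts that its HTTP request reused an existing connection, and the {... false false 0s} info in the log matches the shape of Go's httptrace.GotConnInfo. The sketch below is a minimal, hypothetical illustration of how such a reuse check can be written with net/http/httptrace; it is written under that assumption and is not the project's actual probe code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptrace"
	"os"
)

// probeOnce issues a GET with an httptrace hook attached and reports whether
// the transport reused an existing TCP connection for the request.
func probeOnce(ctx context.Context, client *http.Client, url string) (bool, error) {
	var reused bool
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			// info.Reused is false when a brand-new connection was dialed.
			reused = info.Reused
		},
	}
	req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	// Drain and close the body so the connection can return to the idle pool.
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()
	return reused, nil
}

func main() {
	client := &http.Client{}
	ctx := context.Background()
	// The first request warms up a connection; the second should reuse it.
	if _, err := probeOnce(ctx, client, "https://example.com"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	reused, err := probeOnce(ctx, client, "https://example.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	if !reused {
		// The real probe logs a similar message and then panics at this point.
		fmt.Fprintln(os.Stderr, "probe did not properly reuse a connection")
	}
}
```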
Very interesting! I will investigate tonight! Sorry again for the trouble! I'd love to hear if you are having any success with getting numbers/data from it! Please let me know how we can help (besides, of course, not writing bad code ;-)) !
Will
@TillMueller Sorry to have to ask a follow-up question: In the log that you attached, was that output from all the clients running simultaneously or just the one that failed? I am seeing some very odd output in the log (that I would really not expect). If the logs are from a single client then I will need to go one direction; if they are from several, I will need to go another! Sorry for having to ask for additional input!
Thank you, as always!
Hello @TillMueller -- thanks, as always! I pushed a commit that adds some additional debugging output. I am unable to reproduce the problem locally, but I get the sense that I know what is going on. If you could rerun with the latest version (commit https://github.com/network-quality/goresponsiveness/commit/d4a0d400be96b8dda43096d41bdd6a1fc1fd4245) and see what happens, I would really appreciate it! When the panic still happens (it will, because I didn't make any substantive changes -- just some logging output), could you send stdout and stderr? That would be tremendous!
Thank you, again, for your help!
Hi @hawkinsw - apologies I didn't respond earlier; I finally got around to running the tests again. First, to answer your question: This should only be output from a single client. I'm using Ansible for orchestration and log Ansible's output to a file on the orchestration node. Barring anything going wrong during that, this and all previous log files are the output of a single node.
Thank you very much for your continued help, I really appreciate it!
Till
Thank you tremendously! There is some very odd behavior in these logs. Although this is akin to blaming the compiler (because it is never a bug in the compiler), I am leaning toward thinking that there is some oddity with the runtime! Thank you (as always) -- I will keep you posted!
Will
Of course, happy to help!
I have to admit that I have only done these tests on this one specific hardware/software combination, but unfortunately I don't really have others available. If there's anything in this setup that you'd like me to share with you or change, I'll do my best to accommodate that! I also hope I didn't miss a requirement somewhere that could explain this. I'm using a recent version of Go for the compilation, but the nodes are on Ubuntu 18.04 / kernel 4.15, so they're somewhat older but still technically supported.
Thanks again :)
I believe that you are doing everything correctly! The only thing that would make this situation easier is if I was able to reproduce it myself. I attempted to flood a server last night and only succeeded in bringing down the server.
If you are able, can you describe the link between client and server? What are its properties in terms of loss, etc? That might help me!
Thank you so much! Will
Certainly, that's the part we're primarily testing: the clients are all connected to a single 5 GHz WiFi network. In short, we're looking into the behavior of networking stacks on congested WiFi networks, which led us to this setup. Packet loss is therefore certainly happening, and possibly reordering as well. The WiFi network is created by an access point that also does NAT (likely not relevant, but required in our case), and from there the traffic reaches the server over an Ethernet network.
Does that answer your question?
Thank you!
Till
Absolutely incredible! Thank you so much! I will keep you posted as I work on it tonight! Will
Just another quick question that will help me with a hypothesis: Which server are you running? Related, while I have your ear, do you have packet captures for these test results? Having that would also help me debug!
Thank you tremendously!
Will
I'm using https://github.com/network-quality/server/tree/main/go/server (commit 94009a5).
Certainly, I can have a packet capture running while executing the tests, although that might have to wait until tomorrow. Would you rather have me capture on the client, on the server, or both?
Thanks!
Till
Hello! Thank you for the quick feedback. I would love a packet capture from the client because I bet that will be easier to decipher (and see only the packets to/from the faulting client). Please do not worry about any delay! I am sure that you are busy with plenty of other things!
I have created a branch with something that might fix the problem. If my hypothesis is correct, then applying the fix in https://github.com/network-quality/goresponsiveness/tree/strict_max_concurrent may just do the trick. If you would not mind, could you give that branch a try?
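To make the hypothesis a little more concrete: the branch name suggests strictly capping how many probes may run concurrently. The sketch below is a generic illustration of such a cap using a counting semaphore; the limit of 4 and the placeholder probe body are assumptions for illustration, not the actual change on that branch.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maxConcurrentProbes is a hypothetical cap chosen for illustration.
const maxConcurrentProbes = 4

func main() {
	sem := make(chan struct{}, maxConcurrentProbes) // counting semaphore
	var wg sync.WaitGroup

	for i := 0; i < 20; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // block until a slot is free
			defer func() { <-sem }() // release the slot when done
			// Stand-in for a single probe over an existing connection.
			time.Sleep(50 * time.Millisecond)
			fmt.Printf("probe %d done\n", id)
		}(i)
	}
	wg.Wait()
}
```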
Thank you, again! Will
Hello,
I just ran the tests again based on the strict_max_concurrent branch (commit 0ff5daf) and none of the clients crashed! Nevertheless, in case you want to validate, here are the packet capture and stdout of one of the clients:
stdout.log
(GitHub didn't let me upload the ~250MB packet capture, so here is a link instead: https://syncandshare.lrz.de/getlink/fiWA7dpbgAtXATYaHC4w8h/go-responsiveness-testing.pcap)
Thank you very much again for looking into it!
Till
WOHOO!! Absolutely amazing news. I am fairly confident that was the problem. I will download the capture and verify. If it is okay with you, I will merge that to main and then close this issue, and we can proceed with fixing the other bugs that I am sure you will uncover!
Thank you so much for doing all the testing. If there is anything I can do to help you with your research, please let me know!
Will
Sounds good to me, thanks again and I'll let you know if I find anything else! :)
Till
A report from the very helpful @tillmueller:
I just tried with the client based on https://github.com/network-quality/goresponsiveness/commit/a0e0b1861d5b2d0d77e728042c915f5b7742e744, but I ran into issues again. This is the stderr output:
And here's stdout: stdout.log
I also want to stress that if there's anything you think might be wrong with my setup here, please let me know, just in case that's a possibility.
Thanks again very much for your help!
Till
Originally posted by @TillMueller in https://github.com/network-quality/goresponsiveness/issues/47#issuecomment-1523392396