network-quality / goresponsiveness

A draft-ietf-ippm-responsiveness client in Go.
GNU General Public License v2.0

Load-generating connection in unrecognizable state #48

Closed: hawkinsw closed this issue 1 year ago

hawkinsw commented 1 year ago

A report from the very helpful @TillMueller:

I just tried with the client based on https://github.com/network-quality/goresponsiveness/commit/a0e0b1861d5b2d0d77e728042c915f5b7742e744, but I ran into issues again. This is the stderr output:

download: Load-generating connection with id 62 is in an unrecognizable state.
panic: download: Load-generating connection with id 62 is in an unrecognizable state.

goroutine 11 [running]:
github.com/network-quality/goresponsiveness/rpm.LoadGenerator.func1()
    /root/goresponsiveness/rpm/rpm.go:309 +0xee9
created by github.com/network-quality/goresponsiveness/rpm.LoadGenerator
    /root/goresponsiveness/rpm/rpm.go:238 +0x1e5
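
For context, the panic originates in rpm.go's load generator, which checks the status of each load-generating connection and aborts when the status is not one it recognizes. A minimal Go sketch of that kind of check -- the type, constants, and field names below are hypothetical, not goresponsiveness's actual code:

package main

import "fmt"

type lgcStatus int

const (
    lgcConnecting lgcStatus = iota
    lgcRunning
    lgcDone
)

type loadGeneratingConnection struct {
    id     int
    status lgcStatus
}

// checkState mimics the shape of the failing check: known statuses are fine,
// anything else is treated as a programming error and aborts the run.
func checkState(lgc loadGeneratingConnection) {
    switch lgc.status {
    case lgcConnecting, lgcRunning, lgcDone:
        // Recognized states: nothing to do here.
    default:
        panic(fmt.Sprintf(
            "download: Load-generating connection with id %d is in an unrecognizable state.",
            lgc.id))
    }
}

func main() {
    // An out-of-range status value triggers the panic, as in the report above.
    checkState(loadGeneratingConnection{id: 62, status: lgcStatus(42)})
}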

And here's stdout: stdout.log

I also want to stress that if there's anything you think might be wrong with my setup here, please let me know, just in case that's a possibility.

Thanks again very much for your help!

Till

Originally posted by @TillMueller in https://github.com/network-quality/goresponsiveness/issues/47#issuecomment-1523392396

hawkinsw commented 1 year ago

Hello @TillMueller -- thanks again for reporting this issue! Please let me know if we got the issue toasted with this commit!

TillMueller commented 1 year ago

Hello,

Yes, thank you for looking into it! I've just done some preliminary testing based on b0ebad5, and I'm still seeing similar error messages on about half the clients when running 20 at the same time.

stderr:

(SelfDown Probe) Probe 1288 at 2023-05-22 22:08:33.549066297 +0200 CEST m=+40.838867455 with info {0xc001d57c00 false false 0s} did not properly reuse a connection.
panic: true

goroutine 5102 [running]:
github.com/network-quality/goresponsiveness/probe.Probe({0x81caa0, 0xc000062230}, 0xc0000a8770?, 0xc00022c090, {0xc0000b60f0, 0x21}, {0xc0020d2fd0?, 0x6d32a6?}, 0x1, 0xc00009a0c8, ...)
    /root/goresponsiveness/probe/probe.go:146 +0xa97
created by github.com/network-quality/goresponsiveness/rpm.CombinedProber.func1
    /root/goresponsiveness/rpm/rpm.go:168 +0x7c5
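
For context, the info struct printed above ({0xc001d57c00 false false 0s}) appears to match the fields of net/http/httptrace's GotConnInfo (Conn, Reused, WasIdle, IdleTime), which is how a Go client can tell whether a request reused an existing connection. A minimal sketch of that detection pattern -- hypothetical names, not the project's actual probe code (in the real client the probe is expected to ride on an already-warm connection, so Reused should be true):

package main

import (
    "fmt"
    "net/http"
    "net/http/httptrace"
)

func probeOnce(client *http.Client, url string) error {
    reused := false
    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            // Reused is false when a brand-new connection was dialed for this request.
            reused = info.Reused
        },
    }
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    resp.Body.Close()
    if !reused {
        // The real client treats this as a fatal inconsistency; the sketch just reports it.
        return fmt.Errorf("probe did not properly reuse a connection")
    }
    return nil
}

func main() {
    client := &http.Client{}
    if err := probeOnce(client, "https://example.com/"); err != nil {
        fmt.Println(err)
    }
}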

stdout: stdout.log

Let me know if you need anything else to investigate this, and thanks again!

Till

hawkinsw commented 1 year ago

Very interesting! I will investigate tonight! Sorry again for the trouble! I'd love to hear if you are having any success with getting numbers/data from it! Please let me know how we can help (besides, of course, not writing bad code ;-))!

Will

hawkinsw commented 1 year ago

@TillMueller Sorry to have to ask a follow-up question: In the log that you attached, was that output from all the clients running simultaneously or just the one that failed? I am seeing some very odd output in the log (that I would really not expect). If the logs are from a single client then I will need to go one direction; if they are from several, I will need to go another! Sorry for having to ask for additional input!

Thank you, as always!

hawkinsw commented 1 year ago

Hello @TillMueller -- thanks, as always! I pushed a commit that adds some additional debugging output. I am unable to reproduce the problem locally, but I get the sense that I know what is going on. If you could rerun with the latest version (commit https://github.com/network-quality/goresponsiveness/commit/d4a0d400be96b8dda43096d41bdd6a1fc1fd4245) and see what happens, I would really appreciate it! You will still see the panic (I didn't make any substantive changes -- just some logging output), so if you can send stdout and stderr, that would be tremendous!

Thank you, again, for your help!

TillMueller commented 1 year ago

Hi @hawkinsw - apologies I didn't respond earlier; I finally got around to running the tests again. First, to answer your question: This should only be output from a single client. I'm using Ansible for orchestration and log Ansible's output to a file on the orchestration node. Barring anything going wrong during that, this and all previous log files are the output of a single node.

stderr.log stdout.log

Thank you very much for your continued help, I really appreciate it!

Till

hawkinsw commented 1 year ago

Thank you tremendously! There is some very odd behavior in these logs. Although this is akin to blaming the compiler (because it is never a bug in the compiler), I am leaning toward thinking that there is some oddity with the runtime! Thank you (as always) -- I will keep you posted!

Will

TillMueller commented 1 year ago

Of course, happy to help!

I have to admit that I have only done these tests on this one specific hardware/software combination, but unfortunately I don't really have others available. If there's anything in this setup that you'd like me to share with you or change, I'll do my best to accommodate that! I also hope I didn't miss a requirement somewhere that could explain this. I'm using a recent version of Go for the compilation, but the nodes are on Ubuntu 18.04 / kernel 4.15, so they're somewhat older but still technically supported.

Thanks again :)

hawkinsw commented 1 year ago

I believe that you are doing everything correctly! The only thing that would make this situation easier is if I were able to reproduce it myself. I attempted to flood a server last night and only succeeded in bringing the server down.

If you are able, can you describe the link between client and server? What are its properties in terms of loss, etc? That might help me!

Thank you so much! Will

TillMueller commented 1 year ago

Certainly, that's the part we're primarily testing: the clients are all connected to a single 5 GHz WiFi network. To summarize, we're looking into the behavior of networking stacks on congested WiFi networks, which is what led us to this setup. Packet loss is therefore certainly happening, and possibly reordering as well. The WiFi network is created by an access point that also does NAT (likely not relevant, but it happens to be required in our case), and from there the data reaches the server over an Ethernet network.

Does that answer your question?

Thank you!

Till

hawkinsw commented 1 year ago

Absolutely incredible! Thank you so much! I will keep you posted as I work on it tonight! Will

hawkinsw commented 1 year ago

Just another quick question that will help me with a hypothesis: Which server are you running? Related, while I have your ear, do you have packet captures for these test results? Having that would also help me debug!

Thank you tremendously!

Will

TillMueller commented 1 year ago

I'm using https://github.com/network-quality/server/tree/main/go/server (commit 94009a5).

Certainly, I can have a packet capture running while executing the tests, although that might have to wait until tomorrow. Would you rather have me capture on the client, on the server, or both?

Thanks!

Till

hawkinsw commented 1 year ago

Hello! Thank you for the quick feedback. I would love a packet capture from the client because I bet that will be easier to decipher (it would contain only the packets to/from the faulting client). Please do not worry about any delay! I am sure that you are busy with plenty of other things!

I have created a branch with something that might fix the problem. If my hypothesis is correct, then applying the fix in https://github.com/network-quality/goresponsiveness/tree/strict_max_concurrent may just do the trick. If you would not mind, could you give that branch a try?
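
For reference, the branch name suggests strictly capping how many probes/load-generating transfers can be in flight at once. A rough Go sketch of that general pattern (a counting semaphore built from a buffered channel), not the actual change in the branch; runWithCap, maxConcurrent, and work are hypothetical names:

package main

import "sync"

// runWithCap starts n workers but never lets more than maxConcurrent run at once.
func runWithCap(n, maxConcurrent int, work func(id int)) {
    sem := make(chan struct{}, maxConcurrent) // one token per in-flight worker
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        sem <- struct{}{} // blocks once maxConcurrent workers hold a token
        go func(id int) {
            defer wg.Done()
            defer func() { <-sem }() // release the slot when the worker finishes
            work(id)
        }(i)
    }
    wg.Wait()
}

func main() {
    runWithCap(20, 4, func(id int) {
        // e.g. one probe or one load-generating transfer would run here
    })
}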

Thank you, again! Will

TillMueller commented 1 year ago

Hello,

I just ran the tests again based on the strict_max_concurrent branch (commit 0ff5daf) and none of the clients crashed! Nevertheless, in case you want to validate, here are the stdout and the packet capture of one of the clients: stdout.log. GitHub didn't let me upload the ~250MB packet capture, so it's here instead: https://syncandshare.lrz.de/getlink/fiWA7dpbgAtXATYaHC4w8h/go-responsiveness-testing.pcap

Thank you very much again for looking into it!

Till

hawkinsw commented 1 year ago

WOHOO!! Absolutely amazing news. I am fairly confident that was the problem. I will download the capture and verify. If it is okay with you, I will merge that to main and then close this issue and we can proceed with fixing the other bugs that I am sure you will uncover!

Thank you so much for doing all the testing. If there is anything I can do to help you with your research, please let me know!

Will

TillMueller commented 1 year ago

Sounds good to me, thanks again and I'll let you know if I find anything else! :)

Till