nitely / nim-hyperx

Pure Nim http2 client and server 🖖

fix data transfer bottlenecks #5

Open nitely opened 1 month ago

nitely commented 1 month ago

Fix enough bottlenecks that async dispatch itself becomes the bottleneck.

Start the server with the profiler:

$ nim c --debugger:native --threads:off -d:danger -d:useMalloc --mm:refc -o:bin/localServer examples/localServer.nim && valgrind --tool=callgrind -v ./bin/localServer

then send data to a single stream:

nim c -r -d:release examples/dataStream.nim

nitely commented 1 month ago

~current speed is 10MB/s~

Replacing `setLen` with `setLenUninit` seems to help ORC.
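
For reference, the difference is roughly the following; a minimal sketch (not the actual hyperx code), assuming Nim 2.2+ where `system.setLenUninit` is available under ORC/ARC:

```nim
# Sketch only, not hyperx's code. Assumes Nim 2.2+ (ORC/ARC),
# where system.setLenUninit exists.
# setLen zero-fills the newly added slots; setLenUninit skips that
# memset, which matters when the slots are overwritten right away.
proc readPayload(src: openArray[byte]): seq[byte] =
  result.setLenUninit(src.len)      # was: result.setLen(src.len)
  if src.len > 0:
    copyMem(addr result[0], unsafeAddr src[0], src.len)
```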

nitely commented 1 month ago

Profiling showed the SSL layer as the next bottleneck. I tried removing all SSL-related code, and the next bottleneck after that is the `add` copy; I changed that to use `moveMem`, which speeds it up to 500MB/s.

8eeb60d5bed95ebb8aa2f3dd672c8f64cdbc6ffa

So I think the SSL wrapper needs to be improved to allocate less, and go from there.
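
The `add` → `moveMem` change is roughly this; a sketch of the idea with hypothetical names, not the actual commit:

```nim
# Sketch only; names are hypothetical, not hyperx's code.
# Depending on the Nim version and GC, `buf.add(data)` can compile to a
# zero-initializing setLen plus an element-by-element copy; growing once
# and moving the bytes in a single moveMem call avoids both.
proc put(buf: var seq[byte]; data: openArray[byte]) =
  let oldLen = buf.len
  buf.setLenUninit(oldLen + data.len)
  if data.len > 0:
    moveMem(addr buf[oldLen], unsafeAddr data[0], data.len)
```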

nitely commented 1 month ago

I tried yasync and got a 2x speed-up when running h2load with a high number of streams -- unrelated to single-stream data transfer speed.

The changes are here: https://github.com/nitely/nim-hyperx/compare/master...futurevar

Load test command: ./h2load -n100000 -c10 -m1000 -t2 https://127.0.0.1:4443

nitely commented 1 month ago

I got asyncdispatch on ORC to reach 430MB/s by increasing the socket buffer from 8KB to 64KB, ~and by reusing the BIO buffer (but reusing the BIO buffer cannot be done safely)~.

On refc it reaches +500MB/s. On yasync + ORC it reaches +700MB/s.

https://github.com/nitely/nim-hyperx/compare/asyncsockssl?expand=1
https://github.com/nitely/nim-hyperx/compare/experiment?expand=1

I haven't checked how much it affects latency with many streams, but even a 16KB buffer gives a ~2x improvement.
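
For context, the buffer here is how much is read per recv call; a minimal sketch of the idea on top of std/asyncnet (the constant and the proc are illustrative, not hyperx's actual code):

```nim
# Illustrative sketch only, not hyperx's code.
import std/[asyncdispatch, asyncnet]

const readBufLen = 64 * 1024  # was effectively 8 KiB per read

proc readSome(sock: AsyncSocket): Future[seq[byte]] {.async.} =
  ## Read up to readBufLen bytes in a single recv call.
  var buf = newSeq[byte](readBufLen)
  let n = await sock.recvInto(addr buf[0], buf.len)
  buf.setLen(n)  # keep only what was actually received
  return buf
```

A bigger buffer means fewer recv calls (and fewer SSL reads) per MB transferred, which is where the throughput gain comes from; the trade-off to measure is latency when many streams share the connection.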

nitely commented 1 month ago

I removed the queue for writes, and the send lock. Checking the asyncdispatch code, it seems safe to make concurrent send calls, and at least POSIX allows it at the OS level. It's 2-3x faster on the h2load bench for every load I tried. This won't improve single-stream data transfer, though.

A queue may block user code less as long as it's not full, but thinking about it: if the user code running between sends is really fast, it will eventually fill the queue and block, and if it's too slow, the queue does not matter much. The more streams, the more likely the queue gets full. There may be a bench where the code runs exactly the right amount for a queue to help, but it seems artificial.
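
The shape of the change is roughly the following; an illustrative sketch assuming writes go through std/asyncnet (names are mine, not hyperx's):

```nim
# Illustrative sketch only, not the actual hyperx code.
import std/[asyncdispatch, asyncnet]

# Before: frames went through a queue guarded by a send lock, so only
# one send was in flight at a time. After: each stream awaits its own
# send directly; asyncdispatch and POSIX allow concurrent sends on the
# same socket (per the reasoning above).
proc sendFrames(sock: AsyncSocket; frames: seq[string]) {.async.} =
  var futs: seq[Future[void]]
  for f in frames:
    futs.add sock.send(f)  # start each send without a lock or queue
  await all(futs)          # wait for all writes to complete
```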

nitely commented 2 weeks ago

I added per-stream flow control back (#12), so data transfer is back to 10MB/s. Increasing the window size to 256KB seems to bring it up to ~200MB/s, but that cannot be done until flow control for the connection is implemented.
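
For reference, the per-stream window is advertised via SETTINGS_INITIAL_WINDOW_SIZE (RFC 7540 §6.5.2, setting id 0x4, default 65,535 bytes); a minimal sketch of what advertising a 256KB window looks like on the wire (illustrative only, not the hyperx implementation):

```nim
# Illustrative only; frame layout per RFC 7540, not hyperx's code.
# SETTINGS frame: 9-byte header (24-bit length, type=0x4, flags,
# 31-bit stream id = 0) followed by one 6-byte entry:
# 16-bit setting id (0x4 = INITIAL_WINDOW_SIZE) and a 32-bit value.
proc settingsInitialWindow(windowSize: uint32): seq[byte] =
  result = newSeq[byte](9 + 6)
  result[2] = 6'u8                  # payload length = 6 (low byte)
  result[3] = 0x4'u8                # frame type: SETTINGS
  # flags (byte 4) and stream id (bytes 5..8) stay 0
  result[10] = 0x4'u8               # setting id: INITIAL_WINDOW_SIZE
  result[11] = byte((windowSize shr 24) and 0xff)
  result[12] = byte((windowSize shr 16) and 0xff)
  result[13] = byte((windowSize shr 8) and 0xff)
  result[14] = byte(windowSize and 0xff)

let frame = settingsInitialWindow(256 * 1024)  # 256 KiB per-stream window
```

Note that this only raises the per-stream window; the connection-level window also starts at 65,535 bytes and can only be raised with WINDOW_UPDATE frames on stream 0, which is the connection flow control mentioned above.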