nitely / nim-hyperx

Pure Nim http2 client and server 🖖

fix data transfer bottlenecks #5

Open nitely opened 1 month ago

nitely commented 1 month ago

Fix enough bottlenecks that async dispatch itself becomes the bottleneck.

Start the server with the profiler:

$ nim c --debugger:native --threads:off -d:danger -d:useMalloc --mm:refc -o:bin/localServer examples/localServer.nim && valgrind --tool=callgrind -v ./bin/localServer

then send data to a single stream:

nim c -r -d:release examples/dataStream.nim

nitely commented 1 month ago

~current speed is 10MB/s~

Replacing `setLen` with `setLenUninit` seems to help ORC.
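
For reference, the difference is roughly the following; a minimal sketch (not the actual hyperx code), assuming Nim 2.2+ where `system.setLenUninit` is available under ORC/ARC:

```nim
# Sketch only, not hyperx's code. Assumes Nim 2.2+ (ORC/ARC),
# where system.setLenUninit exists.
# setLen zero-fills the newly added slots; setLenUninit skips that
# memset, which matters when the slots are overwritten right away.
proc readPayload(src: openArray[byte]): seq[byte] =
  result.setLenUninit(src.len)      # was: result.setLen(src.len)
  if src.len > 0:
    copyMem(addr result[0], unsafeAddr src[0], src.len)
```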

nitely commented 1 month ago

Profiling showed the SSL layer as the next bottleneck. I tried removing all SSL-related code, and the next bottleneck after that is the `add` copy; I changed that to use `moveMem`, which speeds it up to 500MB/s.

8eeb60d5bed95ebb8aa2f3dd672c8f64cdbc6ffa

So I think the SSL wrapper needs to be improved to allocate less, and go from there.
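
The `add` → `moveMem` change is roughly this; a sketch of the idea with hypothetical names, not the actual commit:

```nim
# Sketch only; names are hypothetical, not hyperx's code.
# Depending on the Nim version and GC, `buf.add(data)` can compile to a
# zero-initializing setLen plus an element-by-element copy; growing once
# and moving the bytes in a single moveMem call avoids both.
proc put(buf: var seq[byte]; data: openArray[byte]) =
  let oldLen = buf.len
  buf.setLenUninit(oldLen + data.len)
  if data.len > 0:
    moveMem(addr buf[oldLen], unsafeAddr data[0], data.len)
```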

nitely commented 1 month ago

I tried yasync and got a 2x speed-up when running h2load with a high number of streams -- unrelated to single-stream data transfer speed.

The changes are here: https://github.com/nitely/nim-hyperx/compare/master...futurevar

Load test command: ./h2load -n100000 -c10 -m1000 -t2 https://127.0.0.1:4443

nitely commented 1 month ago

I got asyncdispatch on ORC to reach 430MB/s by increasing the socket buffer from 8KB to 64KB, ~and by reusing the BIO buffer (but reusing the BIO buffer cannot be done safely)~.

On refc it reaches +500MB/s. On yasync + ORC it reaches +700MB/s.

https://github.com/nitely/nim-hyperx/compare/asyncsockssl?expand=1
https://github.com/nitely/nim-hyperx/compare/experiment?expand=1

I haven't checked how much it affects latency with many streams, but even a 16KB buffer gives a ~2x improvement.
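
For context, the buffer here is how much is read per recv call; a minimal sketch of the idea on top of std/asyncnet (the constant and the proc are illustrative, not hyperx's actual code):

```nim
# Illustrative sketch only, not hyperx's code.
import std/[asyncdispatch, asyncnet]

const readBufLen = 64 * 1024  # was effectively 8 KiB per read

proc readSome(sock: AsyncSocket): Future[seq[byte]] {.async.} =
  ## Read up to readBufLen bytes in a single recv call.
  var buf = newSeq[byte](readBufLen)
  let n = await sock.recvInto(addr buf[0], buf.len)
  buf.setLen(n)  # keep only what was actually received
  return buf
```

A bigger buffer means fewer recv calls (and fewer SSL reads) per MB transferred, which is where the throughput gain comes from; the trade-off to measure is latency when many streams share the connection.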

nitely commented 1 month ago

I removed the queue for writes, and the send lock. Checking the asyncdispatch code, it seems safe to make concurrent send calls, and at least POSIX allows it at the OS level. It's 2-3x faster on the h2load bench for every load I tried. This won't improve single-stream data transfer, though.

A queue may block user code less as long as it's not full, but thinking about it: if the user code running between sends is really fast, it will eventually fill the queue and block, and if it's too slow, the queue does not matter much. The more streams, the more likely the queue gets full. There may be a bench where the code runs exactly the right amount for a queue to help, but it seems artificial.
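
The shape of the change is roughly the following; an illustrative sketch assuming writes go through std/asyncnet (names are mine, not hyperx's):

```nim
# Illustrative sketch only, not the actual hyperx code.
import std/[asyncdispatch, asyncnet]

# Before: frames went through a queue guarded by a send lock, so only
# one send was in flight at a time. After: each stream awaits its own
# send directly; asyncdispatch and POSIX allow concurrent sends on the
# same socket (per the reasoning above).
proc sendFrames(sock: AsyncSocket; frames: seq[string]) {.async.} =
  var futs: seq[Future[void]]
  for f in frames:
    futs.add sock.send(f)  # start each send without a lock or queue
  await all(futs)          # wait for all writes to complete
```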

nitely commented 2 weeks ago

I added per-stream flow control back (#12), so data transfer is back to 10MB/s. Increasing the window size to 256KB seems to bring it up to ~200MB/s, but that cannot be done until flow control for the connection is implemented.
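
For reference, the per-stream window is advertised via SETTINGS_INITIAL_WINDOW_SIZE (RFC 7540 §6.5.2, setting id 0x4, default 65,535 bytes); a minimal sketch of what advertising a 256KB window looks like on the wire (illustrative only, not the hyperx implementation):

```nim
# Illustrative only; frame layout per RFC 7540, not hyperx's code.
# SETTINGS frame: 9-byte header (24-bit length, type=0x4, flags,
# 31-bit stream id = 0) followed by one 6-byte entry:
# 16-bit setting id (0x4 = INITIAL_WINDOW_SIZE) and a 32-bit value.
proc settingsInitialWindow(windowSize: uint32): seq[byte] =
  result = newSeq[byte](9 + 6)
  result[2] = 6'u8                  # payload length = 6 (low byte)
  result[3] = 0x4'u8                # frame type: SETTINGS
  # flags (byte 4) and stream id (bytes 5..8) stay 0
  result[10] = 0x4'u8               # setting id: INITIAL_WINDOW_SIZE
  result[11] = byte((windowSize shr 24) and 0xff)
  result[12] = byte((windowSize shr 16) and 0xff)
  result[13] = byte((windowSize shr 8) and 0xff)
  result[14] = byte(windowSize and 0xff)

let frame = settingsInitialWindow(256 * 1024)  # 256 KiB per-stream window
```

Note that this only raises the per-stream window; the connection-level window also starts at 65,535 bytes and can only be raised with WINDOW_UPDATE frames on stream 0, which is the connection flow control mentioned above.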