ntop / n2n

Peer-to-peer VPN

Is it necessary to implement multithreading support #323

Open fengdaolong opened 4 years ago

fengdaolong commented 4 years ago

If multi-thread support for packet sending and receiving, encryption and decryption, and compression and decompression can improve efficiency, then we should implement it.

Logan007 commented 4 years ago

This is a wonderful idea!

Actually, as far as I can see, ZSTD should already support it (maybe it requires some parameter), and Pearson Hashing for check-summing could support it (though it would need to be developed and evaluated). As I brought both of them in, I will definitely think about it.

As for the crypto, the situation is somewhat more scattered.

However, concerning packet sending and receiving, I fear that network operations and multi-threading might interfere too much. Those might be better off in a single-threaded main loop. But I am not an expert on networking.

On smaller one-core non-SMT CPUs, multi-threading might not be beneficial at all; the contrary could happen! So it will either have to be part of n2n's init() path (in some zstd_init() and the pearson_init()) to check for opportunities at run time (number of available CPU cores, achievable speed-up), or it needs to be designed compile-time dependent (#define MT, #define NO_CPU_CORES).
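Just to make the run-time variant a bit more concrete, a minimal sketch could look like the following; the names n2n_mt_init(), N2N_DISABLE_MT and MIN_CORES_FOR_MT are made up for illustration, not part of n2n:

```c
/* Sketch of the run-time check: the names n2n_mt_init(), N2N_DISABLE_MT
 * and MIN_CORES_FOR_MT are made up for illustration, not part of n2n. */
#include <stdio.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>               /* sysconf() */
#endif

#define MIN_CORES_FOR_MT 2        /* below this, stay single-threaded */

static int detect_cpu_cores(void) {
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return (int)si.dwNumberOfProcessors;
#else
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? (int)n : 1;
#endif
}

/* decide once, at init time, whether the multi-threaded paths pay off */
int n2n_mt_init(void) {
#ifdef N2N_DISABLE_MT             /* compile-time opt-out */
    return 0;
#else
    int cores = detect_cpu_cores();
    int use_mt = (cores >= MIN_CORES_FOR_MT);
    printf("detected %d core(s), multi-threading %s\n",
           cores, use_mt ? "enabled" : "disabled");
    return use_mt;
#endif
}
```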

Uh, another point would be the compatibility of Windows and pthreads.

So, yeah, a lot of things to ponder and discuss. Thank you for starting this discussion, I am looking forward to a fruitful one!

Logan007 commented 4 years ago

Also, a pipelined approach comes into consideration: Threads with specific tasks (e.g. receive, decode, dispatch, handle or reassemble, …) send, i.e. pipeline, their output to the next thread which takes care of the next step.

But that would require a complete re-design of all the internals. So, for now, I would refrain from this kind of multi-threading and rather concentrate on optional multi-threading of specific features, as outlined one post earlier. Just to clarify the focus…

lucktu commented 4 years ago

Aye!

Logan007 commented 4 years ago

Yet another concept running through my head right now… n2n's main loop could be considered somewhat event-triggered (a new packet arrives either on the internal or the external socket). So, the main loop could fire up a thread that handles this packet (each packet itself still handled single-threaded) while the main loop waits for the next packet, eventually firing up the next thread if another packet is in the queue.

Each packet would still be handled in a single thread, but in times of dense traffic (a lot of packets arriving around the same time), several threads would run in parallel.

Care must be taken to use mutexes for output resources such as the output sockets. Input resources for different packets (buffers) must not overlap, so that their handling cannot interfere. Also, this could change the sequence of packets, as shorter packets might be handled faster than longer ones and overtake them. But who said that UDP maintained sequence?
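To give an idea of what that could look like, here is a rough pthread-based sketch (all names are hypothetical, none of this is existing n2n code): each packet gets its own short-lived handler thread, each thread works on its own private buffer, and a mutex serializes access to the shared output socket.

```c
/* Sketch only: one short-lived handler thread per received packet,
 * with a mutex serializing access to the shared output socket.
 * All names are illustrative, none of this is existing n2n code. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

typedef struct {
    uint8_t  buf[2048];              /* private copy -> input buffers never overlap */
    size_t   len;
    int      out_sock;               /* shared output socket */
    struct sockaddr_storage dst;
    socklen_t dst_len;
    pthread_mutex_t *out_lock;       /* guards the shared output socket */
} pkt_job_t;

static void *handle_packet(void *arg) {
    pkt_job_t *job = arg;

    /* ... decrypt / decompress / decide where to forward ... */

    pthread_mutex_lock(job->out_lock);
    sendto(job->out_sock, job->buf, job->len, 0,
           (struct sockaddr *)&job->dst, job->dst_len);
    pthread_mutex_unlock(job->out_lock);

    free(job);
    return NULL;
}

/* called from the single-threaded main loop for every packet read off a socket */
static void dispatch_packet(const uint8_t *data, size_t len, int out_sock,
                            const struct sockaddr_storage *dst, socklen_t dst_len,
                            pthread_mutex_t *out_lock) {
    pkt_job_t *job = malloc(sizeof(*job));
    if (!job || len > sizeof(job->buf)) { free(job); return; }

    memcpy(job->buf, data, len);
    job->len = len;  job->out_sock = out_sock;
    job->dst = *dst; job->dst_len  = dst_len;
    job->out_lock = out_lock;

    pthread_t tid;
    if (pthread_create(&tid, NULL, handle_packet, job) == 0)
        pthread_detach(tid);          /* fire and forget */
    else
        free(job);
}
```

As the measurements below show, the per-packet pthread_create() cost alone might already eat the benefit, so a pool of pre-created threads would probably be the more realistic variant.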

Is n2n's code thread-safe?

Logan007 commented 4 years ago

Okay, I took the challenge :muscle: and did some first trials at multi-threading Pearson Hashing. The basic idea is to have one thread per 8-bit part, i.e. per byte of the output, and have them run in parallel.

  1. Creating a thread each time the 8-bit routine is called dropped performance from ~ 356 MB/s down to ~ 22 MB/s using 1,024-byte packets to hash. I found that the overhead of thread creation obviously is not negligible and discarded this alternative.

  2. Creating a thread while initializing and putting it on hold by locking a mutex. It gets started by unlocking the mutex as soon as the parameters are supplied (see the sketch below). The performance drop due to mutex handling is not as harsh as with the first alternative: it goes down to ~ 80 MB/s on that same computer. Increasing the packet size to 7,000 bytes made up for the overhead caused by the mutex handling, just compensating and getting even with the 16-bit performance; bigger packet sizes gain even more. But unfortunately, that is way beyond Ethernet packet size.
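A rough sketch of this second scheme, using semaphores for the hand-off instead of the raw mutex; all names and the pearson_lane() placeholder are illustrative only, not the actual code I benchmarked:

```c
/* Sketch of alternative 2: one worker created at init time, parked until
 * the caller hands it parameters, and signalled back when its 8-bit lane
 * is done.  Semaphores stand in for the mutex hand-off; all names and the
 * pearson_lane() placeholder are illustrative only. */
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    sem_t          go;        /* posted when parameters are ready     */
    sem_t          done;      /* posted when this lane's byte is out  */
    const uint8_t *data;
    size_t         len;
    uint8_t        result;    /* this worker's 8-bit part of the hash */
    int            quit;
} hash_worker_t;

static uint8_t pearson_lane(const uint8_t *data, size_t len) {
    /* placeholder only -- not the real Pearson table lookup */
    uint8_t h = 0;
    for (size_t i = 0; i < len; i++) h ^= data[i];
    return h;
}

static void *worker_main(void *arg) {
    hash_worker_t *w = arg;
    for (;;) {
        sem_wait(&w->go);             /* parked here most of the time */
        if (w->quit) break;
        w->result = pearson_lane(w->data, w->len);
        sem_post(&w->done);
    }
    return NULL;
}

/* init-time setup: the per-call cost is only the go/done round trip,
 * which is exactly the overhead showing up in the numbers above */
static pthread_t start_worker(hash_worker_t *w) {
    pthread_t tid;
    sem_init(&w->go, 0, 0);
    sem_init(&w->done, 0, 0);
    w->quit = 0;
    pthread_create(&tid, NULL, worker_main, w);
    return tid;
}
```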

So, two intermediate conclusions:

  1. Pearson Hashing is not worth the effort of multi-threading for the relatively small packet sizes occurring with n2n.

  2. Multi-threading overhead is significant. Thus, threads should be set up for long-running work, such as the handling of a complete packet (better, i.e. a lower overhead-to-benefit ratio), rather than for parts of the ciphers only (expecting a higher and thus worse overhead-to-benefit ratio).

By the way, this was my first time ever doing multi-threaded programming in C! Pushing n2n forward makes me learn a lot – which is a lot of fun and a good thing! :+1:

fengdaolong commented 4 years ago

Thank you for your attempts and efforts

Logan007 commented 4 years ago

Again, it sometimes can be considered pure fun! :smile: Does this read nerdy? :frog:

Logan007 commented 4 years ago

> ZSTD should already support it

Well, according to the manual, only if compiled with the build macro ZSTD_MULTITHREAD. That is something out of n2n's reach, so I would not rely on it.
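Just for completeness, this is roughly how a multi-thread-capable libzstd would be asked for worker threads (a sketch; the wrapper name make_cctx() is made up). If the library was built without ZSTD_MULTITHREAD, the parameter call simply fails and everything stays single-threaded:

```c
#include <zstd.h>

/* Sketch: ask a compression context for worker threads.  If libzstd was
 * NOT built with ZSTD_MULTITHREAD, setting a non-zero worker count fails
 * and compression silently stays single-threaded. */
static ZSTD_CCtx *make_cctx(int n_workers) {
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    if (!cctx) return NULL;
    size_t ret = ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, n_workers);
    if (ZSTD_isError(ret)) {
        /* built without ZSTD_MULTITHREAD -- keep using it single-threaded */
    }
    return cctx;
}
```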

> Is n2n's code thread-safe?

Probably not. Data structures like the hash lists and other eee->... fields get modified during packet handling. Every access (for which we would need to scour the source code) would need to be protected by a respective mutex. Any volunteers?
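Just to illustrate what every such access site would have to look like, here is a minimal sketch; edge_state_t and its fields are made up, not n2n's real structures:

```c
/* Sketch: one mutex guarding a shared peer list.  edge_state_t and its
 * fields are made-up stand-ins for n2n's real eee->... structures. */
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

typedef struct peer {
    uint8_t      mac[6];
    time_t       last_seen;
    struct peer *next;
} peer_t;

typedef struct {
    peer_t         *peers;        /* modified during packet handling  */
    pthread_mutex_t peers_lock;   /* would have to guard EVERY access */
} edge_state_t;

static void touch_peer(edge_state_t *eee, const uint8_t mac[6]) {
    pthread_mutex_lock(&eee->peers_lock);
    for (peer_t *p = eee->peers; p; p = p->next) {
        if (memcmp(p->mac, mac, 6) == 0) {
            p->last_seen = time(NULL);
            break;
        }
    }
    pthread_mutex_unlock(&eee->peers_lock);
}
```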

For my part, I will think of multi-threading when reworking the ciphers for some upcoming version. I think that I might aim at Twofish, AES or (not all of them) maybe SPECK to make one of them benefit from multi-threading.

Logan007 commented 4 years ago

> For my part, I will think of multi-threading when reworking the ciphers for some upcoming version. I think that I might aim at Twofish, AES or (not all of them) maybe SPECK to make one of them benefit from multi-threading.

I gave it a shot. I tried to multi-thread the already blazingly fast SPECK… and gloriously failed. I used the same scheme as above (number 2: initialize everything – keep a worker thread ready – controlled by mutexes – have it run as soon as the parameters are supplied). I did it with one thread to get an estimate of the setup-to-benefit ratio.

i7 2860QM: Performance dropped from 966 MB/s down to 80 MB/s on 512-byte packets as used in tools/n2n-benchmark. So, it would need more than 12 cores to make up for the multi-threading setup. Given that the optimal chunk size for SSE4.2-accelerated SPECK is 128 bytes, it would all even out at a 1,536-byte packet size – at 12 cores! This CPU only has 4 cores (plus 4 SMT). By the way, increasing the packet size in tools/n2n-benchmark to 1,500 bytes handled by a single thread shows a speed of 1,073 MB/s that wants to be beaten (it is more efficient on bigger packets).

i7 7500U: Even worse, because the AVX2-enabled single-threaded version is so fast that all the multi-threading setup completely drags it down: 1,907 MB/s go down to 122 MB/s, which would require 16 cores to break even. As the preferred chunk size for the AVX2 implementation is 256 bytes, we would need a packet size of 4,096 bytes (and 16 cores) to start beating the single-threaded version – which, by the way, on 1,500-byte packets even passes the 2 GB/s frontier: 2,127 MB/s.

SPECK already seems so lightweight that it does not make sense to burden it with multi-threading setup. At least, not to handle our relatively small packet sizes.

So, I will probably keep an eye on the other ciphers and keep you posted, for sure… :wink:

Logan007 commented 4 years ago

OFF TOPIC Given SPECK's high speeds as seen again here, I am really getting fond of this cipher. I know, it is debatable, but I for myself decided to switch my networks to -A5. It really benefits from SSE4.2 and AVX2.

With the upcoming Tiger Lake, it seems that AVX512 will come to desktop computers, too; so far, AVX512 has been more of a server-CPU domain. I would gladly offer to port the original template to AVX512, but unfortunately I do not have access to such hardware.

So, if anyone is able and willing to offer ssh access to such a computer (probably only servers for now, I have not heard of many Athena laptop computers being around yet), I will be able to do it in a few days, maybe a week. It would just require CLI ssh bash access to some Linux computer with a working git, gcc (and basically the tools required to build n2n), nano (including the colored syntax highlighting), OpenSSL (for performance comparison to the other ciphers) and working sudo (touchy, I know). I do not know if containers would maintain full access to the CPU features, but if so, maybe that would be a feasible way to do it.

Let me know, we could arrange details via Telegram @LoganOosEven .

P. S. I will try to set up some QEMU, not sure how this will work out. Access to real hardware would still be appreciated.

UPDATE This has been solved in #630 using Intel's SDE.

Oliver0624 commented 1 year ago

I have a question (that may be stupid). Would it improve efficiency if we used packet-level multi-threading? I mean that we could use several worker threads and a lock-free queue to handle packet processing.

Logan007 commented 1 year ago

That's actually one idea being discussed among the maintainers. It is expected to increase throughput on the one hand, but on the other hand we would still need some locks on data structures that get updated at certain points, so increased delay. So, expect somewhat higher ping (if measurable at all) along with higher bandwidth, and also some out-of-order packets if the threads run un-synced (with the additional cost of either syncing the threads or letting the network stack or the application figure it out).

It is not implemented yet, not even for testing, but it is an item on our list.
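For the record, a rough sketch of that worker-pool layout; a mutex/condvar queue stands in for the lock-free queue, and all names are illustrative, nothing of this exists in n2n yet:

```c
/* Sketch: the main loop pushes raw packets into a shared input queue and
 * N pre-started workers pop and process them.  A mutex/condvar queue
 * stands in for a lock-free one; every name here is illustrative. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define N_WORKERS 4

typedef struct job {
    uint8_t    *pkt;
    size_t      len;
    struct job *next;
} job_t;

typedef struct {
    job_t          *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} job_queue_t;

static void queue_push(job_queue_t *q, job_t *j) {     /* main loop side */
    j->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = j; else q->head = j;
    q->tail = j;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static job_t *queue_pop(job_queue_t *q) {              /* worker side */
    pthread_mutex_lock(&q->lock);
    while (!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    job_t *j = q->head;
    q->head = j->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return j;
}

static void *worker(void *arg) {
    job_queue_t *q = arg;
    for (;;) {
        job_t *j = queue_pop(q);
        /* decrypt / decompress / update the peer list (under its own lock) */
        free(j->pkt);
        free(j);
    }
    return NULL;
}

static void start_workers(job_queue_t *q) {
    for (int i = 0; i < N_WORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, q);
        pthread_detach(tid);
    }
}
```

The delay and out-of-order effects mentioned above would come from exactly those two spots: the queue lock and the per-packet locks inside the workers.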

Oliver0624 commented 1 year ago

> we would still need some locks on data structures that get updated at certain points, so increased delay

Oh… Do you mean some statistics data structures?

> So, expect somewhat higher ping (if measurable at all) along with higher bandwidth, and also some out-of-order packets if the threads run un-synced (with the additional cost of either syncing the threads or letting the network stack or the application figure it out).

I indeed forgot about thread sync. I think we can use another lock-free result queue and a counter to mark every result of packet processing in the queue, so a single sending thread can process them sequentially. As far as I know, sockets are not thread-safe.

Logan007 commented 1 year ago

> Oh… Do you mean some statistics data structures?

The most important would be the list of peers and the corresponding basic data that gets updated during packet processing... like their MAC, IP, last_seen (which does not only serve a statistical purpose)... and yes, some stats like packet counts...

> I indeed forgot about thread sync. I think we can use another lock-free result queue and a counter to mark every result of packet processing in the queue, so a single sending thread can process them sequentially. As far as I know, sockets are not thread-safe.

Yes, a queue would definitely help here. As far as I can see, though, you won't get around some kind of thread-communication mechanism, like waiting for some thread (still processing the next packet) to finish or, depending on how you want to implement it, a mutex on the queue (in case different threads fill the queue and read from it).
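Putting the last two posts together, here is a hedged sketch of the result-queue-plus-counter idea: workers push finished packets tagged with a sequence number, and the single sender thread only ever takes the one it is waiting for next. A mutex plus condition variable does the thread communication; all names are illustrative:

```c
/* Sketch: re-ordering output with a sequence counter.  Workers push
 * finished packets in whatever order they complete; the single sender
 * thread blocks until the next-in-sequence packet is available.
 * All names are illustrative, nothing of this exists in n2n. */
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct result {
    uint64_t       seq;        /* assigned when the packet was received */
    uint8_t       *data;
    size_t         len;
    struct result *next;
} result_t;

typedef struct {
    result_t       *head;      /* unordered list of finished packets   */
    uint64_t        next_seq;  /* sequence number the sender waits for */
    pthread_mutex_t lock;
    pthread_cond_t  ready;
} result_queue_t;

/* worker side: called whenever a packet has been processed */
static void push_result(result_queue_t *q, result_t *r) {
    pthread_mutex_lock(&q->lock);
    r->next = q->head;
    q->head = r;
    pthread_cond_broadcast(&q->ready);
    pthread_mutex_unlock(&q->lock);
}

/* sender side: single thread, so the socket is only ever used sequentially */
static result_t *pop_in_order(result_queue_t *q) {
    pthread_mutex_lock(&q->lock);
    for (;;) {
        for (result_t **pp = &q->head; *pp; pp = &(*pp)->next) {
            if ((*pp)->seq == q->next_seq) {
                result_t *r = *pp;
                *pp = r->next;
                q->next_seq++;
                pthread_mutex_unlock(&q->lock);
                return r;          /* caller sends and frees it */
            }
        }
        pthread_cond_wait(&q->ready, &q->lock);  /* not finished yet, wait */
    }
}
```

The price is exactly the waiting described above: if the next-in-sequence packet is still being processed, the sender idles even though later packets are already done.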