shadowsocks / shadowsocks-org

www.shadowsocks.org
MIT License

[SIP] Authentication based multi-user-single-port #130

Open madeye opened 6 years ago

madeye commented 6 years ago

Background

Previous discussions suggest that we perform authentication (SIP004) on the first chunk with different keys, identifying the user by the key that succeeds.

Implementation Consideration

Performing GCM/Poly1305 on the first chunk should be very fast. It's expected that even a naive implementation would support thousands of users without any notable overhead.

Still, we can cache the successful key for each source IP, which would save most of the computation. To prevent potential DDoS attacks, any IP that fails authentication too many times should be blocked.

Given that this SIP doesn't involve any protocol change, only the server code needs to be modified. The only limitation is that AEAD ciphers are required.

Example

Jigsaw implemented a go-ss2 based server here: https://github.com/Jigsaw-Code/outline-ss-server. An early report shows that it works quite well with 100 users: https://github.com/shadowsocks/shadowsocks-org/issues/128#issuecomment-415810597

Mygod commented 6 years ago

Have you considered the possibility that NAT might mess with your cache? Namely, if two clients behind the same NAT router try to connect to the same server with different credentials, god bless you because they have the same source IP address to the server.

kimw commented 6 years ago

> Have you considered the possibility that NAT might mess with your cache? Namely, if two clients behind the same NAT router try to connect to the same server with different credentials, god bless you because they have the same source IP address to the server.

Maybe that's what we call THE COST :)

Things cannot be perfect. It depends on a BALANCE.

  1. Don't support many users on a single port (I mean really many, e.g. 100 users) => multiple ports must be opened <= that's abnormal server-side behavior.

  2. Many users sharing a single port, oh yes, it's cool!

    And, it looks kind of "clean" from the server side. The operators of SSPs (shadowsocks service providers) will owe you a beer.

    And * 2, ss-manager could, maybe, be retired.

That's just a personal comment. This SIP needs more balancing in any case.

kimw commented 6 years ago

More words:

If a shadowsocks server supports only a limited number of users, that's abnormal behavior too.

--

Following on from this idea, maybe a later SIP should be about exchanging shadowsocks servers within a kind of circle (known friends? trusted servers?)

Mygod commented 6 years ago

Hmm, only if you're okay with the COST of users one day complaining to you that it's not working because of the NAT vs. cache issue.

I suggest either not taking the cache approach, or using another protocol that already supports multi-user, like VMess (I haven't looked at the protocol yet, but it seems to support this use case).

Different people prefer different balances between things. I don't think Shadowsocks is intended to cover every kind of balance you wish for.

riobard commented 6 years ago

Hmmm… I think if we're gonna officially support multiuser per port, we might as well address the problem cleanly? #54 is still open ^_^

riobard commented 6 years ago

But I agree this hack is neat in that it does not require any changes in the clients. 👍

Mygod commented 6 years ago

Also, I should point out that the problem I raised might occur more frequently than you imagine, thanks to the exhausted IPv4 pool and widely deployed CGN. It's likely that one will run into such frustration despite having taken precautions.

riobard commented 6 years ago

CGN is a major concern. We might need to run some tests to determine the rough size of NAT pools used by ISPs doing massive CGN.

madeye commented 6 years ago

NAT should not be a problem, as long as not all of the users are behind the same NAT address.

Say five users are behind the same NAT IP address: at most five keys are cached for that IP.

madeye commented 6 years ago

This SIP just suggests a kind of multi-user-single-port solution for shadowsocks without modifying the protocol.

But as mentioned by @Mygod, shadowsocks is not designed for this purpose.

I listed this SIP here since it's already implemented in a third-party software. If anyone is interested in it as well, please follow this SIP and apply the suggested optimizations.

riobard commented 6 years ago

My worry is that people will eventually abuse this hack to run commercial services. It's not gonna scale well when users are mostly behind CGN with a small pool of public IPs, e.g. mobile networks in China.

Mygod commented 6 years ago

CGN also applies to ADSL. Also, one shouldn't forget NAT routers in enterprises, schools, etc. A good way to combat this is to enlarge the cache size and always do a fallback lookup.

madeye commented 6 years ago

A fallback lookup is always needed. Even if a key is cached, authentication is still required. If authentication fails, a fallback lookup is performed.

I don't expect millions of users on one single port. A reasonable assumption is thousands of users per server, hundreds per port.

And of course, it cannot scale for commercial usage.

celeron533 commented 6 years ago

In some places, the ISP may do NAT for an entire neighborhood, which may include 10,000 end users, by assigning IP addresses with the 100.64 prefix. It is also a kind of NAT.

https://tools.ietf.org/html/rfc6598

  IANA Considerations

    IANA has recorded the allocation of an IPv4 /10 for use as Shared Address Space.

    The Shared Address Space address range is 100.64.0.0/10.

riobard commented 6 years ago

@celeron533 This is CGN mentioned above.

shinku721 commented 5 years ago

Hmm, why not use an ElGamal-like method to identify users?

Mygod commented 5 years ago

Compatibility.

fortuna commented 5 years ago

FYI, Outline Servers have all been migrated to outline-ss-server this week. They don't yet use the single port feature, but we intend to enable it in a few weeks, after I implement the IP->cipher cache.

We can roll that out gradually and see how it performs in the wild. In my own tests, the added latency for 100 users without any optimization in a crappy $5 VPS can be significant, 10s of milliseconds, but it can vary wildly, and I believe the optimizations will help significantly. Also, outline-ss-server has Prometheus metrics, so we will be able to expose latency metrics and admins will be able to monitor that.

BTW, outline-ss-server still allows for multiple ports, and you can have multiple keys per port, and multiple ports per key. You can always start a new port if one becomes overloaded. One nice feature is that you can do that without creating a new process for each port, or stop the running one.

fortuna commented 5 years ago

It's worth mentioning that the single-port feature has some very good motivation:

fortuna commented 5 years ago

I now have a benchmark for my single-port implementation: https://github.com/Jigsaw-Code/outline-ss-server/pull/7

These are the results on a $5 Frankfurt DigitalOcean machine that is idle:

BenchmarkTCPFindCipher      1000       1304879 ns/op     2015027 B/op       3107 allocs/op
BenchmarkUDPUnpack          3000        615077 ns/op      115427 B/op       1801 allocs/op

That's 1.3ms to go over 100 ciphers for a TCP connection. 0.6 ms for a UDP datagram. That will probably be worse under load, but it gives an idea of the kind of added latency we'd be adding.

There's 2MB of allocations for one TCP connection. I believe that can be significantly reduced by sharing buffers, but it gets a little tricky with the code structure and different ciphers needing different sizes of buffers (I guess I need to find the max buffer size).

riobard commented 5 years ago

@fortuna That's a lot of allocs/op. Is that normal?

fortuna commented 5 years ago

PR https://github.com/Jigsaw-Code/outline-ss-server/pull/8 makes the TCP performance on par with UDP. We no longer allocate so much memory:

BenchmarkTCPFindCipher-12           1000       1349922 ns/op      125278 B/op       1705 allocs/op
BenchmarkUDPUnpack-12               2000        881121 ns/op      125030 B/op       1701 allocs/op

The ~2MB of allocations were because I was allocating a buffer for an entire encrypted chunk (~16KB) for each of the 100 ciphers I tried. Now I allocate only one buffer for all ciphers.

As for the number of allocations, it's just that I'm doing the operation 100 times. For 1 cipher only, I get these numbers:

BenchmarkTCPFindCipher-12          30000         52329 ns/op        1408 B/op         22 allocs/op
BenchmarkUDPUnpack-12             200000          8989 ns/op        1266 B/op         18 allocs/op

fortuna commented 5 years ago

With the new findAccessKey optimization, the allocations and CPU are dominated by the low-level crypto, so I'm not sure there's much room to improve there.

This is without the IP -> cipher cache. I'm trying to make the cipher finding as efficient as possible, to reduce the need for the cache.

fortuna commented 5 years ago

FYI, I've added an optimization to outline-ss-server that keeps the most recently used cipher at the front of the list. This way, the time to find the cipher is proportional to the number of ciphers in active use, rather than the total number of ciphers.

Furthermore, I've added the shadowsocks_time_to_cipher_ms metric that will tell you the 50th, 90th and 99th percentile times to find the cipher for each access key.

This should be enough to inform us whether the performance is good enough. It would be great if people here gave it a try and reported back. The latest binary with the changes is v1.0.3 and can be found in the releases: https://github.com/Jigsaw-Code/outline-ss-server/releases

fortuna commented 5 years ago

Update: Outline has been running servers with multi-user support on a single port for a few months now. Some organizations have 300 keys on a server, with over 100 active on any given day. Median latency due to cipher finding is around 10ms and CPU usage is minimal (bandwidth is the bottleneck).

At 90th percentile you can see cases here and there close to 1 second, but that's not common and may be due to other factors such as a burst in CPU usage (maybe expensive prometheus queries).

Has anyone here tried the single port feature? How was your experience?

madeye commented 5 years ago

Average 10ms latency looks too slow to me.

Assuming 300 users and the worst case of 300 authentications performed for each connection, a single authentication takes 33µs. That means more than 33k cycles on a 1 GHz CPU, which is too long for authenticating a small packet.

Can you elaborate more about the measurement of latency?

Mygod commented 5 years ago

2998 light-kilometers might or might not be acceptable depending on the use case, e.g. it's probably not acceptable for game streaming but probably OK for downloading/video streaming. :smile:

fortuna commented 5 years ago

This site says that 20ms is excellent RTT. So 10ms shouldn't be perceptible.

Also, this is latency added per connection, not per packet.

Mygod commented 5 years ago

How about UDP connections/packets (which are mostly used in latency-sensitive applications)?

fortuna commented 5 years ago

I have a benchmark above: https://github.com/shadowsocks/shadowsocks-org/issues/130#issuecomment-447063760

UDP takes about 9 microseconds per cipher.

Mygod commented 5 years ago

@fortuna Sorry, I mean to ask whether the added latency for UDP connections is per-connection or per-packet.

fortuna commented 5 years ago

Yes, the added latency is per packet.


Mygod commented 5 years ago

I think it would be more appropriate to optimize for UDP connections (I believe there are UDP lookup caches in the libev implementation).

fortuna commented 5 years ago

Oh, the cipher finding overhead is per UDP packet from the client. We don't need to find the cipher for the UDP packets from the remote target, because the chosen cipher is saved in the UDP association.

That means the overhead will be minimal if you're watching a video.

I guess it could be a concern if you're live streaming, but then your cipher will be kept near the front of the cipher list, which minimizes the overhead.

Mygod commented 5 years ago

@fortuna Is it technically possible to do a cache for UDP packets as well?

fortuna commented 5 years ago

Update: @bemasc has merged https://github.com/Jigsaw-Code/outline-ss-server/pull/25, which adds a new optimization to the cipher finding. We now associate a "last client IP" with each cipher. When a new request arrives, we look up the ciphers whose last client IP matches and try them first, before trying the prioritized list.

If a cipher is accessed by a single IP, it will always be tried first. If a cipher is accessed by multiple IPs simultaneously, it's likely to stay in the front of the priority list.

With the optimization, any extra latency will be almost gone for almost everyone, even if there are hundreds of active access keys.

@Mygod, the heuristic of pushing used ciphers to the front of the list, as well as the new one, are applied to both TCP and UDP.

riobard commented 5 years ago

@fortuna Neat! Almost two orders of magnitude latency reduction in the common case! I'm really surprised by how far you guys have pushed forward without changing the protocol 👍

Ehco1996 commented 3 years ago

I also implemented multi-user on one port, using Python asyncio.

The core idea is to use a DB order field to find the right user.

The code is here:

https://github.com/Ehco1996/aioshadowsocks/blob/052c472422955c4ade7d0e375c8d093231aff1a9/shadowsocks/mdb/models.py#L157

ghost commented 3 years ago

We can use the same technique to eliminate the need for encryption method selection. The server tries both AES-256-GCM and ChaCha20-Poly1305 with the same password (they have the same tag size and salt size, and thus exactly the same packet layout). The client chooses the fastest one depending on its platform.

Removing encryption selection might be too radical for us (and shortsighted: with this selector, we've introduced a new protocol), but it's still an option for other shadowsocks-like protocols.

lzm0 commented 2 years ago

This may be a stupid question, but what prevents us from using a HashSet for cipher lookup?

fortuna commented 2 years ago

@lzm0 there's no ID in the Shadowsocks protocol that can be mapped to the credentials to use, so there's no key to look up. That's why we need to use trial decryption.

database64128 commented 2 years ago

Shadowsocks 2022 (#196) has a protocol extension that brings native multi-user-single-port support without trial decryption: https://github.com/Shadowsocks-NET/shadowsocks-specs/blob/main/2022-2-shadowsocks-2022-extensible-identity-headers.md