tumblr / colossus

I/O and Microservice library for Scala
Apache License 2.0

Remove connection limiter and slow start #621

Closed dxuhuang closed 6 years ago

dxuhuang commented 6 years ago

@benblack86 @amotamed @aliyakamercan @jlbelmonte @DanSimon

Addresses https://github.com/tumblr/colossus/issues/617. I DO see this used in messenger and LS, though.

codecov-io commented 6 years ago

Codecov Report

Merging #621 into develop will decrease coverage by 0.14%. The diff coverage is 100%.


@@             Coverage Diff             @@
##           develop     #621      +/-   ##
===========================================
- Coverage    84.91%   84.76%   -0.15%     
===========================================
  Files           99       98       -1     
  Lines         4460     4432      -28     
  Branches       370      360      -10     
===========================================
- Hits          3787     3757      -30     
- Misses         673      675       +2
Impacted Files Coverage Δ
.../src/main/scala/colossus/core/ServerSettings.scala 100% <ø> (ø) :arrow_up:
...s/src/main/scala/colossus/core/server/Server.scala 94.36% <100%> (-0.08%) :arrow_down:
colossus/src/main/scala/colossus/core/Worker.scala 76.99% <0%> (-1.77%) :arrow_down:

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 41c6d99...485e000.

DanSimon commented 6 years ago

Why is this being removed?

benblack86 commented 6 years ago

The slow-start concept isn't useful since the service still takes the same amount of traffic, just with fewer threads. The real solution is to actually send less traffic to the service, i.e. configure the proxy to ramp traffic up over time.

DanSimon commented 6 years ago

So a few things:

  1. The amount of traffic handled during startup would depend on how the clients behave, but I agree that overall it's not an effective solution. All threads are always used and only the number of open connections per worker is restricted, so if the service is behind a proxy that is pooling connections it probably won't help much. So overall I'm 👍 with removing this.

  2. I don't agree, though, that the solution is just to require the clients to send less traffic. We should be thinking defensively and assume that clients are unaware of each other and/or assume the service can regulate itself. Personally I think the proper solution is true rate-limiting, where we can configure a global concurrency or requests-per-second limit across a service (see the sketch after this list). Right now the only kind of request limiting is capping the size of the request buffer, and that's a per-connection limit. Definitely should have a larger discussion about this.

  3. I think instead of just removing this abruptly, we should get in the habit of first marking it as deprecated for a release and then removing it in the following release.
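
To make point 2 concrete, here's a rough sketch of what a service-wide concurrency limiter could look like. This is plain Scala with made-up names (`GlobalConcurrencyLimiter` etc.), not an existing Colossus API:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Sketch only: a single counter shared by every connection handler in the
// service. Nothing here is existing Colossus API.
class GlobalConcurrencyLimiter(maxConcurrent: Int) {
  private val inFlight = new AtomicInteger(0)

  /** Try to reserve a slot; false means the service-wide limit is hit. */
  def tryAcquire(): Boolean = {
    if (inFlight.incrementAndGet() > maxConcurrent) {
      inFlight.decrementAndGet()
      false
    } else true
  }

  /** Release the slot once the request completes (success or failure). */
  def release(): Unit = inFlight.decrementAndGet()
}

// A handler would check the limiter before doing any work, e.g.:
//   if (limiter.tryAcquire()) runRequest().onComplete(_ => limiter.release())
//   else rejectWith429()   // or queue the request (both names hypothetical)
```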

benblack86 commented 6 years ago

2 - Do you know how other frameworks handle this? At least in the world of containers/scheduling this is part of the rollout strategy, i.e. slowly moving traffic from one version to the next. Therefore, the service doesn't need to worry about it.

3 - I agree, although since this was added recently and (I think?) there are no docs, I'd be surprised if anyone apart from Tumblr is using it.

DanSimon commented 6 years ago

Re: point 2 - other frameworks handle it with rate-limiting. This can work either by returning something like an HTTP 429 or by simply queueing up requests until the TCP window is full. We already do this in Colossus, except it's on a per-connection basis. We would need some kind of resource that all connections across a service could access in order to do global rate limiting. Then we could do a slow-start like the one being removed here, but since it would be per-service and based on request concurrency, it wouldn't matter how many connections were open.
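
As a rough sketch of that idea (again plain Scala with hypothetical names, not existing Colossus code): the effective concurrency limit ramps from a small initial value to the full value over a warm-up window, independent of connection count:

```scala
import scala.concurrent.duration._

// Sketch only: the effective concurrency limit grows linearly from
// `initial` to `max` over `rampUp`, measured from service start.
class SlowStartLimit(initial: Int, max: Int, rampUp: FiniteDuration) {
  private val startedAt = System.nanoTime()

  def currentLimit(): Int = {
    val elapsed = (System.nanoTime() - startedAt).nanos
    if (elapsed >= rampUp) max
    else {
      val fraction = elapsed.toMillis.toDouble / rampUp.toMillis
      initial + math.round((max - initial) * fraction).toInt
    }
  }
}

// Combined with a service-wide in-flight counter (like the limiter sketched
// earlier), requests beyond currentLimit() could be queued or answered with 429.
```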

Re: point 3 - I agree, so I'd say in this case I'm 👍 with merging, though I hope going forward we can be a bit more careful with how we handle stuff like this.