Reimplemented LookupRateLimiter

Nucs commented 3 years ago

Hey @mathpaquette, I had few motivations to make this change, heres the changelog

I replaced Task.Delay with System.Threading.Timer which is exactly what Task.Delay does behind the scenes. Just no unnecessary Task and other AsyncStateMachine internal allocations.
Added LookupRateLimiter.Release method which triggers at the right point of time the StopWatch.
it is triggered when we received MessageContainer.IsEnd message.
Replaced the StopWatch with a similar mechanism struct FrequencyStopwatch which does not loose time across the delay/timer invocation.
LookupRateLimiter is constructed with 50 initial semaphore count.
Theres a higherPrecision flag in LookupRateLimiter constructor which in will produces calls more precision at the cost of 1 additional heap allocation and some cycles. It is false by default as for frequency 50 it seems to be not enough beneficial.
LookupRateLimiter.WaitAsync: removed unnecessary async keyword
I've added ConcurrencyBenchmarkHistoricalExample which does a stress test to the lookup calls.
Added inevitable lock, it only performs simple cpu instructions in the locking scope

All those combined reduces the heap allocation every call and timer invocation to minimum and the throttling is more identical to IQFeeds server-side. Also propably thanks to eliminating the two unnecessary AsyncStateMachine+Task, it is also likely to be faster than previous implementation regardless to the lock.

Please kindly review and test. Note that I've ran all examples and tests in production environment.

mathpaquette commented 3 years ago

@Nucs theoretically it sounds like a good idea but I'm trying with 14 concurrent clients and I'm only getting <20 requests per second from your ConcurrencyBenchmarkHistoricalExample when the limit is way beyond.

Also I noticed that your approach is using Windows internals and this is definitely not something we're aiming...

Also I noticed that for time to time, one unit test is unstable.

mathpaquette commented 3 years ago

@Nucs is there a way to highlight performance issues that you faced with the previous implementation. I mean, there's no CPU load at all.. How did you monitor all of this before raising this concern ? Just curious.

Nucs commented 3 years ago

@Nucs is there a way to highlight performance issues that you faced with the previous implementation. I mean, there's no CPU load at all.. How did you monitor all of this before raising this concern ? Just curious.

There weren't "performance issues" but rather 5+ unnecessary heap allocations occurring every loop.

In our private conversation I was referring to Task.Delay when using in high frequency (every Tick in my case). So obviously you will not feel ANY performance issues when the delay is 20ms between a call.

In the perspective of multithreaded model, there were multiple issues and in the new implementation I covered all of them. Previous implementation code was working relatively fine but in an optimistic way that timing the delay will somewhat match. Therefore that's why you could've received random exceptions. Worth saying that if you are convinced that this implementation is slower than previous one - you are welcome to set up a BenchmarkDotNet experiment .

I noticed that your approach is using Windows internals and this is definitely not something we're aiming...

This is .NET 5's implementation for Stopwatch with some of my changes. nothing there accesses undocumented code therefore nothing will suddenly break throughout newer versions if that is what you are implying. Merely a copy paste of Stopwatch.cs turning it into a struct. The parameters I'm getting are exactly what Stopwatch is getting. Let me know if that still serves as a problem.

I'm only getting <20 requests per second from your ConcurrencyBenchmarkHistoricalExample when the limit is way beyond.

That has to do with other factors. If you'll run the unit tests, they all report 20ms per request with an offset of 2% top.

Also I noticed that for time to time, one unit test is unstable.

Thats related to the benchmark. I am intentionally stressing the ThreadPool (which Tasks run on). It seems that the test has overloaded CPU because if you look how long it took to allocate the tasks - Therefore it cpu overload. Edit: the previous impl was accidently doubling number of requests based on the number of clients which causes non-objective benchmark.

Nucs commented 3 years ago

I applied some changes to ConcurrencyBenchmarkHistoricalExample, it was accidently doubling the number of requests when using more clients. Here is an objective benchmark with a reasonable burst of 100 tasks.

Running on I6-6700K, 8 threads 4 cores. 4 Clients:

8 Clients:

16 Clients:

As you can see, when the threadpool is not overloaded to a choke point - it seems to scale faster with higher number of clients.

Edit: "average call request time" is better the closer it is to 20ms.

mathpaquette commented 3 years ago

@Nucs did additional tests with current implementation and overall on 10 seconds, I have less than 1% in variation... Honestly I don't get the value for all this complexity. Just check the result by yourself.... And overall dont know what are we solving.

mathpaquette commented 3 years ago

your tests are unstable on CI and locally
you still have the issue with > 8 clients. definitely not something Ill spend more time on.

Nucs commented 3 years ago

Just to be clear.

Currents tests are stable because current tests target below 8 clients. (you dont fail what you dont test).
There is no proper stresstest test to show the library is stable which my PR concluded that before my changes - it isn't.
The solution in this PR overcomes current implementation's limits in all aspects, including number of clients it can run without failing.
Since you are not planning to spend more time on allowing more than 8 clients, i'll be forking away.

mathpaquette commented 3 years ago

Just to be clear.

Currents tests are stable because current tests target below 8 clients. (you dont fail what you dont test).

There is no proper stresstest test to show the library is stable which my PR concluded that before my changes - it isn't.

The solution in this PR overcomes current implementation's limits in all aspects, including number of clients it can run without failing.

Since you are not planning to spend more time on allowing more than 8 clients, i'll be forking away.

@Nucs

I dont know why this is not clear for you. Are you checking CI before saying your tests are stable? Unit tests are broken! Look if you want to merge something here, make sure your code is working properly and not in draft mode. I'm doing this on my free time and no user reported this issue.

mathpaquette commented 3 years ago

@Nucs your build doesnt pass and I cant even try out your version.

mathpaquette / IQFeed.CSharpApiClient

Reimplemented LookupRateLimiter #101