tomp2p / TomP2P

A P2P-based high performance key-value pair storage library
http://tomp2p.net
Apache License 2.0

DHT performance degrades with more values in the ring #128

Open sgoendoer opened 8 years ago

sgoendoer commented 8 years ago

Hi,

I stumbled upon an issue with the ring when it is flooded with values. I noticed that the more values there are in the ring, the more dramatically the response time increases.

Setup: I have 3 virtual nodes running. Not very performant ones, but oh well... They are running Debian Wheezy. TomP2P runs in version 4.4 (via Maven), and a Jetty server receives REST requests with values to be written to the ring.

Now I wrote a small shell script that pushes a key-value pair to one of the nodes in a loop, i.e. I am writing as much and as fast as I can to the ring. The script then measures and logs the time needed for such a request to complete.
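The original shell script is not included in this thread. As a rough sketch of the measurement loop described above, assuming a hypothetical `send` callable that stands in for one REST write to a node (the real endpoint and payload are not given here), the timing could look like this in Python:

```python
import time
from typing import Callable, List


def time_requests(send: Callable[[int], None], count: int) -> List[float]:
    """Call send(i) `count` times back to back; return each call's duration in seconds."""
    durations = []
    for i in range(count):
        start = time.perf_counter()
        send(i)  # hypothetical: one blocking write request for key number i
        durations.append(time.perf_counter() - start)
    return durations


def fake_send(i: int) -> None:
    """Placeholder for the real REST call (an assumption, it does nothing)."""
    return None


if __name__ == "__main__":
    # Log one duration per line, similar in spirit to requesttimes.txt.
    for duration in time_requests(fake_send, 10):
        print(f"{duration:.6f}")
```

Replacing `fake_send` with a function that performs the actual write keeps the loop sequential, so each request starts only after the previous one has completed.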

Results: At first, the request times are OK: an average of about 1.0 seconds, a median of 0.99, and a maximum of 1.3. Interestingly, there is a "recurring outlier": approx. every 50th request takes significantly more time to complete (around 1.5 seconds in the beginning).

Observing this for a few thousand requests, the average and median request times remain close to 1.0 to 1.2 seconds, while the duration of this "recurring outlier" increases linearly! After as few as 5k requests we are talking about a duration of 3.8 seconds already!

Apparently, with an increasing number of values being written to the ring, the performance changes for the worse. Big time! After approx. 33k requests, the outliers take up to 75 seconds (!!!!!) to complete, while the median duration remains close to what it was in the beginning: the median is still at 1.01 seconds (!!!!!), while the average has increased to 1.8 seconds (mainly due to the outliers, I guess).

Is this a known issue?

Graph: graph

raw data: requesttimes.txt

sgoendoer commented 8 years ago

Running the same setup now for 2 days straight. Max values go up as far as 190 seconds!

tbocek commented 8 years ago

Thanks for the report. Can you try the latest 5.0 release? It's still beta, but more stable than 4.4. Thanks.

sgoendoer commented 8 years ago

You mean beta8? We are currently working on including it. I will post results as soon as we have some.

An update on the issue: After approx. 50k datasets, the DHT was "full", so we stopped the test. "Full" meaning that requests took something like 20 minutes (!!!), regardless of whether we tried to read or write. I figure that was mainly due to RAM limitations, as our nodes all have just 1 GB of memory. We logged all data we pushed to the ring, and the log reached approx. 1 GB at around 50k datasets, so this might be the explanation. Anyhow, looking into the log files, the following line showed up a lot:

2016-02-08 12:09:53 INFO  Scheduler:99 - slow down, we have a huge backlog!

In the meantime, here is some data from the test I ran. I calculated the average, minimum, maximum, and median request times for each block of 1000 requests:

results

The missing max-value was 227.5485981 and I deleted it to make the chart readable. After 50k requests, data got MUCH worse...
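For anyone who wants to reproduce the per-1000 aggregation from the raw request times, here is a minimal Python sketch. The function name `window_stats` and the assumption that the durations are already loaded as a list of seconds are illustrative; the original analysis tooling is not shown in this thread.

```python
from statistics import mean, median
from typing import Dict, List


def window_stats(durations: List[float], window: int = 1000) -> List[Dict[str, float]]:
    """Split durations into consecutive blocks of `window` requests and summarize each."""
    stats = []
    for start in range(0, len(durations), window):
        chunk = durations[start:start + window]  # last block may be shorter
        stats.append({
            "avg": mean(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "median": median(chunk),
        })
    return stats


if __name__ == "__main__":
    # Toy data (not from the test): one slow request among three fast ones.
    print(window_stats([1.0, 1.0, 1.0, 5.0], window=4))
```

With the toy input, the single slow request raises the average to 2.0 while the median stays at 1.0, which is the same pattern as in the measurements: rare but very slow requests move the average and maximum without touching the median.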

ChronosXYZ commented 5 years ago

Have you used disk-based storage?