Running the stressTesting code based on PR #67 seems to indicate memory leaks in either Agent.class or Simulation.class. Three indicative runs with the following parameters:
```
2018-11-21 13:24:14,116 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-01-24-VBMaha :
simulationId=SIM11-21-01-24-HcHRNr--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
```
2018-11-21 15:03:38,813 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-03-03-NWmsEL :
simulationId=SIM11-21-03-03-udbAEI--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
```
2018-11-21 19:45:27,940 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-07-45-sIfu52 :
simulationId=SIM11-21-07-45-6DijXy--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
All three runs crash after reaching 1500-2000 agents:
```bash
cat OfferNet.log | grep '<init>' | grep SIM11-21-01-24-HcHRNr--ST | wc -l
2005
cat OfferNet.log | grep '<init>' | grep SIM11-21-03-03-udbAEI--ST | wc -l
1655
cat OfferNet.log | grep '<init>' | grep SIM11-21-07-45-6DijXy--ST | wc -l
1324
```
All system resources reach their maximum (all three simulations are shown on the same graph; the peaks indicate the times of the crashes):
The first simulation crashed because of a timeout in the actor system; the last two simulations crashed when the Java heap filled all available RAM. The first thing to check was how much RAM is allocated to which processes: DSE uses half of the system's memory (32GB), the Elastic Stack infrastructure uses another 10GB, and the offernet code fills all the rest when running. But this does not seem right -- Agents should not take this amount of memory.
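A quick way to attribute resident memory to processes from the shell (a sketch; the process name patterns CassandraDaemon, elasticsearch and GradleWorkerMain are guesses and may need adjusting to the actual process names on the box):

```bash
# Top memory consumers by resident set size (RSS is reported in KB).
ps -eo pid,rss,comm --sort=-rss | head -n 10

# Sum RSS per suspected component.
for p in CassandraDaemon elasticsearch GradleWorkerMain; do
  echo -n "$p: "
  ps -eo rss,args | grep "$p" | grep -v grep \
    | awk '{sum+=$1} END {printf "%.1f GB\n", sum/1024/1024}'
done
```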
This is what JVisualVM says so far (for the third simulation):
It seems from the heap histogram that most of the memory is taken by LinkedHashMaps and all kinds of arrays, which are probably global structures. Will need to track them down and optimize...
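The same histogram can be captured from the command line, which makes it easy to diff two snapshots and see which classes grow (a sketch; `<pid>` is the simulation JVM's process id as reported by jps):

```bash
# List running JVMs to find the simulation's pid.
jps -l

# Show the top of the histogram (classes sorted by shallow size); :live
# triggers a full GC first, so only reachable objects are counted --
# which is exactly what a leak looks like.
jmap -histo:live <pid> | head -n 25
```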
To do next:
Wish me luck.
By the way, 'crash' means that DSE crashes, usually while performing a garbage collection. So it could be that the problem is in DSE rather than in the ActorSystem, but I will look into the latter first and check DSE if other ideas lead nowhere...
It may be that, after all, the bottleneck is DSE CPU and memory consumption. I changed the DSE JVM heap settings to min 16GB and max 32GB.
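For the record, roughly these lines were changed (a sketch; depending on the DSE version the heap is set in `jvm.options` or `cassandra-env.sh`, and the file location differs per install):

```
# jvm.options: fix the heap between 16 GB and 32 GB
-Xms16G
-Xmx32G
```

With these settings, the next run logged: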
```
2018-11-22 12:05:02,489 [Test worker] INFO Experiments.class :
method=stressTesting :
experimentId=EXP11-22-12-05-YKEqRd :
simulationId=SIM11-22-12-05-26YYjN--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
the situation is the same and the simulation crashes after creating about 2000 agents:
```bash
cat ~/offernet/logs/OfferNet.log | grep '<init>' | grep SIM11-22-12-05-26YYjN--ST | wc -l
1990
```
Here is what the full system load looks like:
And here is what the OfferNet ActorSystem load looks like:
Further, the Cassandra log file shows that the crash happened while performing a garbage collection:
```
cat /var/log/cassandra/debug.log | tail -n 50
[...]
INFO [Service Thread] 2018-11-22 12:50:31,134 GCInspector.java:290 - G1 Young Generation GC in 382ms. G1 Eden Space: 15250489344 -> 0; G1 Old Gen: 10234102064 -> 10791944504; G1 Survivor Space: 1308622848 -> 1207959552;
DEBUG [ScheduledTasks:1] 2018-11-22 12:50:31,441 MonitoringTask.java:173 - 3 operations were slow in the last 5000 msecs:
<SELECT * FROM offernet.work_e WHERE community_id = 1496222336 AND member_id = 49 AND ~~edge_label_id = 65550 LIMIT 50000>, time 505 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 2052914560 AND member_id = 189 AND ~~edge_label_id = 65553 LIMIT 50000>, time 505 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 2140990336 AND member_id = 212 AND ~~edge_label_id = 65553 LIMIT 50000>, time 504 msec - slow timeout 500 msec
INFO [Service Thread] 2018-11-22 12:50:33,567 GCInspector.java:290 - G1 Young Generation GC in 439ms. G1 Eden Space: 13103005696 -> 0; G1 Old Gen: 10787750224 -> 11412701512; G1 Survivor Space: 1207959552 -> 1107296256;
INFO [Service Thread] 2018-11-22 12:50:36,812 GCInspector.java:290 - G1 Young Generation GC in 352ms. G1 Eden Space: 17230200832 -> 0; G1 Old Gen: 11408507232 -> 11416896344; G1 Survivor Space: 1107296256 -> 1627389952;
INFO [Service Thread] 2018-11-22 12:50:39,112 GCInspector.java:290 - G1 Young Generation GC in 477ms. G1 Eden Space: 16181624832 -> 0; G1 Old Gen: 11412702064 -> 12020875632; G1 Survivor Space: 1627389952 -> 1660944384;
INFO [Service Thread] 2018-11-22 12:50:41,169 GCInspector.java:290 - G1 Young Generation GC in 543ms. G1 Eden Space: 10301210624 -> 0; G1 Old Gen: 12016681352 -> 13040091528; G1 Survivor Space: 1660944384 -> 1065353216;
INFO [Service Thread] 2018-11-22 12:50:44,044 GCInspector.java:290 - G1 Young Generation GC in 371ms. G1 Eden Space: 14319353856 -> 0; G1 Old Gen: 13035897248 -> 13677625768; G1 Survivor Space: 1065353216 -> 964689920;
INFO [Service Thread] 2018-11-22 12:50:47,024 GCInspector.java:290 - G1 Young Generation GC in 532ms. G1 Eden Space: 15233712128 -> 0; G1 Old Gen: 13673431488 -> 13681820600; G1 Survivor Space: 964689920 -> 1610612736;
INFO [Service Thread] 2018-11-22 12:50:48,698 GCInspector.java:290 - G1 Young Generation GC in 447ms. G1 Eden Space: 5242880000 -> 0; G1 Old Gen: 13681820600 -> 15279849928; G1 Survivor Space: 1610612736 -> 285212672;
INFO [Service Thread] 2018-11-22 12:50:51,492 GCInspector.java:290 - G1 Young Generation GC in 286ms. G1 Eden Space: 14562623488 -> 0; G1 Old Gen: 15275655648 -> 15284044760; G1 Survivor Space: 285212672 -> 830472192;
DEBUG [ScheduledTasks:1] 2018-11-22 12:50:51,499 MonitoringTask.java:173 - 4 operations were slow in the last 4763 msecs:
<SELECT * FROM offernet.agent_e WHERE community_id = 553039488 AND member_id = 17 AND ~~edge_label_id IN (65556, 65557) LIMIT 50000>, time 561 msec - slow timeout 500 msec
<SELECT * FROM offernet.agent_e WHERE community_id = 1879330176 AND member_id = 13 AND ~~edge_label_id IN (65556, 65557) LIMIT 50000>, time 567 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_p WHERE community_id = 328333824 AND member_id = 142 LIMIT 50000>, time 561 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 1058980096 AND member_id = 282 AND ~~edge_label_id = 65553 LIMIT 50000>, time 578 msec - slow timeout 500 msec
INFO [Service Thread] 2018-11-22 12:50:53,990 GCInspector.java:290 - G1 Young Generation GC in 336ms. G1 Eden Space: 13883146240 -> 0; G1 Old Gen: 15279850480 -> 15288239088; G1 Survivor Space: 830472192 -> 1400897536;
INFO [Service Thread] 2018-11-22 12:50:55,826 GCInspector.java:290 - G1 Young Generation GC in 447ms. G1 Eden Space: 8631877632 -> 0; G1 Old Gen: 15288239088 -> 16097739248; G1 Survivor Space: 1400897536 -> 914358272;
[...]
```
To do:
- figure out whether there are DSE performance issues;
- profile the Gremlin queries to understand whether there are any terrible inefficiencies there...
These may be relevant to check (in decreasing order of importance) regarding DSE:
And Gremlin query optimization:
It seems that the problem is complex in the sense that it may be due not to one component but to the whole ecosystem... Judging naively from GC log analysis, garbage collection may not be the main problem. Here is the gceasy.io analysis for a run with the above parameters (see report) -- the peak heap usage stays far from the limits allocated to DSE. That seems to hint that it is not the DSE JVM's garbage collector that eats the memory...
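For reference, a GC log suitable for gceasy.io can be produced with Java 8 style flags like these (a sketch; Cassandra's jvm.options usually ships with similar lines, some of them commented out):

```
# jvm.options: Java 8 GC logging with rotation
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
```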
This simulation crashed like all the previous ones, by reaching the limit of overall system memory:
Finally managed to finish the mini-stressTesting with these parameters:
Full parameters are here:
```
2018-11-28 15:36:28,364 [Test worker] INFO Experiments.class :
method=stressTesting :
experimentId=EXP11-28-03-36-WXmRtb :
simulationId=SIM11-28-03-36-yHwSVk--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
It creates 2853 agents during the whole simulation:
```bash
cat OfferNet.log | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | wc -l
2853
```
Here are the whole-system loads:
and the OfferNet ActorSystem loads:
The simulation finishes nicely in 2 hours, somewhere around 17:38, but it creates only 2853 new agents out of the expected 60 × 60 × 2 = 7200 (1 agent per second). The reason is that at the later stages of the simulation, the time needed to create an agent increases from tens of milliseconds to tens of seconds:
```
kabir@Ubuntu-1804-bionic-64-minimal:~/offernet/logs$ cat OfferNet.log | grep Agent.class | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | head -n 3
2018-11-28 15:36:38,934 [OfferNet-akka.actor.default-dispatcher-3] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=ac99dfdb-dfd5-418c-acd7-e544d91fc9a4 : wallTime_ms=487 msec.
2018-11-28 15:36:38,993 [OfferNet-akka.actor.default-dispatcher-2] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=1e47de98-7be2-4208-be97-0549d463927c : wallTime_ms=56 msec.
2018-11-28 15:36:39,044 [OfferNet-akka.actor.default-dispatcher-3] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=14161496-8c84-4af0-b33a-15f97c764360 : wallTime_ms=50 msec.
[...]
kabir@Ubuntu-1804-bionic-64-minimal:~/offernet/logs$ cat OfferNet.log | grep Agent.class | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | tail -n 3
2018-11-28 17:36:32,366 [OfferNet-akka.actor.default-dispatcher-23] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=fa00ac00-bda7-423b-963d-61937ff80d83 : wallTime_ms=10813 msec.
2018-11-28 17:36:34,430 [OfferNet-akka.actor.default-dispatcher-21] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=40934ae2-a76f-436f-b456-b55ec49d7ea3 : wallTime_ms=2063 msec.
2018-11-28 17:36:49,377 [OfferNet-akka.actor.default-dispatcher-21] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=cb3d8c49-caad-4421-882c-874131f485b5 : wallTime_ms=14946 msec.
```
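To see the full trend rather than just the head and tail, the creation times can be extracted into "time wallTime_ms" pairs for plotting (a sketch against the log format above; assumes GNU sed):

```bash
# Emit one "HH:MM:SS wallTime_ms" pair per agent creation, ready for gnuplot.
grep '<init>' OfferNet.log | grep SIM11-28-03-36-yHwSVk--ST \
  | sed -E 's/^[0-9-]+ ([0-9:]+),.*wallTime_ms=([0-9]+) msec.*/\1 \2/'
```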
The increase seems to be due to growing read request latencies on DSE:
which could be related to the following slow queries:
concretely:
```
1311 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.item_p | offernet_actor_system | SELECT * FROM "offernet"."item_p" WHERE "community_id" = ? AND "member_id" = ? LIMIT ? ALLOW FILTERING;
1303 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.item_e | offernet_actor_system | SELECT * FROM "offernet"."item_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" IN (?, ?) AND "~similarity" >= ? LIMIT ? ALLOW FILTERING;
1303 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.agent_e | offernet_actor_system | SELECT * FROM "offernet"."agent_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" = ? LIMIT ? ALLOW FILTERING;
1206 ms | 11/28/2018, 4:21:43 PM | 127.0.0.1 | offernet.work_e | offernet_actor_system | SELECT * FROM "offernet"."work_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" = ? LIMIT ? ALLOW FILTERING;
1134 ms | 11/28/2018, 4:23:14 PM | 127.0.0.1 | offernet.agent_p | offernet_actor_system | SELECT * FROM "offernet"."agent_p" WHERE "community_id" = ? AND "member_id" = ? LIMIT ? ALLOW FILTERING;
1134 ms | 11/28/2018, 4:23:14 PM | 127.0.0.1 | offernet.agent_e | offernet_actor_system | SELECT * FROM "offernet"."agent_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" IN (?, ?) LIMIT ? ALLOW FILTERING;
[...]
```
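To see where these ALLOW FILTERING reads spend their time, Cassandra can trace a small sample of live queries (a sketch; tracing adds overhead, so the probability is kept low and switched off afterwards):

```bash
# Trace ~1% of all queries; sessions are recorded in the system_traces keyspace.
nodetool settraceprobability 0.01

# After letting the simulation run for a while, inspect the recorded sessions.
cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"

# Switch tracing off again.
nodetool settraceprobability 0
```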
Bottom line, all this could be related to:
So:
Voila.
Troubleshooting slow reads on DSE:
https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/slowReads.html says that:

> Get the number of SSTables consulted for each read. Check this number with nodetool tablehistograms. A median value over 2 or 3 is likely causing problems.
The only table that hits multiple SSTables is item_e, which holds the 'similarity' edges between items -- the busiest and largest table, which can hold a few million edges.
Some simulations (this one creating 5 agents per second) give this:
```
kabir@Ubuntu-1804-bionic-64-minimal:~$ nodetool tablehistograms -- offernet item_e
offernet/item_e histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             3.00           8.19         98.30            2299          60
75%             3.00          10.24        262.14            9887         258
95%             3.00          14.34        524.29           11864         372
98%             3.00          20.48        786.43           14237         372
99%             3.00          81.92       2097.15           14237         372
Min             0.00           3.07         16.38              73           0
Max             3.00         229.38     671088.64           17084         446
```
A median of 3 SSTables is supposedly 'likely causing problems'. Worth checking from time to time, but the reading is not consistent -- most of the time the SSTables column shows 0...
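Since the reading is not consistent, it may be worth sampling it periodically during a run instead of checking once (a sketch):

```bash
# Append a timestamped snapshot of the item_e histograms every minute.
while true; do
  date >> item_e_histograms.log
  nodetool tablehistograms -- offernet item_e >> item_e_histograms.log
  sleep 60
done
```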
Stress testing of the current framework would involve persistently running new cycle generation and search processes on a moderately large network (1-5 million agents), based on the code for milestone 0.0.1. All processes will be decentralized. The goal is to push the scale of the simulations (in terms of the number of actors, the number of links between them, the number of events per second, etc.) towards something that may resemble a real setting.
The minimum scale of a stress-test simulation would be: number of agents: 1M; number of active daily agents: 100K ('active daily agents' are those which put up new offer-demand pairs or initiate search processes).
The maximum goal would be to reach the scale at which the framework cannot cope with the size and load anymore...