Running the stressTesting code based on PR #67 seems to indicate memory leaks in either Agent.class or Simulation.class. Three indicative runs with the following parameters:
```
2018-11-21 13:24:14,116 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-01-24-VBMaha :
simulationId=SIM11-21-01-24-HcHRNr--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
```
2018-11-21 15:03:38,813 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-03-03-NWmsEL :
simulationId=SIM11-21-03-03-udbAEI--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
```
2018-11-21 19:45:27,940 [Test worker] INFO Experiments.class :
method=stressTesting :
message=after storing agentId and vertexId in Agent.class as parameters; may solve the timeout of getVertexId(); :
experimentId=EXP11-21-07-45-sIfu52 :
simulationId=SIM11-21-07-45-6DijXy--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
All three runs crash after reaching 1500-2000 agents:
```bash
cat OfferNet.log | grep '<init>' | grep SIM11-21-01-24-HcHRNr--ST | wc -l
2005
cat OfferNet.log | grep '<init>' | grep SIM11-21-03-03-udbAEI--ST | wc -l
1655
cat OfferNet.log | grep '<init>' | grep SIM11-21-07-45-6DijXy--ST | wc -l
1324
```
All system resources reach their maximum (all three simulations are shown on the same graph; the peaks indicate the times of the crashes):
The first simulation crashed because of a timeout in the actor system; the last two simulations crashed when the Java heap filled all available RAM. The first thing to check was how much RAM is allocated to which processes: DSE uses half of the system's memory (32GB), the Elastic Stack infrastructure uses another 10GB, and the offernet code fills all the rest when running. But this does not seem right -- Agents should not take this amount of memory.
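A quick way to attribute resident memory to processes from the shell (a sketch; the process name patterns CassandraDaemon, elasticsearch and GradleWorkerMain are guesses and may need adjusting to the actual process names on the box):

```bash
# Top memory consumers by resident set size (RSS is reported in KB).
ps -eo pid,rss,comm --sort=-rss | head -n 10

# Sum RSS per suspected component.
for p in CassandraDaemon elasticsearch GradleWorkerMain; do
  echo -n "$p: "
  ps -eo rss,args | grep "$p" | grep -v grep \
    | awk '{sum+=$1} END {printf "%.1f GB\n", sum/1024/1024}'
done
```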
This is what JVisualVM says so far (for the third simulation):
It seems from the heap histogram that most of the memory is taken by LinkedHashMaps and all kinds of arrays, which are probably global structures. Will need to track them down and optimize...
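The same histogram can be captured from the command line, which makes it easy to diff two snapshots and see which classes grow (a sketch; `<pid>` is the simulation JVM's process id as reported by jps):

```bash
# List running JVMs to find the simulation's pid.
jps -l

# Show the top of the histogram (classes sorted by shallow size); :live
# triggers a full GC first, so only reachable objects are counted --
# which is exactly what a leak looks like.
jmap -histo:live <pid> | head -n 25
```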
To do next:
Wish me luck.
By the way, 'crash' means that DSE crashes, usually while performing a garbage collection. So it could be that the problem is in DSE rather than in the ActorSystem, but I will look into the latter first and check DSE if other ideas lead nowhere...
It may be that, after all, the bottleneck is DSE CPU and memory consumption. I changed the DSE JVM heap settings to min 16GB and max 32GB.
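For the record, roughly these lines were changed (a sketch; depending on the DSE version the heap is set in `jvm.options` or `cassandra-env.sh`, and the file location differs per install):

```
# jvm.options: fix the heap between 16 GB and 32 GB
-Xms16G
-Xmx32G
```

With these settings, the next run logged: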
```
2018-11-22 12:05:02,489 [Test worker] INFO Experiments.class :
method=stressTesting :
experimentId=EXP11-22-12-05-YKEqRd :
simulationId=SIM11-22-12-05-26YYjN--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
the situation is the same and the simulation crashes after creating about 2000 agents:
```bash
cat ~/offernet/logs/OfferNet.log | grep '<init>' | grep SIM11-22-12-05-26YYjN--ST | wc -l
1990
```
Here is what the full system load looks like:
And here is what the OfferNet ActorSystem load looks like:
Further, the Cassandra log file shows that the crash happened while performing a garbage collection:
```
cat /var/log/cassandra/debug.log | tail -n 50
[...]
INFO [Service Thread] 2018-11-22 12:50:31,134 GCInspector.java:290 - G1 Young Generation GC in 382ms. G1 Eden Space: 15250489344 -> 0; G1 Old Gen: 10234102064 -> 10791944504; G1 Survivor Space: 1308622848 -> 1207959552;
DEBUG [ScheduledTasks:1] 2018-11-22 12:50:31,441 MonitoringTask.java:173 - 3 operations were slow in the last 5000 msecs:
<SELECT * FROM offernet.work_e WHERE community_id = 1496222336 AND member_id = 49 AND ~~edge_label_id = 65550 LIMIT 50000>, time 505 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 2052914560 AND member_id = 189 AND ~~edge_label_id = 65553 LIMIT 50000>, time 505 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 2140990336 AND member_id = 212 AND ~~edge_label_id = 65553 LIMIT 50000>, time 504 msec - slow timeout 500 msec
INFO [Service Thread] 2018-11-22 12:50:33,567 GCInspector.java:290 - G1 Young Generation GC in 439ms. G1 Eden Space: 13103005696 -> 0; G1 Old Gen: 10787750224 -> 11412701512; G1 Survivor Space: 1207959552 -> 1107296256;
INFO [Service Thread] 2018-11-22 12:50:36,812 GCInspector.java:290 - G1 Young Generation GC in 352ms. G1 Eden Space: 17230200832 -> 0; G1 Old Gen: 11408507232 -> 11416896344; G1 Survivor Space: 1107296256 -> 1627389952;
INFO [Service Thread] 2018-11-22 12:50:39,112 GCInspector.java:290 - G1 Young Generation GC in 477ms. G1 Eden Space: 16181624832 -> 0; G1 Old Gen: 11412702064 -> 12020875632; G1 Survivor Space: 1627389952 -> 1660944384;
INFO [Service Thread] 2018-11-22 12:50:41,169 GCInspector.java:290 - G1 Young Generation GC in 543ms. G1 Eden Space: 10301210624 -> 0; G1 Old Gen: 12016681352 -> 13040091528; G1 Survivor Space: 1660944384 -> 1065353216;
INFO [Service Thread] 2018-11-22 12:50:44,044 GCInspector.java:290 - G1 Young Generation GC in 371ms. G1 Eden Space: 14319353856 -> 0; G1 Old Gen: 13035897248 -> 13677625768; G1 Survivor Space: 1065353216 -> 964689920;
INFO [Service Thread] 2018-11-22 12:50:47,024 GCInspector.java:290 - G1 Young Generation GC in 532ms. G1 Eden Space: 15233712128 -> 0; G1 Old Gen: 13673431488 -> 13681820600; G1 Survivor Space: 964689920 -> 1610612736;
INFO [Service Thread] 2018-11-22 12:50:48,698 GCInspector.java:290 - G1 Young Generation GC in 447ms. G1 Eden Space: 5242880000 -> 0; G1 Old Gen: 13681820600 -> 15279849928; G1 Survivor Space: 1610612736 -> 285212672;
INFO [Service Thread] 2018-11-22 12:50:51,492 GCInspector.java:290 - G1 Young Generation GC in 286ms. G1 Eden Space: 14562623488 -> 0; G1 Old Gen: 15275655648 -> 15284044760; G1 Survivor Space: 285212672 -> 830472192;
DEBUG [ScheduledTasks:1] 2018-11-22 12:50:51,499 MonitoringTask.java:173 - 4 operations were slow in the last 4763 msecs:
<SELECT * FROM offernet.agent_e WHERE community_id = 553039488 AND member_id = 17 AND ~~edge_label_id IN (65556, 65557) LIMIT 50000>, time 561 msec - slow timeout 500 msec
<SELECT * FROM offernet.agent_e WHERE community_id = 1879330176 AND member_id = 13 AND ~~edge_label_id IN (65556, 65557) LIMIT 50000>, time 567 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_p WHERE community_id = 328333824 AND member_id = 142 LIMIT 50000>, time 561 msec - slow timeout 500 msec
<SELECT * FROM offernet.item_e WHERE community_id = 1058980096 AND member_id = 282 AND ~~edge_label_id = 65553 LIMIT 50000>, time 578 msec - slow timeout 500 msec
INFO [Service Thread] 2018-11-22 12:50:53,990 GCInspector.java:290 - G1 Young Generation GC in 336ms. G1 Eden Space: 13883146240 -> 0; G1 Old Gen: 15279850480 -> 15288239088; G1 Survivor Space: 830472192 -> 1400897536;
INFO [Service Thread] 2018-11-22 12:50:55,826 GCInspector.java:290 - G1 Young Generation GC in 447ms. G1 Eden Space: 8631877632 -> 0; G1 Old Gen: 15288239088 -> 16097739248; G1 Survivor Space: 1400897536 -> 914358272;
[...]
```
To do:
- figure out whether there are DSE performance issues;
- profile the Gremlin queries to understand whether there are any terrible inefficiencies there...
These may be relevant to check (in decreasing order of importance) regarding DSE:
And Gremlin query optimization:
It seems that the problem is complex in the sense that it may be due not to one component but to the whole ecosystem... Judging naively from GC log analysis, garbage collection may not be the main problem. Here is the gceasy.io analysis for a run with the above parameters (see report) -- the peak heap usage stays far from the limits allocated to DSE. That seems to hint that it is not the DSE JVM's garbage collector that eats the memory...
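For reference, a GC log suitable for gceasy.io can be produced with Java 8 style flags like these (a sketch; Cassandra's jvm.options usually ships with similar lines, some of them commented out):

```
# jvm.options: Java 8 GC logging with rotation
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
```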
This simulation crashed like all the previous ones, by reaching the limit of overall system memory:
Finally managed to finish the mini-stressTesting with these parameters:
Full parameters are here:
```
2018-11-28 15:36:28,364 [Test worker] INFO Experiments.class :
method=stressTesting :
experimentId=EXP11-28-03-36-WXmRtb :
simulationId=SIM11-28-03-36-yHwSVk--ST :
parameters={
similarityConnectThreshold=0.99,
initialAgentNumber=50,
newAgentsPerSecond=1,
chainLenghts=[5, 10],
maxSimilaritySearchDistance=20,
maxCycleSearchDistance=15,
similaritySearchThreshold=0.99,
intialGraphType=smallWorld,
timeoutStressTesting=7200000,
randomPairsRatePerAgentPerDay=1.50,
chainCreationRatePerAgentPerDay=0.140,
randomCycleSearchRatePerAgentPerDay=3.0,
targetedCycleSearchRatePerAgentPerDay=0.120,
searchAndConnectRatePerAgentPerDay=1.0,
millisBetweenRandomPairsPerAgent=57600036.86402359,
millisBetweenChainCreationPerAgent=617142998.2041139,
millisBetweenRandomCycleSearchPerAgent=28800018.43201180,
millisBetweenTargetedCycleSearchPerAgent=719994240.046080,
millisBetweenSearchAndConnectPerAgent=86400000.5529600035}
```
It creates 2853 agents during the whole simulation:
```bash
cat OfferNet.log | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | wc -l
2853
```
Here are the whole-system loads:
and the OfferNet ActorSystem loads:
The simulation finishes nicely in 2 hours, somewhere around 17:38, but it creates only 2853 new agents out of the expected 60 × 60 × 2 = 7200 (1 agent per second). The reason is that at the later stages of the simulation, the time needed to create an agent increases from tens of milliseconds to tens of seconds:
```
kabir@Ubuntu-1804-bionic-64-minimal:~/offernet/logs$ cat OfferNet.log | grep Agent.class | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | head -n 3
2018-11-28 15:36:38,934 [OfferNet-akka.actor.default-dispatcher-3] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=ac99dfdb-dfd5-418c-acd7-e544d91fc9a4 : wallTime_ms=487 msec.
2018-11-28 15:36:38,993 [OfferNet-akka.actor.default-dispatcher-2] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=1e47de98-7be2-4208-be97-0549d463927c : wallTime_ms=56 msec.
2018-11-28 15:36:39,044 [OfferNet-akka.actor.default-dispatcher-3] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=14161496-8c84-4af0-b33a-15f97c764360 : wallTime_ms=50 msec.
[...]
kabir@Ubuntu-1804-bionic-64-minimal:~/offernet/logs$ cat OfferNet.log | grep Agent.class | grep '<init>' | grep SIM11-28-03-36-yHwSVk--ST | tail -n 3
2018-11-28 17:36:32,366 [OfferNet-akka.actor.default-dispatcher-23] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=fa00ac00-bda7-423b-963d-61937ff80d83 : wallTime_ms=10813 msec.
2018-11-28 17:36:34,430 [OfferNet-akka.actor.default-dispatcher-21] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=40934ae2-a76f-436f-b456-b55ec49d7ea3 : wallTime_ms=2063 msec.
2018-11-28 17:36:49,377 [OfferNet-akka.actor.default-dispatcher-21] INFO Agent.class : method=<init> : simulationId=SIM11-28-03-36-yHwSVk--ST : agentId=cb3d8c49-caad-4421-882c-874131f485b5 : wallTime_ms=14946 msec.
```
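To see the full trend rather than just the head and tail, the creation times can be extracted into "time wallTime_ms" pairs for plotting (a sketch against the log format above; assumes GNU sed):

```bash
# Emit one "HH:MM:SS wallTime_ms" pair per agent creation, ready for gnuplot.
grep '<init>' OfferNet.log | grep SIM11-28-03-36-yHwSVk--ST \
  | sed -E 's/^[0-9-]+ ([0-9:]+),.*wallTime_ms=([0-9]+) msec.*/\1 \2/'
```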
The increase seems to be due to growing read request latencies on DSE:
which could be related to the following slow queries:
concretely:
```
1311 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.item_p | offernet_actor_system | SELECT * FROM "offernet"."item_p" WHERE "community_id" = ? AND "member_id" = ? LIMIT ? ALLOW FILTERING;
1303 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.item_e | offernet_actor_system | SELECT * FROM "offernet"."item_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" IN (?, ?) AND "~similarity" >= ? LIMIT ? ALLOW FILTERING;
1303 ms | 11/28/2018, 4:33:40 PM | 127.0.0.1 | offernet.agent_e | offernet_actor_system | SELECT * FROM "offernet"."agent_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" = ? LIMIT ? ALLOW FILTERING;
1206 ms | 11/28/2018, 4:21:43 PM | 127.0.0.1 | offernet.work_e | offernet_actor_system | SELECT * FROM "offernet"."work_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" = ? LIMIT ? ALLOW FILTERING;
1134 ms | 11/28/2018, 4:23:14 PM | 127.0.0.1 | offernet.agent_p | offernet_actor_system | SELECT * FROM "offernet"."agent_p" WHERE "community_id" = ? AND "member_id" = ? LIMIT ? ALLOW FILTERING;
1134 ms | 11/28/2018, 4:23:14 PM | 127.0.0.1 | offernet.agent_e | offernet_actor_system | SELECT * FROM "offernet"."agent_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" IN (?, ?) LIMIT ? ALLOW FILTERING;
[...]
```
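To see where these ALLOW FILTERING reads spend their time, Cassandra can trace a small sample of live queries (a sketch; tracing adds overhead, so the probability is kept low and switched off afterwards):

```bash
# Trace ~1% of all queries; sessions are recorded in the system_traces keyspace.
nodetool settraceprobability 0.01

# After letting the simulation run for a while, inspect the recorded sessions.
cqlsh -e "SELECT session_id, duration, request FROM system_traces.sessions LIMIT 20;"

# Switch tracing off again.
nodetool settraceprobability 0
```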
Bottom line, all this could be related to:
So:
Voila.
Troubleshooting slow reads on DSE:
https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/slowReads.html says that:

> Get the number of SSTables consulted for each read. Check this number with nodetool tablehistograms. A median value over 2 or 3 is likely causing problems.
The only table that hits multiple SSTables is item_e, which holds the 'similarity' edges between items -- the busiest and largest table, which can hold a few million edges.
Some simulations (this one creating 5 agents per second) give this:
```
kabir@Ubuntu-1804-bionic-64-minimal:~$ nodetool tablehistograms -- offernet item_e
offernet/item_e histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%             3.00           8.19         98.30            2299          60
75%             3.00          10.24        262.14            9887         258
95%             3.00          14.34        524.29           11864         372
98%             3.00          20.48        786.43           14237         372
99%             3.00          81.92       2097.15           14237         372
Min             0.00           3.07         16.38              73           0
Max             3.00         229.38     671088.64           17084         446
```
A median of 3 SSTables is supposedly 'likely causing problems'. Worth checking from time to time, but the reading is not consistent -- most of the time the SSTables column shows 0...
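Since the reading is not consistent, it may be worth sampling it periodically during a run instead of checking once (a sketch):

```bash
# Append a timestamped snapshot of the item_e histograms every minute.
while true; do
  date >> item_e_histograms.log
  nodetool tablehistograms -- offernet item_e >> item_e_histograms.log
  sleep 60
done
```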
Stress testing of the current framework would involve persistently running new cycle generation and search processes on a moderately large network (1-5 million agents), based on the code for milestone 0.0.1. All processes will be decentralized. The goal is to push the scale of the simulations (in terms of the number of actors, the number of links between them, the number of events per second, etc.) towards something that may resemble a real setting.
The minimum scale of a stress-test simulation would be: number of agents: 1M; number of active daily agents: 100K ('active daily agents' are those which put up new offer-demand pairs or initiate search processes).
The maximum goal would be to reach the scale at which the framework cannot cope with the size and load anymore...