Closed jmgasper closed 3 years ago
Challenge https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151 has been created for this ticket.This is an automated message for ghostar via Topcoder X
Challenge https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151 has been assigned to obog.This is an automated message for ghostar via Topcoder X
@jmgasper I am working on it.
@jmgasper All requests with response times >= 30 secs show high Unaccounted Wait. Unaccounted Wait means that the PHP process is waiting for I/O, but the time cannot be accounted for by data from the timeline trace, such as SQL queries or HTTP calls. This is often the case when the server is under high load and the PHP process executing the request is not getting CPU time to execute, effectively sleeping / waiting for the CPU.
To find the cause of the unaccounted wait time, I'm checking the Tideways documentation on how to add dynamic tracepoints to generate a callgraph containing all the PHP functions called during a request. We can temporarily boost the number of traces for a short period; this can be configured and managed from Tideways.
For example, removing a user from a group should delete a record from Gdn_UserGroup and update a cache. A normal trace looks like this:
In many requests (regardless of the request URL), you can see Gdn_UserGroup (the data is used to check permissions). The data is deleted from Memcached and then loaded again.
In some traces you can see GC calls. The summary doesn't show all calls. All requests have high Unaccounted Wait:
I'll keep you updated.
@atelomycterus - Sounds good, thanks.
@jmgasper The number of categories (challenges), users, etc. has grown in recent months, and the resources required to complete requests are increasing. For example, in April a request needed on average 35 MB of memory to execute; now the same request needs more than 50 MB.
Sporadic reports of slowness in the forums:
Responses are slow especially after actions that clear the cache (e.g. adding/updating categories). This also happens when there are many concurrent requests; from the information in Tideways and the fact that GC is running, I conclude that there are not enough resources to service requests.
Example 1: slow responses after adding a new category (new challenges added). We now have 3375 categories, and adding a new one takes about 10-12 secs.
Rebuild a tree, clear Categories cache:
The same issue with updating a category:
Example 2: Cached data is growing. 'Get' requests build breadcrumbs:
Monitor CPU/RAM usage: To further focus on improving performance, we need to look at server resources such as RAM and CPU. Unfortunately, Tideways doesn't provide such detailed data. Does Topcoder monitor CPU/RAM usage on PROD?
Tweak PHP-FPM parameters: PHP-FPM currently runs with the default configuration, which is not optimal. PHP-FPM has a lot of configuration parameters that determine how it performs; they have to be chosen based on available server resources such as RAM and CPU. This step should be done while monitoring CPU/RAM usage.
Optimize code: The more data there is, the slower things will work. In particular, all categories are loaded from the database and stored in the cache. Adding/updating a category is just one example; there are some already known places that need to be optimized or refactored significantly. This step should be done while monitoring CPU/RAM usage.
@atelomycterus - Thanks for this. I'll see if I can get details on what sort of CPU / RAM is available in prod.
@jmgasper We also need to know how many CPU cores there are. In dynamic mode, PHP-FPM dynamically manages the number of available child processes.
The following formulas can be used to calculate the values for each setting:

max_children = (total RAM – memory used for Linux, DB, Memcached, etc.) / process size
start_servers = number of CPU cores x 4
min_spare_servers = number of CPU cores x 2
max_spare_servers = same as start_servers
The configuration uses the following options:

pm.max_children - the maximum number of child processes allowed to be spawned.
pm.start_servers - the number of child processes to start when PHP-FPM starts.
pm.min_spare_servers - the minimum number of idle child processes PHP-FPM will keep. More are created if fewer than this number are available.
pm.max_spare_servers - the maximum number of idle child processes PHP-FPM will keep. If more child processes than this are idle, some will be killed off.
pm.process_idle_timeout - the idle time, in seconds, after which a child process will be killed.
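For reference, these options live in the PHP-FPM pool configuration (commonly /etc/php-fpm.d/www.conf or pool.d/www.conf, depending on the distribution). A sketch with placeholder values, to be replaced with numbers derived from the formulas above once the real RAM/CPU figures are known:

```ini
; PHP-FPM pool sketch -- placeholder values, not a recommendation.
; Derive the real numbers from the sizing formulas above.
pm = dynamic
pm.max_children = 50
pm.start_servers = 8
pm.min_spare_servers = 4
pm.max_spare_servers = 8
pm.process_idle_timeout = 10s
```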
@atelomycterus - These are the details of the VMs in dev and prod:
DEV
m5.large, 8.0 GiB, 2 vCPUs
Number of tasks: 1
Task CPU/RAM:
ECS Service CPU units: 100
ECS Service memory: 1 GB
PROD
m5.xlarge, 16.0 GiB, 4 vCPUs
Number of tasks: 1
Task CPU/RAM:
ECS Service CPU units: 100
ECS Service memory: 1 GB
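As a rough illustration, the sizing formulas from the earlier comment can be applied to the PROD VM above (16 GiB RAM, 4 vCPUs). The 4 GiB reserved for Linux/Memcached/etc. and the ~50 MB process size are assumptions for the sketch, not measured values:

```python
# Worked example of the PHP-FPM sizing formulas, using the PROD VM specs
# above (16 GiB RAM, 4 vCPUs). reserved_mb and process_size_mb are
# assumptions for illustration only.

def fpm_settings(total_ram_mb, reserved_mb, process_size_mb, cpu_cores):
    """Apply the rule-of-thumb formulas for the pm.* settings."""
    start_servers = cpu_cores * 4
    return {
        "pm.max_children": (total_ram_mb - reserved_mb) // process_size_mb,
        "pm.start_servers": start_servers,
        "pm.min_spare_servers": cpu_cores * 2,
        "pm.max_spare_servers": start_servers,  # same as start_servers
    }

print(fpm_settings(total_ram_mb=16384, reserved_mb=4096,
                   process_size_mb=50, cpu_cores=4))
```

Note that the task itself only reserves 1 GB of memory and 100 CPU units, so in practice the formulas would have to be applied to whatever the container is actually allowed to use, not the whole VM.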
@jmgasper A few questions:
@jmgasper https://github.com/topcoder-platform/forums-performance-tests is a UI-level data input script I created to check discussion loading times with multiple comments. I think we have to do a proper API-level performance test to cover the current requirement. I know Lakshmi and the team are doing performance tests for required applications; we might have to contact them for this requirement.
@jmgasper CPU is specified in units of cores. As a baseline, ECS considers each vCPU available to an EC2 container instance as 1024 units. For example, 2 vCPUs give 2,048 CPU units to schedule out. The task only specifies 100 CPU units (about 0.097 vCPU), which is what will be reserved for the container.
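The unit arithmetic above is just a fixed conversion (1 vCPU = 1024 units), which a few lines make concrete:

```python
# ECS CPU arithmetic: 1 vCPU = 1024 CPU units.
CPU_UNITS_PER_VCPU = 1024

def units_to_vcpu(units):
    """Convert an ECS CPU unit reservation to a fraction of a vCPU."""
    return units / CPU_UNITS_PER_VCPU

print(units_to_vcpu(100))       # the task's current 100-unit reservation
print(2 * CPU_UNITS_PER_VCPU)   # schedulable units on a 2-vCPU instance
```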
From the official documentation:
a value for Task memory (GB) and Task CPU (vCPU). The table below shows the valid combinations.
Without understanding how the system behaves under the current settings, it makes no sense to tweak php-fpm options.
Each minute, the Amazon ECS container agent on each container instance calculates the number of CPU units and MB of memory that are currently being used for each task owned by the service that is running on that container instance, and this information is reported back to Amazon ECS. The total amount of CPU and memory used for all tasks owned by the service that are running on the cluster is calculated, and those numbers are reported to CloudWatch as a percentage of the total resources that are specified for the service in the service's task definition.
It's important to understand what the Docker CPUUtilization/MemoryUtilization is for the forum. Could Topcoder provide screenshots for May/June from AWS? This data is available on the "Metrics" tab of the ECS Service details page in the AWS Console. It gives us a quick snapshot of how the forum is doing, whether it is bottlenecked by CPU, etc.
The list of most frequent requests and memory consumption per request
@Gunasekar-K - Can you provide more details please? ☝️
@atelomycterus - I got this for prod utilisation of the container. Seems concerning to me:
@jmgasper Obviously, there is a problem, but one screenshot doesn't give a complete understanding of the whole picture.
It is not clear from this chart what the 100% value for CPUUtilization corresponds to.
The reason for high usage (> 100%) is poor resource configuration of the cluster or the service. There are a lot of spikes, which means that insufficient resources have been allocated to the forum task; the reservations need to be increased to allow optimal resource usage.
It is likewise not clear from this chart what the 100% value for MemoryUtilization corresponds to.
I looked at the data in Tideways:
This is an example of how the CPU reservation should be calculated. If the service currently uses X% CPU with Y units reserved, adjust the reservation using the formula (X / 100) × Y × 1.3. For example, if the service now uses 200% CPU, we should reserve (200 / 100) × 100 × 1.3 = 260 CPU units. The 1.3 multiplier may not fit everyone's needs; it's just a baseline.
The performance issues in Tideways were on 1 June. On that date there were 200+ requests with high response times (30 secs - 2.5 mins). We need to analyze data for the previous weeks. Please have a look at the statistics: Average, Minimum, Maximum, 95th percentile. Perhaps CPU usage was higher because there was more activity on the forum. It's essential to specify correct values for CPU/memory; ignoring this leads to unstable operation of the forum itself and its neighbours (tasks running on the same instance).
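The reservation rule of thumb above is simple enough to sketch directly; the 1.3 headroom factor is the baseline mentioned in the comment, not a fixed requirement:

```python
# Reservation rule of thumb:
# new_units = (observed_utilization_pct / 100) * current_units * headroom
def suggested_cpu_units(observed_pct, current_units, headroom=1.3):
    """headroom=1.3 is the baseline factor from the comment; tune as needed."""
    return round((observed_pct / 100) * current_units * headroom)

print(suggested_cpu_units(200, 100))  # the example: 200% usage, 100 units -> 260
```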
For further steps, please provide access to the statistics, or Topcoder administrators need to check the resource usage of the cluster and of the individual service.
@atelomycterus - Thanks for that - I'm pushing this onto Topcoder to try to figure out. I'll open up a new ticket after we've bumped everything up and tested fully. It would be nice if I had access to all this stuff, but unfortunately I don't...
Payment task has been updated: https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151
Payments Complete
Winner: obog
Copilot: ghostar
Challenge 16a2ddb4-0dbc-4003-8c75-17c766f75151
has been paid and closed.This is an automated message for ghostar via Topcoder X
@jmgasper Ok. Thanks!
@atelomycterus - We're getting sporadic reports of slowness in the forums from regular members. Can you check tideways please?
From member Gassa: https://discussions.topcoder.com/discussion/7581/excited-to-see-the-new-forums
From member ged:
From member binchan: