[$500] Check in on slowness reported

jmgasper commented 3 years ago

@atelomycterus - We're getting sporadic reports of slowness in the forums from regular members. Can you check tideways please?

From member Gassa:

When I am logged out, things are indeed fast. However, when I am logged in, every page of the forums loads for 30+ seconds.
https://discussions.topcoder.com/discussion/7581/excited-to-see-the-new-forums
From member ged:

But they don't work fast as advertise and cheered by others, at least to me. Is there something going on today, maybe related to traffic (recent announcement sent out)?
From member binchan

I also feel the new forum is quite slow, can take up almost 40 seconds just to load the page. It's a major turn off for people to visit the forum and post.

jmgasper commented 3 years ago

Challenge https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151 has been created for this ticket.

This is an automated message for ghostar via Topcoder X

jmgasper commented 3 years ago

Challenge https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151 has been assigned to obog.

This is an automated message for ghostar via Topcoder X

atelomycterus commented 3 years ago

@jmgasper I am working on it.

atelomycterus commented 3 years ago

@jmgasper All requests with a response time >= 30 secs have a high Unaccounted Wait. Unaccounted Wait means that the PHP process is waiting for I/O, but the time cannot be accounted for by data from the timeline trace, such as SQL queries or HTTP calls. This is often the case when the server is under high load and the PHP process executing the request is not getting CPU time, so it is effectively sleeping / waiting for the CPU.

To find the cause of the unaccounted wait time, I'm checking the Tideways documentation on how to add dynamic tracepoints to generate a callgraph that contains all the PHP functions called during a request. We can temporarily boost the number of traces for a short period of time. We can configure and manage this from Tideways.

For example, removing a user from a group should delete a record from Gdn_UserGroup and update a cache. The normal trace looks like this:

In many requests (no matter what the request URL is), you can see Gdn_UserGroup (this data is used to check permissions). The data has been deleted from Memcached and is then loaded again.

In some traces you can see GC calls. The summary doesn't show all calls. All requests have a high Unaccounted Wait:

I'll keep you updated.

jmgasper commented 3 years ago

@atelomycterus - Sounds good, thanks.

atelomycterus commented 3 years ago

@jmgasper The numbers of categories (challenges), users, etc. have grown in recent months, and the resources required to complete requests are increasing. For example, a request needed 35 MB of memory on average in April, but the same request now needs more than 50 MB.

| | April | May | June (6 June) |
| -- | -- | -- | -- |
| Requests | 225K-250K per week | 250K-290K per week | 300K |
| Avg request memory | 35 MB | 43 MB | 53 MB |
| Max request memory | 50 MB | 65 MB | 90 MB |

Sporadic reports of slowness in the forums:

| Date (DD.MM) | Count of requests with response time > 30 sec (up to 1-2.5 mins) |
| -- | -- |
| 5.06 | 23 |
| 2.06 | 9 |
| 1.06 | 190 |
| 31.05 | 43 |
| 30.05 | 9 |

Responses are especially slow after actions that clear the cache (e.g. adding/updating categories). This also happens when there are many requests at the same time. From the information in Tideways, and from the fact that GC is running, I conclude that there are not enough resources to service requests.

Example 1: slow responses after adding a new category (new challenges added). We now have 3375 categories, and adding a new category takes about 10-12 secs.

Rebuild a tree, clear Categories cache:

The same issue with updating a category:

Example 2: the cached data keeps growing. 'Get' requests build breadcrumbs:

All categories are loaded from the Categories cache. 23 MB has to be unserialized to build the breadcrumbs.

There are some steps to be taken:

1. Monitor CPU/RAM usage. To focus further on improving performance, we need to look at server resources such as RAM and CPU. Unfortunately, Tideways has no such detailed data. Does Topcoder monitor CPU/RAM usage on PROD?
2. Tweak PHP-FPM parameters. PHP-FPM currently runs with its default configuration, which is not optimal. PHP-FPM has a lot of configuration parameters that determine how it performs, and they have to be chosen based on the available server resources such as RAM and CPU. This step should be done while monitoring CPU/RAM usage.
3. Optimize code. The more data there is, the slower the forum works. In particular, all categories are loaded from the database and stored in the cache. Adding/updating a category is just one example; there are some already known places that need to be optimized or refactored significantly. This step should be done while monitoring CPU/RAM usage.

jmgasper commented 3 years ago

@atelomycterus - Thanks for this. I'll see if I can get details on what sort of CPU / RAM is available in prod.

atelomycterus commented 3 years ago

@jmgasper We also need to know how many CPU cores there are. In dynamic mode, PHP-FPM dynamically manages the number of available child processes.

The following formulas can be used to calculate the value for each setting:

max_children = (Total RAM - Memory used for Linux, DB, Memcached, etc.) / process size
start_servers = Number of CPU cores x 4
min_spare_servers = Number of CPU cores x 2
max_spare_servers = Same as start_servers

The configuration uses these options:

pm.max_children - The maximum number of child processes allowed to be spawned.
pm.start_servers - The number of child processes to start when PHP-FPM starts.
pm.min_spare_servers - The minimum number of idle child processes PHP-FPM will create. More are created if fewer than this number are available.
pm.max_spare_servers - The maximum number of idle child processes PHP-FPM will create. If there are more child processes available than this value, then some will be killed off.
pm.process_idle_timeout - The idle time, in seconds, after which a child process will be killed.
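The sizing rules above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `size_fpm_pool`, the 16 GiB / 4-core host, the 4 GiB reserved for the OS and other services, and the 90 MB process size (the max request memory reported from Tideways earlier in this thread) are hypothetical inputs, not measured PHP-FPM process sizes. The clamping to `max_children` is an addition so that the result stays a valid PHP-FPM configuration on small hosts.

```python
from dataclasses import dataclass


@dataclass
class FpmPoolSettings:
    max_children: int
    start_servers: int
    min_spare_servers: int
    max_spare_servers: int


def size_fpm_pool(total_ram_mb: int, reserved_mb: int,
                  process_size_mb: int, cpu_cores: int) -> FpmPoolSettings:
    """Apply the rules of thumb quoted above for pm = dynamic."""
    if process_size_mb <= 0 or cpu_cores <= 0:
        raise ValueError("process_size_mb and cpu_cores must be positive")

    # max_children = (total RAM - memory used by everything else) / process size
    available_mb = total_ram_mb - reserved_mb
    max_children = available_mb // process_size_mb
    if max_children < 1:
        raise ValueError("not enough memory for even one PHP-FPM child")

    # The spare/start values may not exceed max_children, so clamp them
    # (assumption added here; the quoted formulas do not mention it).
    start_servers = min(cpu_cores * 4, max_children)
    max_spare_servers = start_servers  # "same as start_servers"
    min_spare_servers = min(cpu_cores * 2, max_spare_servers)

    return FpmPoolSettings(max_children, start_servers,
                           min_spare_servers, max_spare_servers)


if __name__ == "__main__":
    # Hypothetical host: 16 GiB RAM, 4 GiB reserved, 90 MB per process, 4 cores.
    print(size_fpm_pool(16384, 4096, 90, 4))
```

The process size should come from real measurements of the PHP-FPM workers under load; the numbers here only show how the formulas combine.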

jmgasper commented 3 years ago

@atelomycterus - These are the details of the VMs in dev and prod:

DEV
m5.large    8.0 GiB 2 vCPUs
Number of task : 1
Task cpu/RAM:
ECS Service CPU Unit 100
ECS Service memory  1GB

PROD
m5.xlarge   16.0 GiB    4 vCPUs
Number of task : 1
Task cpu/RAM:
ECS Service CPU Unit 100
ECS Service memory  1GB

atelomycterus commented 3 years ago

@jmgasper A few questions:

Have QA engineers developed load/stress tests for Vanilla?
Can Topcoder provide access to CloudWatch to monitor CPU/Memory usage of an EC2?

jmgasper commented 3 years ago

I do think we have some stress tests: https://github.com/topcoder-platform/forums-performance-tests
I'll ask

sdgun commented 3 years ago

@jmgasper https://github.com/topcoder-platform/forums-performance-tests is a UI-level data input script I created to check discussion loading times with multiple comments. I think we have to do a proper API-level performance test to cover the current requirement. I know Lakshmi and the team are doing performance tests for the required applications. We might have to contact them for this requirement.

atelomycterus commented 3 years ago

@jmgasper ECS specifies CPU in CPU units. As a baseline, ECS considers each vCPU available to an EC2 container instance as 1,024 units. For example, 2 vCPUs provide 2,048 CPU units to schedule. The task only specifies 100 CPU units, so only about 0.098 vCPU (100 / 1,024) is reserved for the container.
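The unit conversion above is simple enough to check with a short Python sketch. The helper names are hypothetical; the only fact used is the 1,024 CPU units per vCPU baseline, applied to the 100-unit task and the 4 vCPU PROD instance listed earlier in this thread.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS baseline: one vCPU equals 1,024 CPU units


def cpu_units_to_vcpus(cpu_units: int) -> float:
    """Convert ECS CPU units to a fraction of a vCPU."""
    return cpu_units / CPU_UNITS_PER_VCPU


def instance_cpu_units(vcpus: int) -> int:
    """Total CPU units an instance with the given vCPU count can schedule."""
    return vcpus * CPU_UNITS_PER_VCPU


def task_share_of_instance(task_cpu_units: int, instance_vcpus: int) -> float:
    """Fraction of the instance's CPU units reserved by one task."""
    return task_cpu_units / instance_cpu_units(instance_vcpus)


if __name__ == "__main__":
    print(f"{cpu_units_to_vcpus(100):.3f} vCPU reserved by the task")
    print(f"{instance_cpu_units(4)} CPU units on a 4 vCPU instance")
    print(f"{task_share_of_instance(100, 4):.2%} of the instance reserved")
```

With these inputs the task reserves roughly a tenth of one vCPU, which is a small slice of a 4 vCPU instance.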

From official document:

a value for Task memory (GB) and Task CPU (vCPU). The table below shows the valid combinations.

Without understanding how the system behaves at the current settings, it makes no sense to tweak PHP-FPM options.

Each minute, the Amazon ECS container agent on each container instance calculates the number of CPU units and MB of memory that are currently being used for each task owned by the service that is running on that container instance, and this information is reported back to Amazon ECS. The total amount of CPU and memory used for all tasks owned by the service that are running on the cluster is calculated, and those numbers are reported to CloudWatch as a percentage of the total resources that are specified for the service in the service's task definition.

It's important to understand what the Docker CPUUtilization/MemoryUtilization is for the forum. Could Topcoder provide screenshots for May/June from AWS? This data is available on the “Metrics” tab of the ECS Service details page in the AWS Console. It gives us a quick snapshot of how the forum is doing, whether the forum is bottlenecked by CPU, etc.

Statistics

| | April | May | June (10 June) | July |
| -- | -- | -- | -- | -- |
| Users | 4816 | 6264 | 6844 | |
| Categories | 2789 | 3363 | 3539 | |
| Max number of requests per hour | 4668 (21.04) | 13748 (10.05) | 5814 (1.06) | |
| Avg number of requests per hour, depending on the hour | 800-2000 | 1000-2800 | 1100-2800 | |
| How long is the average request | 95% - < 700 ms | 95% - < 700 ms | 95% - < 700 ms | |
| How long is the max request | | | (Last 3 days) 43.2 s. We've seen some requests with response times up to 2.5 mins | |
| Maximum number of simultaneous visitors | No data in Tideways | No data in Tideways | No data in Tideways | |
| Max memory per request | 50 MB | 65 MB | 90 MB | |

The list of most frequent requests and memory consumption per request

jmgasper commented 3 years ago

@Gunasekar-K - Can you provide more details please? ☝️

jmgasper commented 3 years ago

@atelomycterus - I got this for prod utilisation of the container. Seems concerning to me:

image (4)

atelomycterus commented 3 years ago

@jmgasper Obviously, there is a problem, but a single screenshot does not give the whole picture. It is not clear from this chart where the 100% level is for CPUUtilization. High usage (> 100%) points to a poor resource configuration of the cluster or the service. There are a lot of spikes, which means that insufficient resources have been allocated to the forum (task). We need to increase the reservations to allow optimal resource usage.

It is not clear from this chart where the value of 100% is for MemoryUtilization.

I looked at the data in tideways:

This is an example of how the CPU reservation should be calculated. Suppose the service now uses 200% CPU. We need to adjust the reserved resources for the service using the following formula: (X / 100) * Y * 1.3. This means that we should reserve (200 / 100) * 100 * 1.3 = 260 CPU units.

X: current CPU utilization (percentage)
Y: the value you specified in the Task parameters. The CPU Unit value is currently 100 (https://github.com/topcoder-platform/forums/issues/627#issuecomment-857418782).
1.3: the buffer capacity you want to reserve for workload spikes. Please note that 1.3 may not fit everyone's needs; it's just a baseline.
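The reservation formula can be written as a small Python helper to try other utilization values. This is a sketch only: the function name `recommended_cpu_units` is hypothetical, and rounding to a whole number of CPU units is an assumption added here.

```python
def recommended_cpu_units(utilization_pct: float,
                          reserved_units: int,
                          buffer: float = 1.3) -> int:
    """Return (X / 100) * Y * buffer, rounded to whole CPU units.

    utilization_pct -- X, current CPUUtilization in percent of the reservation
    reserved_units  -- Y, CPU units currently set in the task definition
    buffer          -- headroom factor for workload spikes (1.3 is a baseline)
    """
    if utilization_pct < 0 or reserved_units <= 0 or buffer <= 0:
        raise ValueError("inputs must be positive")
    return round(utilization_pct / 100 * reserved_units * buffer)


if __name__ == "__main__":
    # The example from the comment above: 200% utilization of 100 CPU units.
    print(recommended_cpu_units(200, 100))
```

The same calculation applies to memory if MemoryUtilization and the task memory reservation are used as X and Y.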

According to Tideways, the performance issues occurred on 1 June. On that date there were 200+ requests with high response times (30 secs to 2.5 mins). We need to analyze the data for the previous weeks. Please have a look at the statistics: average, minimum, maximum, 95th percentile. Perhaps CPU usage was higher because there was more activity on the forum. It's essential to specify correct values for CPU/Memory. Ignoring this rule leads to unstable operation of the forum itself and of its neighbors (tasks that are running on the same instance).

For further steps, please provide access to the statistics; otherwise Topcoder administrators need to check the resource usage of the cluster and the resource usage at the individual service level.

How to calculate the resource reservation

1. Specify the resources you think your service/task will require.
2. Launch it in the cluster and put some workload on it, or wait for 3-7 days to collect statistics.
3. CPU Utilization: go to the “Metrics” tab on the ECS Service details page in the AWS Console and analyze the data.
4. Memory Utilization: go to the “Metrics” tab on the ECS Service details page in the AWS Console and analyze the data.
5. Apply the changes and deploy a new version of the task definition.
6. Monitor it and repeat the steps starting from Step 1 if needed.

jmgasper commented 3 years ago

@atelomycterus - Thanks for that - I'm pushing this onto Topcoder to try to figure out. I'll open up a new ticket after we've bumped everything up and tested fully. It would be nice if I had access to all this stuff, but unfortunately I don't...

jmgasper commented 3 years ago

Payment task has been updated: https://www.topcoder.com/challenges/16a2ddb4-0dbc-4003-8c75-17c766f75151
Payments Complete
Winner: obog
Copilot: ghostar
Challenge 16a2ddb4-0dbc-4003-8c75-17c766f75151 has been paid and closed.

This is an automated message for ghostar via Topcoder X

atelomycterus commented 3 years ago

@jmgasper Ok. Thanks!
