massquantity / LibRecommender

Versatile End-to-End Recommender System
https://librecommender.readthedocs.io/

Abrupt training slow down issue #474

Open noman-git opened 1 month ago

noman-git commented 1 month ago

Hi, I've been using this library with the GraphSageDGL/GraphSage models and it has worked fine for months. Recently, however, I needed to create multiple models for different datasets, and I started encountering a slowdown during training. At first I thought it was because my server had less compute available while multiple inference jobs were running on it, so I moved to a new server, but that had no effect.

GraphSageDGL usually trains at around 22 it/s (7 it/s for GraphSage). Sometimes it keeps this speed and completes all 100 epochs. However, sometimes (usually halfway through the 1st epoch or at the beginning of the 2nd) it slows down to 1-1.5 it/s. Sometimes it oscillates between 22 and 1 it/s, and sometimes it stays at 1 it/s for the rest of the run. As a result, a model that used to train in 40 minutes is still training after 15 hours.
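For context, here is a minimal sketch of the kind of LibRecommender training run being described; the it/s figures come from the progress bar printed during `fit()`. This is not the reporter's actual script: the data file name and hyperparameters are placeholders, and exact signatures may differ slightly between versions.

```python
import pandas as pd
from libreco.data import DatasetPure, random_split
from libreco.algorithms import GraphSageDGL

# Interactions are expected to have "user", "item", "label" columns.
# "interactions.csv" is a placeholder file name.
data = pd.read_csv("interactions.csv")
train_data, eval_data = random_split(data, multi_ratios=[0.8, 0.2])

train_data, data_info = DatasetPure.build_trainset(train_data)
eval_data = DatasetPure.build_evalset(eval_data)

# Hyperparameters below are illustrative only.
model = GraphSageDGL(
    task="ranking",
    data_info=data_info,
    loss_type="max_margin",
    n_epochs=100,
    lr=3e-4,
    batch_size=2048,
)
# The per-epoch progress bar printed here is where the it/s numbers come from.
model.fit(train_data, neg_sampling=True, verbose=2, eval_data=eval_data)
```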

I thought this might be a DGL problem, so I tried the vanilla GraphSage model, and it has the same problem: the speed drops from 7 it/s to 1-1.5 it/s at some random epoch and then follows the same pattern as the DGL version.

I have tried both the CPU and CUDA builds of PyTorch (I am not using a GPU, so the CUDA build also runs on CPU), and both reproduce the same problem. However, as I mentioned, it happens randomly and sometimes does not happen at all.

Also, it never happens on my MacBook M2 Pro, where the process is a bit slower (15 it/s) but always consistent. The issue happens on an Amazon EC2 server with 8 CPUs (x86_64) and 32 GB RAM. I checked, and no other background process was running during training.
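One thing worth checking on an 8-vCPU instance is whether the training process is oversubscribing the cores, since PyTorch, DGL, and NumPy/OpenMP each maintain their own thread pools. The following is a hedged diagnostic sketch, not something from the original thread, using standard torch and psutil calls; the thread count and sampling interval are arbitrary.

```python
import psutil
import torch

# PyTorch defaults to using all visible cores; on a small instance this can
# cause oversubscription when other worker threads compete for the same cores.
print("torch threads:", torch.get_num_threads())
torch.set_num_threads(4)  # illustrative value for an 8-vCPU box

# Run alongside training (e.g. in a separate shell) to see whether the
# slowdown coincides with CPU saturation or memory pressure / swapping.
for _ in range(720):  # ~1 hour at 5-second intervals
    per_core = psutil.cpu_percent(interval=5, percpu=True)
    mem = psutil.virtual_memory()
    print(f"cpu={per_core} mem_used={mem.percent}%")
```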

Here are the libraries I am using:

dgl==1.1.2
LibRecommender==1.3.0
loguru==0.7.2
pandas==2.1.1
psycopg2-binary==2.9.7
python-dotenv==1.0.0
scikit-learn==1.3.1
scipy==1.11.2
tensorflow==2.14.0
torch==2.1.0
torchvision
torchaudio
boto3==1.28.55
botocore==1.31.55

(Screenshot attached: 2024-05-30 at 5.59.52 PM)

massquantity commented 1 month ago

Is it possible that the EC2 server encounters some peak period from time to time, so the computing resources become limited?

noman-git commented 1 month ago

> Is it possible that the EC2 server encounters some peak period from time to time, so the computing resources become limited?

That is not possible; it is a dedicated server, and the compute is either available or the server terminates completely. Also, I have nothing else on the server except the training code and the conda installation for my Python environment.

massquantity commented 1 month ago

That's what the AWS advertisement suggests. Based on your description, I couldn't think of any other reasons.

massquantity commented 1 month ago

This is ChatGPT's answer to why a dedicated instance in EC2 can become slow:

There are several reasons why a dedicated instance in Amazon EC2 might become slow at times. Even though dedicated instances are supposed to provide more consistent performance compared to shared instances, various factors can still impact their performance:

1. Resource Contention: Although dedicated instances ensure that the physical server is not shared with other AWS customers, resource contention can still occur if there are multiple instances on the same host competing for resources such as CPU, memory, disk I/O, or network bandwidth.

2. Network Bottlenecks: Network performance can be affected by various factors, including network congestion within the AWS infrastructure, bandwidth limitations, or network configuration issues.

3. Disk I/O Performance: If your application is heavily dependent on disk operations, the performance of your instance can be impacted by the underlying EBS (Elastic Block Store) performance. EBS volumes have IOPS (Input/Output Operations Per Second) limits, and reaching these limits can slow down your instance.

4. Instance Type Limitations: Each EC2 instance type has specific hardware characteristics and performance limits. If your workload exceeds the capacity of your chosen instance type, you might experience performance degradation. Upgrading to a more powerful instance type might be necessary.

5. CPU Credits (for T2/T3 Instances): If you're using burstable performance instances like T2 or T3, performance can degrade when the CPU credits are exhausted. These instances are designed to provide consistent baseline performance with the ability to burst above the baseline, but only if sufficient CPU credits are available. (See the sketch after this list for one way to check the credit balance.)

6. Background Processes: Background processes or daemons running on your instance can consume resources, leading to slower performance. Regularly monitoring and managing these processes is crucial to maintaining optimal performance.

7. Memory Leaks: Applications running on your instance might have memory leaks, leading to increased memory usage over time and eventual performance degradation.

8. Application Bottlenecks: Inefficient code, poorly optimized databases, and other application-specific issues can cause slow performance. Profiling and optimizing your application might be necessary to address these issues.

9. AWS Maintenance: AWS occasionally performs maintenance on its infrastructure, which can temporarily impact instance performance. While such maintenance is usually communicated in advance and scheduled during off-peak hours, it can still cause temporary performance issues.

10. Operating System and Software Configuration: Misconfiguration of the operating system or application software can lead to suboptimal performance. Ensuring that your software stack is properly configured and optimized is essential.

11. Security Groups and Network ACLs: Incorrect or overly restrictive security group or network ACL (Access Control List) configurations can lead to network performance issues.
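Not part of the ChatGPT answer above, but since boto3 is already in the reporter's environment, a hedged sketch like the following could check point 5, i.e. whether the instance is a burstable type that ran out of CPU credits. The instance ID is a placeholder, and the CPUCreditBalance metric is only reported for T-series instances, so an empty result simply means the instance is not burstable.

```python
from datetime import datetime, timedelta, timezone

import boto3

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, replace with the real instance ID

cloudwatch = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

# CPUCreditBalance is only published for burstable (T2/T3/T4g) instances;
# a steadily falling balance that hits zero matches the "sudden slowdown" pattern.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=start,
    EndTime=end,
    Period=3600,
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```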