intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

[Feature]: Summarize the elapsed time of PyTorch ops in a training job. #664

Open workingloong opened 1 year ago

workingloong commented 1 year ago

Users often need to find the bottleneck in a training pipeline by looking at the elapsed time of individual ops. If we summarize the elapsed time of ops automatically once training starts, we can detect the bottleneck automatically and then either mitigate it or give users suggestions.

created-Bi commented 1 year ago

import time

def train():
    for i, epoch in enumerate(range(start_epoch, end_epoch)):
        for train_sample in train_data_loader:
            start_time = time.time()
            ...  # doing... (the training step for this sample)
            print('Time consuming: {}s'.format(time.time() - start_time))
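Printing one duration per step makes it hard to see which stage dominates, so a small extension of the same idea is to accumulate time per named stage and print a summary at the end. The sketch below is only an illustration, not DLRover code: `StepTimer`, `record`, `summary`, and the stage names are hypothetical, and `time.sleep` stands in for real data loading and compute. It measures coarse wall-clock stages, not individual PyTorch ops.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.total = defaultdict(float)  # stage name -> total seconds
        self.count = defaultdict(int)    # stage name -> number of calls

    @contextmanager
    def record(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Runs even if the timed block raises, so no sample is lost.
            self.total[name] += time.perf_counter() - start
            self.count[name] += 1

    def summary(self):
        """Return (name, total_s, calls, mean_s) rows, slowest stage first."""
        rows = [
            (name, total, self.count[name], total / self.count[name])
            for name, total in self.total.items()
        ]
        return sorted(rows, key=lambda row: row[1], reverse=True)


def train(timer, num_steps=3):
    # Stand-in stages: the sleeps simulate data loading and the compute step.
    for _ in range(num_steps):
        with timer.record("data_loading"):
            time.sleep(0.001)
        with timer.record("forward_backward"):
            time.sleep(0.005)


timer = StepTimer()
train(timer)
for name, total, calls, mean in timer.summary():
    print(f"{name}: total={total:.4f}s calls={calls} mean={mean:.4f}s")
```

One caveat for GPU training: CUDA kernels are launched asynchronously, so a host-side timer like this mostly measures launch time unless `torch.cuda.synchronize()` is called before reading the clock. For true per-op numbers, PyTorch's built-in `torch.profiler` is probably a better starting point than manual timers.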

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had recent activity.