lanl / OpenFWI

A collection of codes for the OpenFWI project
BSD 3-Clause "New" or "Revised" License

30hrs for training one epoch for flatvel_a #8

Closed: Kaustav546 closed this issue 1 month ago

Kaustav546 commented 1 month ago

The training time for the flatvel_a dataset is very high. It takes approximately 30 hours to complete one epoch using the command:

```
python train.py -d 'cuda:0' -ds flatvel-a -n YOUR_DIRECTORY -r CHECKPOINT.PTH -m InversionNet -g2v 0 --tensorboard -t flatvel_a_train.txt -v flatvel_a_val.txt
```

I suspect the long duration is related to GPU memory usage; I am training on a single NVIDIA L4 24 GB GPU. Can you provide insights into what might be causing this extended training time?
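For reference, a minimal sketch (not from the repo) that times a single batch to show whether the DataLoader or the GPU forward/backward pass dominates; `dataloader`, `model`, and `criterion` stand in for the objects that train.py builds:

```python
import time
import torch

def time_one_batch(dataloader, model, criterion, device="cuda:0"):
    """Split one training step into time spent waiting on the DataLoader
    versus time spent in the forward/backward pass on the GPU."""
    model = model.to(device).train()
    t0 = time.time()
    for data, label in dataloader:
        t_data = time.time() - t0                 # time spent fetching the batch
        data, label = data.to(device), label.to(device)
        loss = criterion(model(data), label)
        loss.backward()
        torch.cuda.synchronize(device)            # wait for GPU work before timing
        print(f"data: {t_data:.1f}s  compute: {time.time() - t0 - t_data:.1f}s")
        break                                     # one batch is enough to see the split
```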

Shihang-LANL commented 1 month ago

Can you provide more details about your run settings? It shouldn't take that long. You could reduce the number of samples listed in flatvel_a_train.txt and try again; an example can be found at https://colab.research.google.com/drive/17s5JmVs9ABl8MpmFlhWMSslj9_d5Atfx?usp=sharing. One epoch should finish in seconds.
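If it helps, a minimal sketch of trimming the split file, assuming it lists one sample file per line as in the repo's split_files directory; the `_small` filename and `N = 10` are just placeholders:

```python
# Keep only the first N entries of the training split so one epoch finishes quickly.
N = 10
with open("split_files/flatvel_a_train.txt") as f:
    lines = f.readlines()
with open("split_files/flatvel_a_train_small.txt", "w") as f:
    f.writelines(lines[:N])
# Then point train.py at the smaller split: -t flatvel_a_train_small.txt
```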

Kaustav546 commented 1 month ago

here are the details:

```
Namespace(device='cuda', dataset='flatvel-a', file_size=None, anno_path='split_files',
          train_anno='split_files/flatvel_a_train.txt', val_anno='split_files/flatvel_a_val.txt',
          output_path='/home/deep_koust/FWI_Pretraining/', log_path='/home/deep_koust/FWI_Pretraining/',
          save_name='/home/deep_koust/FWI_Pretraining', suffix=None, model='InversionNet', up_mode=None,
          sample_spatial=1.0, sample_temporal=1, batch_size=256, lr=0.0001, lr_milestones=[],
          momentum=0.9, weight_decay=0.0001, lr_gamma=0.1, lr_warmup_epochs=0, epoch_block=40,
          num_block=3, workers=16, k=1, print_freq=50, resume=None, start_epoch=0, lambda_g1v=1.0,
          lambda_g2v=0.0, sync_bn=False, world_size=1, dist_url='env://', tensorboard=True, epochs=120)
torch version: 2.3.0
torchvision version: 0.18.0
Not using distributed mode
Loading data
Loading training data
Loading validation data
Creating data loaders
/opt/conda/envs/fwi/lib/python3.12/site-packages/torch/utils/data/dataloader.py:558: UserWarning:
This DataLoader will create 16 worker processes in total. Our suggested max number of worker in
current system is 8, which is smaller than what this DataLoader is going to create. Please be aware
that excessive worker creation might get DataLoader running slow or even freeze, lower the worker
number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
Creating model
Start training
Epoch: [0]  [ 0/93]  eta: 2 days, 12:21:43  lr: 0.0001  samples/s: 6.597
  loss: 0.6823 (0.6823)  loss_g1v: 0.6823 (0.6823)  loss_g2v: 0.6820 (0.6820)
  time: 2336.5991  data: 2297.7812  max mem: 19461
Killed
```
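Reading the log, nearly all of the per-step time is data loading (data: 2297.8 s out of time: 2336.6 s), and PyTorch warns that 16 DataLoader workers exceed the 8 suggested for this machine, so the GPU does not appear to be the bottleneck; the final "Killed" may also indicate the host ran out of memory. A minimal sketch (the path is a placeholder) to check the raw read speed of one sample file, independent of the DataLoader:

```python
import time
import numpy as np

# FILE is a placeholder: substitute any path listed in flatvel_a_train.txt
# (OpenFWI samples are distributed as .npy arrays).
FILE = "path/to/data1.npy"

t0 = time.time()
arr = np.load(FILE)
print(f"loaded array of shape {arr.shape} in {time.time() - t0:.2f}s")
```

If a single file takes more than a fraction of a second to read, the slowdown is likely storage I/O rather than the model; lowering the worker count and batch size (the flags are defined in train.py's argparse) would also reduce memory pressure on an 8-core host.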