[Open] rraju1 opened this issue 2 years ago
What `log_level` are you running the script with? Can you please post the config box that gets printed out when you run the script?
I ran the script with `log_level = 1`. The exact command I used to run it is the following:

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    --data.train_dataset=/staging/groups/lipasti_group/train_400_0.10_90.ffcv \
    --data.val_dataset=/staging/groups/lipasti_group/val_400_0.10_90.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder="."

And I don't know if it's relevant, but I used `./write_imagenet.sh 400 0.10 90` to create the ffcv files.
┌─ Arguments defined ──────┬──────────────────────────────────────────────────────┐
│ Parameter │ Value │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ model.arch │ resnet18 │
│ model.pretrained │ 0 │
│ resolution.min_res │ 160 │
│ resolution.max_res │ 192 │
│ resolution.end_ramp │ 13 │
│ resolution.start_ramp │ 11 │
│ data.train_dataset │ /staging/groups/lipasti_group/train_400_0.10_90.ffcv │
│ data.val_dataset │ /staging/groups/lipasti_group/val_400_0.10_90.ffcv │
│ data.num_workers │ 12 │
│ data.in_memory │ 1 │
│ lr.step_ratio │ 0.1 │
│ lr.step_length │ 30 │
│ lr.lr_schedule_type │ cyclic │
│ lr.lr │ 0.5 │
│ lr.lr_peak_epoch │ 2 │
│ logging.folder │ . │
│ logging.log_level │ 1 │
│ validation.batch_size │ 512 │
│ validation.resolution │ 256 │
│ validation.lr_tta │ 1 │
│ training.eval_only │ 0 │
│ training.batch_size │ 1024 │
│ training.optimizer │ sgd │
│ training.momentum │ 0.9 │
│ training.weight_decay │ 5e-05 │
│ training.epochs │ 16 │
│ training.label_smoothing │ 0.1 │
│ training.distributed │ 0 │
│ training.use_blurpool │ 1 │
│ dist.world_size │ 1 │
│ dist.address │ localhost │
│ dist.port │ 12355 │
└──────────────────────────┴──────────────────────────────────────────────────────┘
What output do you get in `./log`?
Below is the output
Running job
=> Logging in /var/lib/condor/execute/slot1/dir_27650/2dbc9849-6bef-4d25-8dff-d3f250cc9d78
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.09058000147342682, 'top_5': 0.2343199998140335, 'val_time': 9.118002891540527, 'train_loss': None, 'epoch': 0}
=> Log: {'current_lr': 0.49980017985611513, 'top_1': 0.1768600046634674, 'top_5': 0.38374000787734985, 'val_time': 7.200252532958984, 'train_loss': None, 'epoch': 1}
=> Log: {'current_lr': 0.464314262875414, 'top_1': 0.25446000695228577, 'top_5': 0.5013999938964844, 'val_time': 6.987864017486572, 'train_loss': None, 'epoch': 2}
=> Log: {'current_lr': 0.4285999771611283, 'top_1': 0.2940399944782257, 'top_5': 0.5546000003814697, 'val_time': 6.982339382171631, 'train_loss': None, 'epoch': 3}
=> Log: {'current_lr': 0.3928856914468425, 'top_1': 0.3174000084400177, 'top_5': 0.5855200290679932, 'val_time': 7.011723279953003, 'train_loss': None, 'epoch': 4}
=> Log: {'current_lr': 0.3571714057325568, 'top_1': 0.355459988117218, 'top_5': 0.6302800178527832, 'val_time': 7.0547261238098145, 'train_loss': None, 'epoch': 5}
=> Log: {'current_lr': 0.3214571200182711, 'top_1': 0.4002799987792969, 'top_5': 0.6700000166893005, 'val_time': 7.045661926269531, 'train_loss': None, 'epoch': 6}
=> Log: {'current_lr': 0.2857428343039854, 'top_1': 0.4112600088119507, 'top_5': 0.6878399848937988, 'val_time': 7.064192056655884, 'train_loss': None, 'epoch': 7}
=> Log: {'current_lr': 0.2500285485896997, 'top_1': 0.4200800061225891, 'top_5': 0.6910600066184998, 'val_time': 6.991353750228882, 'train_loss': None, 'epoch': 8}
=> Log: {'current_lr': 0.21431426287541397, 'top_1': 0.45730000734329224, 'top_5': 0.7239199876785278, 'val_time': 7.05776834487915, 'train_loss': None, 'epoch': 9}
=> Log: {'current_lr': 0.17859997716112827, 'top_1': 0.48787999153137207, 'top_5': 0.7565799951553345, 'val_time': 6.991325855255127, 'train_loss': None, 'epoch': 10}
=> Log: {'current_lr': 0.14288569144684257, 'top_1': 0.5102400183677673, 'top_5': 0.7673799991607666, 'val_time': 6.966805934906006, 'train_loss': None, 'epoch': 11}
=> Log: {'current_lr': 0.10717140573255682, 'top_1': 0.5567200183868408, 'top_5': 0.8069599866867065, 'val_time': 7.033263683319092, 'train_loss': None, 'epoch': 12}
=> Log: {'current_lr': 0.07145712001827112, 'top_1': 0.5875599980354309, 'top_5': 0.8281400203704834, 'val_time': 7.092468738555908, 'train_loss': None, 'epoch': 13}
=> Log: {'current_lr': 0.03574283430398542, 'top_1': 0.6301400065422058, 'top_5': 0.8547599911689758, 'val_time': 7.117802619934082, 'train_loss': None, 'epoch': 14}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.002474308013916, 'epoch': 15, 'total time': 2081.8419053554535}
I have the same issue. It looks like `train_loop` doesn't return the loss at all.
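In case it's useful, below is a minimal sketch of the kind of change I mean: accumulate the per-batch loss inside the training loop and return its epoch mean, so the logger has a number to record instead of `None`. This is purely illustrative and not the actual code from train_imagenet.py; the function signature and names here are made up.

```python
# Hypothetical sketch (not FFCV's actual train_imagenet.py code):
# track the per-batch loss and return its epoch mean so it can be logged.
import torch
import torch.nn.functional as F


def train_loop(model, loader, optimizer, device="cuda"):
    model.train()
    losses = []
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        output = model(images)
        loss = F.cross_entropy(output, labels, label_smoothing=0.1)
        loss.backward()
        optimizer.step()
        # keep the detached loss on the GPU and sync only once per epoch
        losses.append(loss.detach())
    return torch.stack(losses).mean().item()


# The caller could then record it, e.g.:
#   stats = {'epoch': epoch, 'train_loss': train_loop(model, train_loader, opt)}
```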
Hi,
Thank you for the awesome project and the training script. I was able to replicate the ResNet-18 result for 16 epochs (using the recommended resnet18 dataset settings), and it came out to roughly the same accuracy. My question is about the `train_loss` variable coming out as `None`, and about the validation loss not being recorded in the log at all. Judging by the top-1/top-5 accuracies, training looks like it is working, but it would still be nice to have both losses logged. Can you confirm whether this is the expected behavior?
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.06347999721765518, 'top_5': 0.18308000266551971, 'val_time': 9.070343971252441, 'train_loss': None, 'epoch': 0}
Thanks!
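To clarify what I mean by recording the validation loss, here is a rough sketch of the kind of eval loop I had in mind, where the loss is accumulated alongside top-1/top-5 accuracy. This is purely illustrative and none of these names come from the repo.

```python
# Purely illustrative sketch (not the repo's validation code): accumulate the
# validation loss alongside top-1/top-5 accuracy so all three can be logged.
import torch
import torch.nn.functional as F


@torch.no_grad()
def val_loop(model, loader, device="cuda"):
    model.eval()
    total_loss, top1, top5, n = 0.0, 0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        output = model(images)
        total_loss += F.cross_entropy(output, labels, reduction="sum").item()
        _, pred = output.topk(5, dim=1)           # top-5 predicted classes per sample
        correct = pred.eq(labels.unsqueeze(1))    # [batch, 5] boolean matches
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        n += labels.size(0)
    return {'val_loss': total_loss / n, 'top_1': top1 / n, 'top_5': top5 / n}
```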