[Open] rraju1 opened this issue 2 years ago
What `log_level` are you running the script with? Can you please post the config box that gets printed out when you run the script?
I ran the script with `log_level = 1`. The exact command I used to run it is the following:

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    --data.train_dataset=/staging/groups/lipasti_group/train_400_0.10_90.ffcv \
    --data.val_dataset=/staging/groups/lipasti_group/val_400_0.10_90.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder="."

And I don't know if it's relevant, but I used `./write_imagenet.sh 400 0.10 90` to create the ffcv files.
┌─ Arguments defined ──────┬──────────────────────────────────────────────────────┐
│ Parameter │ Value │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ model.arch │ resnet18 │
│ model.pretrained │ 0 │
│ resolution.min_res │ 160 │
│ resolution.max_res │ 192 │
│ resolution.end_ramp │ 13 │
│ resolution.start_ramp │ 11 │
│ data.train_dataset │ /staging/groups/lipasti_group/train_400_0.10_90.ffcv │
│ data.val_dataset │ /staging/groups/lipasti_group/val_400_0.10_90.ffcv │
│ data.num_workers │ 12 │
│ data.in_memory │ 1 │
│ lr.step_ratio │ 0.1 │
│ lr.step_length │ 30 │
│ lr.lr_schedule_type │ cyclic │
│ lr.lr │ 0.5 │
│ lr.lr_peak_epoch │ 2 │
│ logging.folder │ . │
│ logging.log_level │ 1 │
│ validation.batch_size │ 512 │
│ validation.resolution │ 256 │
│ validation.lr_tta │ 1 │
│ training.eval_only │ 0 │
│ training.batch_size │ 1024 │
│ training.optimizer │ sgd │
│ training.momentum │ 0.9 │
│ training.weight_decay │ 5e-05 │
│ training.epochs │ 16 │
│ training.label_smoothing │ 0.1 │
│ training.distributed │ 0 │
│ training.use_blurpool │ 1 │
│ dist.world_size │ 1 │
│ dist.address │ localhost │
│ dist.port │ 12355 │
└──────────────────────────┴──────────────────────────────────────────────────────┘
What output do you get in `./log`?
Below is the output
Running job
=> Logging in /var/lib/condor/execute/slot1/dir_27650/2dbc9849-6bef-4d25-8dff-d3f250cc9d78
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.09058000147342682, 'top_5': 0.2343199998140335, 'val_time': 9.118002891540527, 'train_loss': None, 'epoch': 0}
=> Log: {'current_lr': 0.49980017985611513, 'top_1': 0.1768600046634674, 'top_5': 0.38374000787734985, 'val_time': 7.200252532958984, 'train_loss': None, 'epoch': 1}
=> Log: {'current_lr': 0.464314262875414, 'top_1': 0.25446000695228577, 'top_5': 0.5013999938964844, 'val_time': 6.987864017486572, 'train_loss': None, 'epoch': 2}
=> Log: {'current_lr': 0.4285999771611283, 'top_1': 0.2940399944782257, 'top_5': 0.5546000003814697, 'val_time': 6.982339382171631, 'train_loss': None, 'epoch': 3}
=> Log: {'current_lr': 0.3928856914468425, 'top_1': 0.3174000084400177, 'top_5': 0.5855200290679932, 'val_time': 7.011723279953003, 'train_loss': None, 'epoch': 4}
=> Log: {'current_lr': 0.3571714057325568, 'top_1': 0.355459988117218, 'top_5': 0.6302800178527832, 'val_time': 7.0547261238098145, 'train_loss': None, 'epoch': 5}
=> Log: {'current_lr': 0.3214571200182711, 'top_1': 0.4002799987792969, 'top_5': 0.6700000166893005, 'val_time': 7.045661926269531, 'train_loss': None, 'epoch': 6}
=> Log: {'current_lr': 0.2857428343039854, 'top_1': 0.4112600088119507, 'top_5': 0.6878399848937988, 'val_time': 7.064192056655884, 'train_loss': None, 'epoch': 7}
=> Log: {'current_lr': 0.2500285485896997, 'top_1': 0.4200800061225891, 'top_5': 0.6910600066184998, 'val_time': 6.991353750228882, 'train_loss': None, 'epoch': 8}
=> Log: {'current_lr': 0.21431426287541397, 'top_1': 0.45730000734329224, 'top_5': 0.7239199876785278, 'val_time': 7.05776834487915, 'train_loss': None, 'epoch': 9}
=> Log: {'current_lr': 0.17859997716112827, 'top_1': 0.48787999153137207, 'top_5': 0.7565799951553345, 'val_time': 6.991325855255127, 'train_loss': None, 'epoch': 10}
=> Log: {'current_lr': 0.14288569144684257, 'top_1': 0.5102400183677673, 'top_5': 0.7673799991607666, 'val_time': 6.966805934906006, 'train_loss': None, 'epoch': 11}
=> Log: {'current_lr': 0.10717140573255682, 'top_1': 0.5567200183868408, 'top_5': 0.8069599866867065, 'val_time': 7.033263683319092, 'train_loss': None, 'epoch': 12}
=> Log: {'current_lr': 0.07145712001827112, 'top_1': 0.5875599980354309, 'top_5': 0.8281400203704834, 'val_time': 7.092468738555908, 'train_loss': None, 'epoch': 13}
=> Log: {'current_lr': 0.03574283430398542, 'top_1': 0.6301400065422058, 'top_5': 0.8547599911689758, 'val_time': 7.117802619934082, 'train_loss': None, 'epoch': 14}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.002474308013916, 'epoch': 15, 'total time': 2081.8419053554535}
I have the same issue. It looks like `train_loop` doesn't return the loss at all.
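In case it's useful, below is a minimal sketch of the kind of change I mean: accumulate the per-batch loss inside the training loop and return its epoch mean, so the logger has a number to record instead of `None`. This is purely illustrative and not the actual code from train_imagenet.py; the function signature and names here are made up.

```python
# Hypothetical sketch (not FFCV's actual train_imagenet.py code):
# track the per-batch loss and return its epoch mean so it can be logged.
import torch
import torch.nn.functional as F


def train_loop(model, loader, optimizer, device="cuda"):
    model.train()
    losses = []
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad(set_to_none=True)
        output = model(images)
        loss = F.cross_entropy(output, labels, label_smoothing=0.1)
        loss.backward()
        optimizer.step()
        # keep the detached loss on the GPU and sync only once per epoch
        losses.append(loss.detach())
    return torch.stack(losses).mean().item()


# The caller could then record it, e.g.:
#   stats = {'epoch': epoch, 'train_loss': train_loop(model, train_loader, opt)}
```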
Hi,
Thank you for the awesome project and the training script. I was able to replicate the ResNet-18 result for 16 epochs (using the recommended resnet18 dataset settings), and it came out to roughly the same accuracy. My question is about the `train_loss` variable coming out as `None`, and about the validation loss not being recorded in the log at all. Judging by the top-1/top-5 accuracies, training looks like it is working, but it would still be nice to have both losses logged. Can you confirm whether this is the expected behavior?
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.06347999721765518, 'top_5': 0.18308000266551971, 'val_time': 9.070343971252441, 'train_loss': None, 'epoch': 0}
Thanks!
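To clarify what I mean by recording the validation loss, here is a rough sketch of the kind of eval loop I had in mind, where the loss is accumulated alongside top-1/top-5 accuracy. This is purely illustrative and none of these names come from the repo.

```python
# Purely illustrative sketch (not the repo's validation code): accumulate the
# validation loss alongside top-1/top-5 accuracy so all three can be logged.
import torch
import torch.nn.functional as F


@torch.no_grad()
def val_loop(model, loader, device="cuda"):
    model.eval()
    total_loss, top1, top5, n = 0.0, 0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        output = model(images)
        total_loss += F.cross_entropy(output, labels, reduction="sum").item()
        _, pred = output.topk(5, dim=1)           # top-5 predicted classes per sample
        correct = pred.eq(labels.unsqueeze(1))    # [batch, 5] boolean matches
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        n += labels.size(0)
    return {'val_loss': total_loss / n, 'top_1': top1 / n, 'top_5': top5 / n}
```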