Training Error?? - Githubissues

hendrycks / anomaly-seg

The Combined Anomalous Object Segmentation (CAOS) Benchmark

MIT License


Training Error?? #24

Closed edwardcho closed 2 years ago

edwardcho commented 2 years ago

Hello Sir,

I still couldn't solve my error. I am using your code with config/ade20k-resnet50dilated-ppm_deepsup.yaml and streethazards_train.tar.

When training started, I got the following:

[2022-01-04 04:41:01,710 INFO train.py line 249 15579] Outputing checkpoints to: ckpt/ade20k-resnet50dilated-ppm_deepsup
# samples: 5125
1 Epoch = 5000 iters
Traceback (most recent call last):
  File "/data/TESTBOARD/additional_networks/anomaly_detection/anomaly-seg/semantic-segmentation-pytorch/train.py", line 276, in <module>
    main(cfg, gpus)
  File "/data/TESTBOARD/additional_networks/anomaly_detection/anomaly-seg/semantic-segmentation-pytorch/train.py", line 202, in main
    train(segmentation_module, iterator_train, optimizers, history, epoch+1, cfg)
  File "/data/TESTBOARD/additional_networks/anomaly_detection/anomaly-seg/semantic-segmentation-pytorch/train.py", line 41, in train
    loss, acc = segmentation_module(batch_data)
  File "/home/mirero/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/TESTBOARD/additional_networks/anomaly_detection/anomaly-seg/semantic-segmentation-pytorch/models/models.py", line 34, in forward
    (pred, pred_deepsup) = self.decoder(self.encoder(feed_dict['img_data'], return_feature_maps=True))
TypeError: list indices must be integers or slices, not str

What am I doing wrong? Thanks, Edward Cho.

xksteven commented 2 years ago

I will double check tonight or tomorrow and get back to you.

edwardcho commented 2 years ago

Yes, thanks. If you have any suggestions for me, please let me know.

Thanks.

xksteven commented 2 years ago

Could you provide some more context about your setup? I downloaded the code and data from scratch, then followed the instructions in the README. I had to install yacs, a dependency I forgot to mention (though I believe it is listed in the submodule requirements), and then started the training.

python3 train.py --gpus 0-1   
[2022-01-06 23:14:56,706 INFO train.py line 240 1846380] Loaded configuration file config/ade20k-resnet50dilated-ppm_deepsup.yaml                                                                                           
[2022-01-06 23:14:56,706 INFO train.py line 241 1846380] Running with config:                                 
DATASET:                                                                                                       
 imgMaxSize: 1000                                                                                             
 imgSizes: (300, 375, 450, 525, 600)                                                                           
 list_train: ./data/training.odgt                                                                              
 list_val: ./data/validation.odgt                                                                              
 num_class: 150                                                                                               
 padding_constant: 8                                                                                          
 random_flip: True                                                                                            
 root_dataset: ./data/    
 segm_downsampling_rate: 8        
DIR: ckpt/ade20k-resnet50dilated-ppm_deepsup      
MODEL:               
 arch_decoder: ppm_deepsup         
 arch_encoder: resnet50dilated              
 fc_dim: 2048            
 weights_decoder:
 weights_encoder: 
 OOD:                                                                                                          
  exclude_back: False                                                                                          
  ood: msp                                                                                                    
  out_labels: (13,)                                                                                          
TEST:                                                                                                         
  batch_size: 1                                                                                               
  checkpoint: epoch_20.pth                                                                                    
  result: ./                                                                                                
TRAIN:                                                                                                        
  batch_size_per_gpu: 2                                                                                        
  beta1: 0.9                                                                                                  
  deep_sup_scale: 0.4                                                                                         
  disp_iter: 20                                                                                                 
  epoch_iters: 5000                                                                                            
  fix_bn: False                                                                                              
  lr_decoder: 0.02                                                                                           
  lr_encoder: 0.02                                                                                           
  lr_pow: 0.9                                                                                                 
  num_epoch: 20                                                                                              
  optim: SGD                                                                                                 
  seed: 304                                                                                                   
  start_epoch: 0                                                                                             
  weight_decay: 0.0001                                                                                     
  workers: 16                                                                                              
VAL:                                                                                                          
  batch_size: 1                                                                                               
  checkpoint: epoch_20.pth                                                                                     
  visualize: False                                                                                            
[2022-01-06 23:14:56,706 INFO train.py line 246 1846380] Outputing checkpoints to: ckpt/ade20k-resnet50dilated-ppm_deepsup                                                                                                  
# samples: 5125                                                                                               
1 Epoch = 5000 iters                                                                                          
Epoch: [1][0/5000], Time: 9.62, Data: 2.50, lr_encoder: 0.020000, lr_decoder: 0.020000, Accuracy: 0.66, Loss: 7.690948                                                                                                      
Epoch: [1][20/5000], Time: 1.21, Data: 0.16, lr_encoder: 0.019996, lr_decoder: 0.019996, Accuracy: 70.52, Loss: 2.431588                                                                                                    
Epoch: [1][40/5000], Time: 0.96, Data: 0.10, lr_encoder: 0.019993, lr_decoder: 0.019993, Accuracy: 76.97, Loss: 1.634942

It seems to be training successfully.

xksteven commented 2 years ago

I will close this soon unless I get more information, since otherwise I cannot reproduce your issue.

LT1st commented 2 years ago

I'm facing the same issue. Have you debugged it yet?

xksteven commented 2 years ago

@LT1st can you describe your steps or what you did?

xksteven commented 2 years ago

It seems the issue is with training on one GPU. I'll update the README and the previous issue. A solution along these lines is described in https://github.com/CSAILVision/semantic-segmentation-pytorch/issues/58 but I haven't tested single-GPU support and am not sure when I'll be able to. Maybe sometime next week.
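
For anyone landing here with the same traceback, below is a minimal sketch of what the error message suggests and of the kind of workaround that could apply. It rests on an assumption not confirmed in this thread: that on a single GPU the data loader hands `train.py` a list containing one dict, rather than the dict itself, so `feed_dict['img_data']` ends up indexing a list with a string. The helper `unwrap_single_gpu_batch` and the placeholder batch are hypothetical and are not part of the repository. The sketch uses plain Python so it runs without PyTorch.

```python
from typing import Any, Dict, List, Union

Batch = Dict[str, Any]


def unwrap_single_gpu_batch(batch_data: Union[Batch, List[Batch]]) -> Batch:
    """Hypothetical helper: return the dict the model's forward() expects.

    Assumption: with one GPU the loader yields a one-element list of dicts.
    """
    if isinstance(batch_data, dict):
        # Already in the expected form; nothing to do.
        return batch_data
    if isinstance(batch_data, (list, tuple)) and len(batch_data) == 1:
        # Take the single per-GPU dict out of its wrapper list.
        return batch_data[0]
    raise ValueError("expected a dict or a one-element list of dicts")


# Stand-in for a single-GPU batch; the strings replace real tensors.
fake_batch = [{"img_data": "image placeholder", "seg_label": "label placeholder"}]

# Indexing the list with a string raises the TypeError from the traceback.
try:
    fake_batch["img_data"]
except TypeError as err:
    error_message = str(err)

# After unwrapping, string keys work as forward() expects.
feed_dict = unwrap_single_gpu_batch(fake_batch)
print(error_message)
print(feed_dict["img_data"])
```

If the assumption holds, a call like this would presumably go in `train.py` just before `segmentation_module(batch_data)`, and the tensors inside the dict would probably also need to be moved to the GPU by hand. None of this has been tested against the repository.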