lijiaman / CASENet

Running Error #2

Open FengLoveBella opened 6 years ago

FengLoveBella commented 6 years ago

When I run your code, I encounter the following error:

config:Namespace(batch_size=1, checkpoint_folder='./checkpoint', cls_num=20, epochs=150, lr=1e-07, lr_steps=[10000, 20000, 30000, 40000], momentum=0.9, multigpu=False, pretrained_model='', print_freq=1, resume_model='', start_epoch=0, weight_decay=0.0005, workers=16)
('train_dataset len', 42490)
Totally new layer:score_edge_side1
Totally new layer:score_edge_side2
Totally new layer:score_edge_side3
Totally new layer:score_cls_side5
Totally new layer:ce_fusion
label_name:
label_name:
Traceback (most recent call last):
  File "/home/fengzhou/CASENet/main.py", line 129, in <module>
    main()
  File "/home/fengzhou/CASENet/main.py", line 84, in main
    global_step = model_play.train(args, train_loader, model, optimizer, epoch, curr_lr, win_feats5, win_fusion, viz, global_step)
  File "/home/fengzhou/CASENet/train_val/model_play.py", line 31, in train
    for i, (img, target) in enumerate(train_loader):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
KeyError: 'Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/fengzhou/CASENet/dataloader/SBD_data.py", line 61, in __getitem__
    np_data = self.h5_f['data/'+labelname.replace('/', '').replace('bin', 'npy')]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)
  File "/usr/lib/python2.7/dist-packages/h5py/_hl/group.py", line 166, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2577)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/_objects.c:2536)
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open (/build/h5py-nQFNYZ/h5py-2.6.0/h5py/h5o.c:3407)
KeyError: 'Unable to open object (Bad object header version number)'
'

Have you encountered this before? Thank you very much. @lijiaman

lijiaman commented 6 years ago

This seems to be an issue with the number of workers. Setting workers to 1 may fix it. (I'd suggest changing the data input to an image format instead of HDF5; using multiple workers seems to cause issues with HDF5 files.)
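If keeping multiple workers matters, one commonly suggested workaround is to open the HDF5 file lazily inside each worker process instead of once in the dataset constructor. The sketch below assumes the error comes from a single file handle (like `self.h5_f` in the traceback) being shared across forked workers, which this thread does not confirm. The class name `LazyFileDataset`, the `opener` parameter, and the helper `_open_hdf5` are hypothetical and not part of this repository; the class is duck-typed and only provides the `__len__` and `__getitem__` methods a map-style dataset needs.

```python
import os


def _open_hdf5(path):
    # Imported here so the module loads even where h5py is not installed.
    import h5py
    return h5py.File(path, 'r')


class LazyFileDataset(object):
    """Hypothetical dataset that opens its backing file on first access.

    The handle is never created in __init__, and it is reopened whenever
    the current process id differs from the one that opened it, so each
    worker process ends up with its own handle.
    """

    def __init__(self, path, keys, opener=None):
        self.path = path
        self.keys = list(keys)
        self._opener = opener if opener is not None else _open_hdf5
        self._handle = None
        self._owner_pid = None

    def _file(self):
        pid = os.getpid()
        if self._handle is None or self._owner_pid != pid:
            self._handle = self._opener(self.path)
            self._owner_pid = pid
        return self._handle

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, index):
        # With h5py this returns a Dataset object; convert it to an
        # array before handing it to a collate function.
        return self._file()[self.keys[index]]
```

The `opener` argument exists only so the lazy-open behaviour can be exercised without a real HDF5 file, for example by passing a function that returns a plain dict.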

FengLoveBella commented 6 years ago

@lijiaman Yes, after setting workers to 1 it runs now, but the total loss is extremely high, about 2,000,000, and shows no sign of decreasing. Is that normal?

FengLoveBella commented 6 years ago

@lijiaman [Screenshot from 2018-07-17 17-32-25] I followed your code and ran into this problem. I am not sure whether it is a bug in the training dataset or in the training network. I look forward to your reply.

mengxingkong commented 5 years ago

@zhoufengbuaa Hi, I recently had to reproduce CASENet, and when I run this repository I hit the same problem (the learning rate is far too high). Have you resolved it? I hope you can help me. Looking forward to your reply. [Screenshot from 2019-04-15 16-19-08]

shoutOutYangJie commented 5 years ago

> @zhoufengbuaa Hi, I recently had to reproduce CASENet, and when I run this repository I hit the same problem (the learning rate is far too high). Have you resolved it? I hope you can help me. Looking forward to your reply. [Screenshot from 2019-04-15 16-19-08]

Hi, have you tested it? How does it perform?