jindongwang / transferlearning

Transfer learning / domain adaptation / domain generalization / multi-task learning, etc. Papers, code, datasets, applications, tutorials.
http://transferlearning.xyz/
MIT License

DeepCoral training crashes with an error partway through a run #399

Closed · tangkail closed this issue 1 year ago

tangkail commented 1 year ago

Hello Prof. Wang. I have just started learning domain adaptation methods. While testing the CORAL method the training runs, but it frequently crashes partway through with the error below. I read online that this may be a num_workers problem. My machine has a single GTX 1080 Ti GPU and num_workers is set to 1; I wanted to try setting it to 0, but then the program gave no response for a long time. Is there a known fix? Thank you very much! One more question: in domain adaptation, is the accuracy only trustworthy once total_loss has converged, i.e. once test_acc is fairly stable? If the accuracy on the target domain fluctuates a lot and does not converge, what could be the cause? Any answer would be greatly appreciated, thank you!

Namespace(backbone='resnet50', batch_size=2, config='DeepCoral/DeepCoral.yaml', data_dir='3class', device=device(type='cuda'), early_stop=0, epoch_based_training=False, lr=0.001, lr_decay=0.75, lr_gamma=0.0003, lr_scheduler=True, momentum=0.9, n_epoch=50, n_iter_per_epoch=500, num_workers=1, seed=1, src_domain='r3', tgt_domain='r20', transfer_loss='coral', transfer_loss_weight=10.0, use_bottleneck=True, weight_decay=0.0005)

Epoch: [ 1/50], cls_loss: 1.0726, transfer_loss: 0.0000, total_Loss: 1.0730, test_loss 1.061854, test_acc: 40.3333
Epoch: [ 2/50], cls_loss: 0.9751, transfer_loss: 0.0003, total_Loss: 0.9779, test_loss 0.943620, test_acc: 52.6667
Epoch: [ 3/50], cls_loss: 0.9557, transfer_loss: 0.0004, total_Loss: 0.9597, test_loss 0.991058, test_acc: 56.8333
Epoch: [ 4/50], cls_loss: 0.8930, transfer_loss: 0.0005, total_Loss: 0.8979, test_loss 1.245606, test_acc: 51.8333
Epoch: [ 5/50], cls_loss: 0.8951, transfer_loss: 0.0006, total_Loss: 0.9007, test_loss 1.092104, test_acc: 56.3333
Epoch: [ 6/50], cls_loss: 0.8538, transfer_loss: 0.0007, total_Loss: 0.8604, test_loss 1.166265, test_acc: 57.0000
Epoch: [ 7/50], cls_loss: 0.8045, transfer_loss: 0.0007, total_Loss: 0.8116, test_loss 1.352956, test_acc: 53.1667
Epoch: [ 8/50], cls_loss: 0.7933, transfer_loss: 0.0008, total_Loss: 0.8010, test_loss 1.631142, test_acc: 52.0000
Epoch: [ 9/50], cls_loss: 0.7762, transfer_loss: 0.0007, total_Loss: 0.7833, test_loss 1.952749, test_acc: 43.6667
Epoch: [10/50], cls_loss: 0.7284, transfer_loss: 0.0011, total_Loss: 0.7391, test_loss 2.168818, test_acc: 46.0000
Epoch: [11/50], cls_loss: 0.7286, transfer_loss: 0.0011, total_Loss: 0.7392, test_loss 2.116339, test_acc: 51.0000
Epoch: [12/50], cls_loss: 0.7032, transfer_loss: 0.0010, total_Loss: 0.7131, test_loss 1.928333, test_acc: 51.6667
Epoch: [13/50], cls_loss: 0.6878, transfer_loss: 0.0010, total_Loss: 0.6974, test_loss 1.854418, test_acc: 49.5000
Epoch: [14/50], cls_loss: 0.6601, transfer_loss: 0.0011, total_Loss: 0.6711, test_loss 1.921440, test_acc: 52.8333
Epoch: [15/50], cls_loss: 0.6539, transfer_loss: 0.0012, total_Loss: 0.6657, test_loss 2.460182, test_acc: 45.6667
Epoch: [16/50], cls_loss: 0.6452, transfer_loss: 0.0010, total_Loss: 0.6553, test_loss 2.151258, test_acc: 52.8333
Epoch: [17/50], cls_loss: 0.5925, transfer_loss: 0.0013, total_Loss: 0.6051, test_loss 2.466823, test_acc: 49.8333
Epoch: [18/50], cls_loss: 0.5999, transfer_loss: 0.0010, total_Loss: 0.6095, test_loss 2.225831, test_acc: 50.6667
Epoch: [19/50], cls_loss: 0.6024, transfer_loss: 0.0010, total_Loss: 0.6128, test_loss 2.507036, test_acc: 49.0000
Epoch: [20/50], cls_loss: 0.6015, transfer_loss: 0.0011, total_Loss: 0.6130, test_loss 3.139492, test_acc: 43.3333
Epoch: [21/50], cls_loss: 0.5563, transfer_loss: 0.0011, total_Loss: 0.5669, test_loss 2.387875, test_acc: 47.5000
Epoch: [22/50], cls_loss: 0.5242, transfer_loss: 0.0012, total_Loss: 0.5360, test_loss 3.395393, test_acc: 41.1667
Epoch: [23/50], cls_loss: 0.5312, transfer_loss: 0.0011, total_Loss: 0.5422, test_loss 2.934170, test_acc: 43.3333
Epoch: [24/50], cls_loss: 0.5208, transfer_loss: 0.0011, total_Loss: 0.5317, test_loss 2.750415, test_acc: 46.8333
Epoch: [25/50], cls_loss: 0.4946, transfer_loss: 0.0013, total_Loss: 0.5073, test_loss 2.844258, test_acc: 45.1667
Epoch: [26/50], cls_loss: 0.4855, transfer_loss: 0.0012, total_Loss: 0.4979, test_loss 3.304321, test_acc: 46.1667
Epoch: [27/50], cls_loss: 0.4891, transfer_loss: 0.0013, total_Loss: 0.5020, test_loss 3.284366, test_acc: 43.8333
Epoch: [28/50], cls_loss: 0.4689, transfer_loss: 0.0012, total_Loss: 0.4808, test_loss 3.098105, test_acc: 45.5000
Epoch: [29/50], cls_loss: 0.4467, transfer_loss: 0.0012, total_Loss: 0.4592, test_loss 3.048647, test_acc: 46.0000

Traceback (most recent call last):
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/reduction.py", line 180, in send_handle
    sendfds(s, [handle])
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/socket.py", line 160, in __exit__
    self.close()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/socket.py", line 420, in close
    self._real_close()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/socket.py", line 414, in _real_close
    _ss.close(self)
OSError: [Errno 9] Bad file descriptor

Epoch: [30/50], cls_loss: 0.4442, transfer_loss: 0.0012, total_Loss: 0.4565, test_loss 2.654798, test_acc: 48.8333
Epoch: [31/50], cls_loss: 0.4277, transfer_loss: 0.0013, total_Loss: 0.4411, test_loss 3.459523, test_acc: 45.5000

Traceback (most recent call last):
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/reduction.py", line 180, in send_handle
    sendfds(s, [handle])
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/reduction.py", line 145, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 151, in _serve
    close()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

Traceback (most recent call last):
  File "main.py", line 194, in <module>
    main()
  File "main.py", line 190, in main
    train(source_loader, target_train_loader, target_test_loader, model, optimizer, scheduler, args)
  File "main.py", line 158, in train
    test_acc, test_loss = test(model, target_test_loader, args)
  File "main.py", line 99, in test
    for data, target in target_test_loader:
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1148, in _get_data
    success, data = self._try_get_data()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 282, in rebuild_storage_fd
    fd = df.detach()
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/reduction.py", line 185, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/kc501/anaconda3/envs/DA/lib/python3.7/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError
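For reference, the paired "OSError: [Errno 9] Bad file descriptor" and EOFError come from the PyTorch DataLoader worker processes that exchange tensor file descriptors over Unix sockets, not from the DeepCoral code itself. Below is a minimal sketch of the two workarounds that are commonly suggested for this error; the build_loader helper and its arguments are illustrative only and not this repository's actual data-loading code.

```python
import torch.multiprocessing
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Workaround 1 (commonly suggested): share tensors through temporary files in the
# file system instead of passing file descriptors between processes, which avoids
# "Bad file descriptor" when descriptors are exhausted or closed prematurely.
torch.multiprocessing.set_sharing_strategy('file_system')

# Workaround 2: disable worker processes. With num_workers=0 the data is loaded
# in the main process, so the multiprocessing code path is never exercised.
def build_loader(root, batch_size=2, num_workers=0):
    # 'root' is a hypothetical ImageFolder-style directory (e.g. the '3class' data_dir).
    tfm = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(root, transform=tfm)
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,  # 0 = load in the main process
        drop_last=True,
    )
```

With the 'file_system' strategy, shared tensors are backed by files rather than file descriptors, which sidesteps the descriptor limit at the cost of some extra disk traffic.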

jindongwang commented 1 year ago

The num_workers issue is not a problem of this project; you can search online for a solution yourself. For the target accuracy, just take the highest value.
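For the second question, here is a minimal sketch of what "take the highest target accuracy" can look like in a training loop; the train/evaluate signatures and the model(src_x, tgt_x, src_y) call are assumptions for illustration, not the exact API of this repository's main.py.

```python
# Hypothetical training-loop skeleton: evaluate on the target test set after every
# epoch and keep the maximum test_acc seen so far, which is the value to report.
def train(model, source_loader, target_train_loader, target_test_loader,
          optimizer, n_epoch, evaluate, transfer_loss_weight=10.0):
    best_acc = 0.0
    for epoch in range(1, n_epoch + 1):
        model.train()
        for (src_x, src_y), (tgt_x, _) in zip(source_loader, target_train_loader):
            optimizer.zero_grad()
            cls_loss, transfer_loss = model(src_x, tgt_x, src_y)   # assumed model API
            loss = cls_loss + transfer_loss_weight * transfer_loss
            loss.backward()
            optimizer.step()

        test_acc, test_loss = evaluate(model, target_test_loader)  # assumed helper
        best_acc = max(best_acc, test_acc)
        print(f"Epoch {epoch:2d}/{n_epoch}: test_acc={test_acc:.4f}, best_acc={best_acc:.4f}")
    return best_acc
```

Reporting the best target accuracy over all epochs is the usual convention in unsupervised domain adaptation, since no labeled target data is available for model selection; large epoch-to-epoch fluctuation of target accuracy is therefore expected rather than a sign of a bug.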