Bug on evaluation during training

TengliEd commented 1 year ago

@tianrun-chen In eval_psnr(), the lines pred_list = torch.cat(pred_list, 1) and gt_list = torch.cat(gt_list, 1) should be pred_list = torch.cat(pred_list, 0) and gt_list = torch.cat(gt_list, 0).

hkxiao commented 1 year ago

I ran into the same problem.

jiaweichaojwc commented 9 months ago

@TengliEd Hello, I ran into a problem during evaluation. Could you tell me what this error message means? Did you get the following error?

Traceback (most recent call last):
  File "train.py", line 279, in <module>
    main(config, save_path, args=args)
  File "train.py", line 214, in main
    result1, result2, result3, result4, metric1, metric2, metric3, metric4 = eval_psnr(val_loader, model,
  File "train.py", line 89, in eval_psnr
    pred_list = torch.cat(pred_list, 1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2 but got size 1 for tensor number 92 in the list.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 21242) of binary: /usr/local/miniconda3/bin/python
Traceback (most recent call last):
  File "/usr/local/miniconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(

  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time : 2023-11-12_14:41:17
  host : I1637d69a3200801799
  rank : 2 (local_rank: 2)
  exitcode : 1 (pid: 21244)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time : 2023-11-12_14:41:17
  host : I1637d69a3200801799
  rank : 3 (local_rank: 3)
  exitcode : 1 (pid: 21245)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time : 2023-11-12_14:41:17
  host : I1637d69a3200801799
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 21242)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Heroman2Space commented 8 months ago

I believe it should be torch.cat(pred_list, 0) instead of torch.cat(pred_list, 1)
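
The RuntimeError above can be reproduced with a minimal sketch. It assumes each entry of pred_list has shape (batch, 1, H, W) and that the last validation batch is smaller than the others; the tensor sizes here are illustrative and not taken from the repository.

```python
import torch

# Hypothetical per-batch predictions: two full batches of 2 samples
# and a final, smaller batch of 1 sample.
pred_list = [
    torch.rand(2, 1, 4, 4),
    torch.rand(2, 1, 4, 4),
    torch.rand(1, 1, 4, 4),
]

# Concatenating along dim 1 requires every other dimension, including
# the batch dimension, to match, so the smaller last batch raises.
try:
    torch.cat(pred_list, 1)
    cat_dim1_failed = False
except RuntimeError as err:
    cat_dim1_failed = True
    print("torch.cat(pred_list, 1) failed:", err)

# Concatenating along dim 0 stacks the samples of all batches.
all_preds = torch.cat(pred_list, 0)
print(all_preds.shape)  # torch.Size([5, 1, 4, 4])
```

With dimension 0 the result holds one entry per validation sample, regardless of how the samples were split into batches.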

JiahaoXia commented 2 hours ago

According to https://github.com/tianrun-chen/SAM-Adapter-PyTorch/blob/main/utils.py#L122

def calc_cod(y_pred, y_true):
    batchsize = y_true.shape[0]

    metric_FM = sod_metric.Fmeasure()
    metric_WFM = sod_metric.WeightedFmeasure()
    metric_SM = sod_metric.Smeasure()
    metric_EM = sod_metric.Emeasure()
    metric_MAE = sod_metric.MAE()
    with torch.no_grad():
        assert y_pred.shape == y_true.shape

        for i in range(batchsize):
            true, pred = \
                y_true[i, 0].cpu().data.numpy() * 255, y_pred[i, 0].cpu().data.numpy() * 255

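calc_cod iterates over the batch dimension (y_true.shape[0]) and reads channel 0 of each sample. After torch.cat(pred_list, 1), dimension 0 is still the size of a single batch, so only one batch of samples would be evaluated. The code is meant to evaluate on the whole validation set (len(val set)), which requires concatenating along dimension 0.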
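
A small sketch of this effect, assuming all batches have the same size so that concatenation along dimension 1 does not raise. The helper count_evaluated is hypothetical; it only mirrors the indexing pattern of the loop in calc_cod and does not compute any metric.

```python
import torch

batch_size, num_batches = 2, 3  # illustrative values
gt_list = [torch.rand(batch_size, 1, 4, 4) for _ in range(num_batches)]

wrong = torch.cat(gt_list, 1)  # shape (2, 3, 4, 4): batches land in the channel dim
right = torch.cat(gt_list, 0)  # shape (6, 1, 4, 4): one row per sample


def count_evaluated(y_true):
    """Count the samples visited by a loop over y_true.shape[0] (hypothetical helper)."""
    visited = 0
    for i in range(y_true.shape[0]):
        _ = y_true[i, 0]  # same indexing as in calc_cod
        visited += 1
    return visited


print(count_evaluated(wrong))  # 2
print(count_evaluated(right))  # 6

# With dim 1, channel 0 holds only the samples of the first batch.
print(torch.equal(wrong[:, 0], gt_list[0][:, 0]))  # True
```

In this setup the dimension-1 variant runs without an error but scores only the first batch, which is why the bug can go unnoticed when the validation set size is a multiple of the batch size.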
