loli / medpy

Medical image processing in Python
http://loli.github.io/medpy/
GNU General Public License v3.0

Batch size influences test performance #132

Closed: pepperbubble closed this issue 1 month ago

pepperbubble commented 1 month ago

When I change the batch size at test time, the Dice score I get is different:

import logging

import numpy as np
import torch
import torch.nn.functional as F
from tqdm import tqdm
from medpy import metric

model.eval()
metric_list_dice = []
metric_list_hd = []
with torch.no_grad():
    for sample in tqdm(dataloader):
        images, true_masks, name = sample['image'].cuda(), sample['mask'].numpy(), sample['name']
        masks_pred = model(images)

        if model.n_classes == 1:
            # (N, 1, H, W) logits -> (N, H, W) binary masks
            masks_pred = F.sigmoid(masks_pred).cpu().numpy().squeeze(1)
            masks_pred = np.where(masks_pred > args.mask_threshold, 1, 0)

        # metrics are computed on the whole batch array at once
        dice = metric.binary.dc(masks_pred, true_masks)
        hd95 = metric.binary.hd95(masks_pred, true_masks)

        metric_list_dice.append(dice)
        metric_list_hd.append(hd95)

    logging.info('Mean Dice: %f' % np.mean(metric_list_dice))
    logging.info('Mean HD95: %f' % np.mean(metric_list_hd))

bs = 1:  INFO: Mean Dice: 0.868615  INFO: Mean HD95: 10.437061
bs = 2:  INFO: Mean Dice: 0.872097  INFO: Mean HD95: 9.246552
bs = 64: INFO: Mean Dice: 0.878093  INFO: Mean HD95: 2.692308

loli commented 1 month ago

Hey @pepperbubble, thanks for the report.

Looking at your code, I see that you pass multiple predicted and true masks at once to the metric functions.

The metric functions are not made to deal with batch tensors. Instead, as you use them, they interpret the input as a 4D image (the original 3D image plus an additional dimension from the batches). That will lead to different results for different batch sizes.
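To make that concrete, here is a minimal sketch with two hypothetical toy masks (not from your data): averaging the per-mask Dice scores gives a different number than calling dc once on the stacked batch, because the batched call pools the overlaps and volumes across samples before computing the ratio.

import numpy as np
from medpy import metric

# hypothetical toy masks, purely for illustration
pred_a = np.array([[1, 1, 0, 0]])
true_a = np.array([[1, 0, 0, 0]])
pred_b = np.array([[1, 1, 1, 1]])
true_b = np.array([[1, 1, 1, 1]])

# mean of per-sample Dice scores
per_sample = np.mean([metric.binary.dc(pred_a, true_a),
                      metric.binary.dc(pred_b, true_b)])   # ~0.833

# Dice of the stacked "batch": overlaps and volumes are pooled first
pooled = metric.binary.dc(np.stack([pred_a, pred_b]),
                          np.stack([true_a, true_b]))      # ~0.909

print(per_sample, pooled)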

Instead, you can loop over the batch and compute the metric per sample:

for mask_pred, true_mask in zip(masks_pred, true_masks):
    dice = metric.binary.dc(mask_pred, true_mask)
    metric_list_dice.append(dice)
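A sketch of the same per-sample loop extended to HD95 as well, reusing the variable names from your snippet (note that hd95 cannot be computed when either mask contains no foreground, hence the guard):

for mask_pred, true_mask in zip(masks_pred, true_masks):
    # each item is now a single mask, not a batch
    metric_list_dice.append(metric.binary.dc(mask_pred, true_mask))
    # hd95 raises an error if either mask is empty, so guard that case
    if mask_pred.any() and true_mask.any():
        metric_list_hd.append(metric.binary.hd95(mask_pred, true_mask))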
pepperbubble commented 1 month ago

Thanks! That really helps.