mapillary / inplace_abn

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs
BSD 3-Clause "New" or "Revised" License

Performance drop when replacing the old version of inplace with pytorch 1.0 version #84

Closed: rotabulo closed this issue 5 years ago

rotabulo commented 5 years ago

The performance of the above experiments (DeepLab v3) dropped to 74% when using the new PyTorch 1.0 version of inplace_abn. Is there anything to be careful about when replacing the old version with the PyTorch 1.0 version? I only copied the files from ./modules to ./libs in https://github.com/speedinghzl/pytorch-segmentation-toolbox

_Originally posted by @lzrobots in https://github.com/mapillary/inplace_abn/issues/15#issuecomment-458220361_

rotabulo commented 5 years ago

@lzrobots I created an ad-hoc issue.

rotabulo commented 5 years ago

@lzrobots do you see this gap only with the synchronized version, or also with the plain one?

lzrobots commented 5 years ago

Thanks. It's the synchronized version (InPlaceABNSync). I didn't change anything in the above DeepLab v3 toolbox except the libs folder. I am training it again.

rotabulo commented 5 years ago

@lzrobots One issue I see is the following. Our new layer is supposed to work with DistributedDataParallel. I suspect that in your case it is acting as if it were not synchronized, because the layer falls back to world_size=1 (https://github.com/mapillary/inplace_abn/blob/master/modules/functions.py#L152).
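
A minimal sketch of what this fallback means in practice (illustrative only, not the repository's code): if no torch.distributed process group has been initialized, e.g. when the model is wrapped in plain DataParallel, the synchronized layer effectively behaves like the unsynchronized one.

import torch.distributed as dist

def sync_is_active():
    # With no initialized process group (e.g. plain DataParallel or a single
    # process), synchronized statistics degrade to per-GPU statistics.
    if not (dist.is_available() and dist.is_initialized()):
        print("torch.distributed is not initialized: InPlaceABNSync will act "
              "like a non-synchronized InPlaceABN (world_size = 1)")
        return False
    print("Synchronizing BN statistics across {} processes".format(dist.get_world_size()))
    return True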

lzrobots commented 5 years ago

Thanks. Revised the code and training it now.

rotabulo commented 5 years ago

@lzrobots did switching to DistributedDataParallel solve the discrepancy issue?

lzrobots commented 5 years ago

Got 77.5% on DeepLab v3. That's still about 1% lower than the old version, but much better than the non-distributed run.

lzrobots commented 5 years ago

Hi, when you do distributed training:

  1. for the batch size in torch.utils.data.DataLoader, do you use the non-distributed batch_size // world_size?
  2. don't you need to reduce the loss and divide by world_size before loss.backward()?
  3. do you keep the learning rate the same as the non-distributed lr? Thanks

lzrobots commented 5 years ago

  1. does this make much difference? https://github.com/mapillary/inplace_abn/blob/master/train_imagenet.py#L204

rotabulo commented 5 years ago

Hi, when you do distributed training:

  1. for the batch size in torch.utils.data.DataLoader, do you use the non-distributed batch_size // world_size?

In the distributed setting each DataLoader provides data for just one GPU, so the batch size there should be the per-GPU batch size.

  2. don't you need to reduce the loss and divide by world_size before loss.backward()?

The reduction across GPUs is done by DistributedDataParallel, so you don't need to take care of it.

  3. do you keep the learning rate the same as the non-distributed lr? Thanks

Typically you increase the learning rate linearly when you increase the batch size (see the "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" paper).
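
A hedged sketch tying the three answers together (illustrative only; the dummy dataset, GLOBAL_BATCH_SIZE, BASE_LR and REFERENCE_BATCH_SIZE below are placeholders, not values from this thread):

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

GLOBAL_BATCH_SIZE = 8       # total batch size summed over all GPUs (placeholder)
BASE_LR = 1e-2              # learning rate tuned for REFERENCE_BATCH_SIZE (placeholder)
REFERENCE_BATCH_SIZE = 8    # global batch size the BASE_LR was tuned for (placeholder)

# Dummy dataset just to keep the sketch self-contained.
dataset = TensorDataset(torch.randn(64, 3, 65, 65), torch.randint(0, 19, (64,)))

world_size = dist.get_world_size() if dist.is_available() and dist.is_initialized() else 1
per_gpu_batch_size = GLOBAL_BATCH_SIZE // world_size    # each DataLoader feeds exactly one GPU

sampler = DistributedSampler(dataset) if world_size > 1 else None
loader = DataLoader(dataset, batch_size=per_gpu_batch_size, shuffle=(sampler is None),
                    sampler=sampler, num_workers=4, pin_memory=True)

# Linear scaling rule ("Training ImageNet in 1 Hour"): the learning rate follows
# the global batch size, so it stays unchanged here and grows linearly only if
# the global batch size grows.
lr = BASE_LR * GLOBAL_BATCH_SIZE / REFERENCE_BATCH_SIZE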

lzrobots commented 5 years ago

OK. I reproduced the old version's results with the PyTorch 1.0 version.

rotabulo commented 5 years ago

@lzrobots thanks for the confirmation. I am closing the issue.

lzrobots commented 5 years ago

I use 4 GPUs for DistributedDataParallel training. Below is my nvidia-smi output. There are three extra processes on GPU 3 taking 1217 MiB each, which leads to unbalanced memory usage: GPU 3 uses 3000+ MiB more than GPUs 0, 1, and 2. Is this a normal situation?

GPU  PID    Type  Process name                                 GPU Memory Usage
0    19421  C     ...a3/envs/pytorch1.0_python3.7/bin/python   5155MiB
1    19422  C     ...a3/envs/pytorch1.0_python3.7/bin/python   5151MiB
2    19423  C     ...a3/envs/pytorch1.0_python3.7/bin/python   5157MiB
3    19420  C     ...a3/envs/pytorch1.0_python3.7/bin/python   5797MiB
3    19421  C     ...a3/envs/pytorch1.0_python3.7/bin/python   1217MiB
3    19422  C     ...a3/envs/pytorch1.0_python3.7/bin/python   1217MiB
3    19423  C     ...a3/envs/pytorch1.0_python3.7/bin/python   1217MiB

rotabulo commented 5 years ago

@lzrobots I guess this is not happening with the scripts we provide. Is this a script you wrote? Are you loading some pre-trained model? It looks like it is loaded onto GPU 3 by all processes and those buffers are never released.
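
If the cause is indeed the checkpoint, a common workaround (sketched below under that assumption; 'checkpoint.pth' and local_rank are placeholders) is to map the checkpoint to the CPU, or to each process's own GPU, when calling torch.load, so that all ranks do not deserialize onto the GPU the checkpoint was saved from:

import torch

local_rank = 0  # placeholder; normally taken from --local_rank or an environment variable

# Deserialize onto the CPU so that no single GPU accumulates one copy per process ...
state_dict = torch.load('checkpoint.pth', map_location='cpu')

# ... or remap the storages directly onto this process's own GPU.
state_dict = torch.load('checkpoint.pth',
                        map_location=lambda storage, loc: storage.cuda(local_rank))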

lzrobots commented 5 years ago

Thanks.

denru01 commented 5 years ago

@lzrobots I encountered both of your issues: 1) slightly lower accuracy, and 2) higher GPU memory usage on one GPU when I load a pre-trained model. Could you tell us how you solved them? Thanks!

lzrobots commented 5 years ago

@denru01 1) The performance drop is due to not using DistributedDataParallel. 2) I didn't encounter the memory problem after switching to another segmentation code base that uses the inplace module with DistributedDataParallel.

mingrui-xie commented 5 years ago

@lzrobots I met the same issue too. I use the project https://github.com/speedinghzl/pytorch-segmentation-toolbox, which is the same as yours, but I don't know how to correctly change DataParallel to DistributedDataParallel. I have already tried using DistributedDataParallel instead of DataParallel, but the result is still lower than with the old version of BN. I don't know whether I used DistributedDataParallel incorrectly. Would you mind sending me the code or telling me how to use it correctly? Thank you very much.

mingrui-xie commented 5 years ago

@lzrobots when I use DistributedDataParallel, the result is even lower than with DataParallel in the toolbox's PSPNet.

Here is the code I use; I only modified train.py:


import argparse
import torch
import torch.nn as nn
from torch.utils import data
import numpy as np
import pickle
import cv2
import torch.optim as optim
import scipy.misc
import torch.backends.cudnn as cudnn
import sys
import os
from tqdm import tqdm
import os.path as osp
from networks.pspnet import Res_Deeplab
from dataset.datasets import CSDataSet
import random
import timeit
import logging
from tensorboardX import SummaryWriter
from utils.utils import decode_labels, inv_preprocess, decode_predictions
from utils.criterion import CriterionDSN, CriterionOhemDSN
from utils.encoding import DataParallelModel, DataParallelCriterion
import torch.distributed as dist
import torch.nn.parallel
import torch.utils.data.distributed

torch_ver = torch.__version__[:3]
if torch_ver == '0.3':
    from torch.autograd import Variable

start = timeit.default_timer()

IMG_MEAN = np.array((104.00698793,116.66876762,122.67891434), dtype=np.float32)

BATCH_SIZE = 8
DATA_DIRECTORY = 'cityscapes'
DATA_LIST_PATH = './dataset/list/cityscapes/train.lst'
IGNORE_LABEL = 255
INPUT_SIZE = '769,769'
LEARNING_RATE = 1e-2
MOMENTUM = 0.9
NUM_CLASSES = 19
NUM_STEPS = 40000
POWER = 0.9
RANDOM_SEED = 1234
RESTORE_FROM = '../resnet101-imagenet.pth'
SAVE_NUM_IMAGES = 2
SAVE_PRED_EVERY = 1000
SNAPSHOT_DIR = 'debug'
WEIGHT_DECAY = 0.0005
GPU='4,5,6,7'

def str2bool(v):
    if v.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif v.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

def get_arguments():
    """Parse all the arguments provided from the CLI.
    Returns:
      A list of parsed arguments.
    """
    parser = argparse.ArgumentParser(description="DeepLab-ResNet Network")
    parser.add_argument("--batch-size", type=int, default=BATCH_SIZE,
                        help="Number of images sent to the network in one step.")
    parser.add_argument("--data-dir", type=str, default=DATA_DIRECTORY,
                        help="Path to the directory containing the PASCAL VOC dataset.")
    parser.add_argument("--data-list", type=str, default=DATA_LIST_PATH,
                        help="Path to the file listing the images in the dataset.")
    parser.add_argument("--ignore-label", type=int, default=IGNORE_LABEL,
                        help="The index of the label to ignore during the training.")
    parser.add_argument("--input-size", type=str, default=INPUT_SIZE,
                        help="Comma-separated string with height and width of images.")
    parser.add_argument("--is-training", action="store_true",
                        help="Whether to updates the running means and variances during the training.")
    parser.add_argument("--learning-rate", type=float, default=LEARNING_RATE,
                        help="Base learning rate for training with polynomial decay.")
    parser.add_argument("--momentum", type=float, default=MOMENTUM,
                        help="Momentum component of the optimiser.")
    parser.add_argument("--not-restore-last", action="store_true",
                        help="Whether to not restore last (FC) layers.")
    parser.add_argument("--num-classes", type=int, default=NUM_CLASSES,
                        help="Number of classes to predict (including background).")
    parser.add_argument("--start-iters", type=int, default=0,
                        help="Number of classes to predict (including background).")
    parser.add_argument("--num-steps", type=int, default=NUM_STEPS,
                        help="Number of training steps.")
    parser.add_argument("--power", type=float, default=POWER,
                        help="Decay parameter to compute the learning rate.")
    parser.add_argument("--random-mirror", action="store_true",
                        help="Whether to randomly mirror the inputs during the training.")
    parser.add_argument("--random-scale", action="store_true",
                        help="Whether to randomly scale the inputs during the training.")
    parser.add_argument("--random-seed", type=int, default=RANDOM_SEED,
                        help="Random seed to have reproducible results.")
    parser.add_argument("--restore-from", type=str, default=RESTORE_FROM,
                        help="Where restore model parameters from.")
    parser.add_argument("--save-num-images", type=int, default=SAVE_NUM_IMAGES,
                        help="How many images to save.")
    parser.add_argument("--save-pred-every", type=int, default=SAVE_PRED_EVERY,
                        help="Save summaries and checkpoint every often.")
    parser.add_argument("--snapshot-dir", type=str, default=SNAPSHOT_DIR,
                        help="Where to save snapshots of the model.")
    parser.add_argument("--weight-decay", type=float, default=WEIGHT_DECAY,
                        help="Regularisation parameter for L2-loss.")
    parser.add_argument("--gpu", type=str, default=GPU,
                        help="choose gpu device.")
    parser.add_argument("--recurrence", type=int, default=1,
                        help="choose the number of recurrence.")
    parser.add_argument("--ft", type=bool, default=False,
                        help="fine-tune the model with large input size.")

    parser.add_argument("--ohem", type=str2bool, default='False',
                        help="use hard negative mining")
    parser.add_argument("--ohem-thres", type=float, default=0.6,
                        help="choose the samples with correct probability underthe threshold.")
    parser.add_argument("--ohem-keep", type=int, default=200000,
                        help="choose the samples with correct probability underthe threshold.")
    parser.add_argument('--local_rank', default=0, type=int,
                        help='process rank on node')
    parser.add_argument('--dist-backend', default='nccl', type=str,
                        help='distributed backend')
    return parser.parse_args()

args = get_arguments()

def lr_poly(base_lr, iter, max_iter, power):
    return base_lr*((1-float(iter)/max_iter)**(power))

def adjust_learning_rate(optimizer, i_iter):
    """Sets the learning rate to the initial LR divided by 5 at 60th, 120th and 160th epochs"""
    lr = lr_poly(args.learning_rate, i_iter, args.num_steps, args.power)
    optimizer.param_groups[0]['lr'] = lr
    return lr

def set_bn_eval(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()

def set_bn_momentum(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1 or classname.find('InPlaceABN') != -1:
        m.momentum = 0.0003

def main():
    """Create the model and start the training."""
    writer = SummaryWriter(args.snapshot_dir)

    if not args.gpu == 'None':
        os.environ["CUDA_VISIBLE_DEVICES"]=args.gpu
    h, w = map(int, args.input_size.split(','))
    input_size = (h, w)

    cudnn.enabled = True

    # Create network.
    deeplab = Res_Deeplab(num_classes=args.num_classes)
    print(deeplab)

    saved_state_dict = torch.load(args.restore_from)
    new_params = deeplab.state_dict().copy()
    for i in saved_state_dict:
        #Scale.layer5.conv2d_list.3.weight
        i_parts = i.split('.')
        # print i_parts
        # if not i_parts[1]=='layer5':
        if not i_parts[0]=='fc':
            new_params['.'.join(i_parts[0:])] = saved_state_dict[i] 

    deeplab.load_state_dict(new_params)

    # Distributed setup: bind this process to its GPU and join the process group
    # started by torch.distributed.launch (init_method='env://' reads WORLD_SIZE,
    # RANK, MASTER_ADDR and MASTER_PORT from the environment).
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend=args.dist_backend, init_method='env://')
    rank = dist.get_rank()
    world_size = int(os.environ['WORLD_SIZE'])
    deeplab.cuda()
    model = torch.nn.parallel.DistributedDataParallel(deeplab, device_ids=[args.local_rank],
                                                          output_device=args.local_rank)

    #model = nn.parallel.DataParallel(deeplab)
    model.train()
    model.float()
    # model.apply(set_bn_momentum)
    model.cuda()    

    if args.ohem:
        criterion = CriterionOhemDSN(thresh=args.ohem_thres, min_kept=args.ohem_keep)
    else:
        criterion = CriterionDSN() #CriterionCrossEntropy()
    #criterion = DataParallelCriterion(criterion)
    criterion.cuda()

    cudnn.benchmark = True

    if not os.path.exists(args.snapshot_dir):
        os.makedirs(args.snapshot_dir)
    cs_dataset = CSDataSet(args.data_dir, args.data_list, max_iters=args.num_steps * args.batch_size, crop_size=input_size,
              scale=args.random_scale, mirror=args.random_mirror, mean=IMG_MEAN)
    # Each process sees its own shard of the dataset, and the DataLoader gets the
    # per-GPU batch size (global batch size divided by world_size).
    train_sampler = torch.utils.data.distributed.DistributedSampler(cs_dataset)
    trainloader = data.DataLoader(cs_dataset,
                    batch_size=args.batch_size // world_size, shuffle=(train_sampler is None), num_workers=4, pin_memory=True, sampler=train_sampler)

    optimizer = optim.SGD([{'params': filter(lambda p: p.requires_grad, deeplab.parameters()), 'lr': args.learning_rate }], 
                lr=args.learning_rate, momentum=args.momentum,weight_decay=args.weight_decay)
    optimizer.zero_grad()

    interp = nn.Upsample(size=input_size, mode='bilinear', align_corners=True)

    for i_iter, batch in enumerate(trainloader):
        train_sampler.set_epoch(i_iter)
        i_iter += args.start_iters
        images, labels, _, _ = batch
        images = images.cuda()
        labels = labels.long().cuda()
        if torch_ver == "0.3":
            images = Variable(images)
            labels = Variable(labels)

        optimizer.zero_grad()
        lr = adjust_learning_rate(optimizer, i_iter)
        preds = model(images)

        loss = criterion(preds, labels)
        loss.backward()
        optimizer.step()

        if dist.get_rank() == 0:
            if i_iter % 100 == 0:
                writer.add_scalar('learning_rate', lr, i_iter)
                writer.add_scalar('loss', loss.data.cpu().numpy(), i_iter)
            if i_iter % 10 == 0:
                writer.add_scalar('loss10', loss.data.cpu().numpy(), i_iter)

        # if i_iter % 5000 == 0:
        #     images_inv = inv_preprocess(images, args.save_num_images, IMG_MEAN)
        #     labels_colors = decode_labels(labels, args.save_num_images, args.num_classes)
        #     if isinstance(preds, list):
        #         preds = preds[0]
        #     preds_colors = decode_predictions(preds, args.save_num_images, args.num_classes)
        #     for index, (img, lab) in enumerate(zip(images_inv, labels_colors)):
        #         writer.add_image('Images/'+str(index), img, i_iter)
        #         writer.add_image('Labels/'+str(index), lab, i_iter)
        #         writer.add_image('preds/'+str(index), preds_colors[index], i_iter)

        print('iter = {} of {} completed, loss = {}'.format(i_iter, args.num_steps, loss.data.cpu().numpy()))

        if rank == 0:
            if i_iter >= args.num_steps-1:
                print('save model ...')
                torch.save(deeplab.state_dict(),osp.join(args.snapshot_dir, 'CS_scenes_'+str(args.num_steps)+'.pth'))
                break

            if i_iter % args.save_pred_every == 0:
                print('taking snapshot ...')
                torch.save(deeplab.state_dict(),osp.join(args.snapshot_dir, 'CS_scenes_'+str(i_iter)+'.pth'))

    end = timeit.default_timer()
    print(end-start,'seconds')

if __name__ == '__main__':
    main()

lzrobots commented 5 years ago

@denru01 @asfavdfqefc How much did you get on Cityscapes val? I took a look at the code above and it's almost the same as mine, except that I use shuffle=False since train_sampler already takes care of the sampling. Correct me if I am wrong. I set the same learning rate as before.

mingrui-xie commented 5 years ago

@lzrobots I got 75.3% mIoU on the Cityscapes val set (it should be 78.3%!), and the parameters I set, such as the learning rate, weight decay and so on, all follow the toolbox's default settings. I also end up with shuffle=False:

 trainloader = data.DataLoader(cs_dataset,
                    batch_size=args.batch_size // world_size, shuffle=(train_sampler is None), num_workers=4, pin_memory=True, sampler=train_sampler)

Here shuffle evaluates to False because the sampler is not None.

So, is it all the same? That's really strange. Would you mind sending me your code please?

I also have some doubts about the loss. When training the model (on 4 GPUs), the output looks like this:

iter = 10032 of 40000 completed, loss = 0.2960323095321655
iter = 10032 of 40000 completed, loss = 0.321418434381485
iter = 10032 of 40000 completed, loss = 0.268339604139328
iter = 10032 of 40000 completed, loss = 0.22530245780944824
iter = 10033 of 40000 completed, loss = 0.24178841710090637
iter = 10033 of 40000 completed, loss = 0.7654654383659363
iter = 10033 of 40000 completed, loss = 0.24938637018203735
iter = 10033 of 40000 completed, loss = 0.25773030519485474
iter = 10044 of 40000 completed, loss = 0.20511074364185333
iter = 10044 of 40000 completed, loss = 0.1646052747964859
iter = 10044 of 40000 completed, loss = 0.19308854639530182
iter = 10044 of 40000 completed, loss = 0.6363773345947266
iter = 10045 of 40000 completed, loss = 0.18887627124786377
iter = 10045 of 40000 completed, loss = 0.15226246416568756
iter = 10045 of 40000 completed, loss = 0.1867300570011139
iter = 10045 of 40000 completed, loss = 0.14707821607589722

There are four processes, and each process outputs a loss that differs from the others; I'm not sure whether these losses are combined to compute the final loss.
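
For what it's worth, a hedged sketch of one way to log a single averaged loss across processes (DistributedDataParallel already averages gradients during backward(), so this reduction would only affect what gets printed, not training):

import torch
import torch.distributed as dist

def global_mean_loss(loss):
    # Work on a detached copy so the reduction never touches the autograd graph.
    logged = loss.detach().clone()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(logged, op=dist.ReduceOp.SUM)
        logged /= dist.get_world_size()
    return logged  # identical value on every rank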

In conclusion:

  1. Is the output I show above the same as yours?
  2. Did you delete this line?

criterion = DataParallelCriterion(criterion)

  3. I use this script to train the model; is it right?

python -m torch.distributed.launch --nproc_per_node 4 train.py

  4. Could you send me your code? Thank you very much!