vikrant7 / mobile-vod-bottleneck-lstm

Implementation of Mobile Video Object Detection with Temporally-Aware Feature Maps using PyTorch

Cannot understand the forward part of bottleneckLSTM #8

Open Mindbooom opened 5 years ago

Mindbooom commented 5 years ago

Hi, I noticed that in the paper the forward pass is defined as [image], but in your code this part is [image]. Why do you add c*self.wci and c*self.wcf in the code, feeding c_{t-1} into the gate functions? You also involve cc in the calculation of co, which is also different from the paper. What is the reason for that? [image] Thank you very much!

Mindbooom commented 5 years ago

Hi, I now have another question. I tried to use 4 GPUs to train lstm1 by inserting this code: net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3]) As a result, the input images are split across the GPUs, but the hidden state h and the cell state c are not split across GPUs 1, 2 and 3. I cannot find a solution for this. Do you have any advice?
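
One likely explanation, offered as an assumption rather than a confirmed diagnosis of this repository: `torch.nn.DataParallel` splits only the tensors passed as arguments to `forward` along dimension 0. Tensors stored as plain attributes on the module, such as a hidden state created in `__init__`, are not scattered. The sketch below passes the state through `forward` instead; `TinyRecurrent` is a hypothetical stand-in for the real network, not code from the repo.

```python
import torch
import torch.nn as nn


class TinyRecurrent(nn.Module):
    """Hypothetical stand-in for the detector: state enters and leaves via forward()."""

    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x, h, c):
        # x, h and c are all forward() arguments, so DataParallel scatters
        # each of them along dim 0 and every replica sees matching slices.
        gate = torch.sigmoid(self.mix(torch.cat((x, h), dim=1)))
        c = gate * c + (1 - gate) * x
        h = torch.tanh(c)
        return h, c


net = TinyRecurrent(channels=8)
if torch.cuda.device_count() > 1:
    # Only wrap the model when several GPUs are actually present.
    net = nn.DataParallel(net.cuda())

device = next(net.parameters()).device
x = torch.randn(4, 8, 10, 10, device=device)
h = torch.zeros(4, 8, 10, 10, device=device)
c = torch.zeros(4, 8, 10, 10, device=device)
h, c = net(x, h, c)  # outputs are gathered back along dim 0
```

The batch dimension of `h` and `c` has to match that of `x`, since all three are chunked the same way before being handed to the replicas.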

vikrant7 commented 5 years ago

Thanks @Mindbooom for pointing out the difference in the ConvLSTM definition I used. The ConvLSTM I used came from other papers on ConvLSTM. I have now updated the ConvLSTM layer to follow the definition in this paper.
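
For reference, the two formulations being compared can be written out. The first set is the peephole ConvLSTM form found in the ConvLSTM literature; it contains the terms the question describes ($W_{ci} \circ c_{t-1}$, $W_{cf} \circ c_{t-1}$, and $c_t$ entering the output gate), but that the earlier code followed exactly this form is an assumption. The second set is transcribed from the `forward` method posted later in this thread, with $\phi$ denoting ReLU6, $*$ convolution, $\circ$ the elementwise product, and symbol names chosen here for illustration.

```latex
% Peephole ConvLSTM (assumed to match the earlier code)
\begin{aligned}
i_t &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} \circ c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} \circ c_{t-1} + b_f) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} \circ c_t + b_o) \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}

% Bottleneck LSTM as written in the forward() method posted in this thread
\begin{aligned}
\tilde{x}_t &= W * x_t \\
b_t &= W_i * \left( W_y * [\tilde{x}_t, h_{t-1}] \right) \\
i_t &= \sigma(W_{bi} * b_t) \\
f_t &= \sigma(W_{bf} * b_t) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \phi(W_{bc} * b_t) \\
o_t &= \sigma(W_{bo} * b_t) \\
h_t &= o_t \circ \phi(c_t)
\end{aligned}
```

In the second form the gates depend on $c_{t-1}$ and $c_t$ only through the cell update itself, which is the difference the question points at.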

vikrant7 commented 5 years ago

For multi-GPU training, I will update the repo after 23rd November, as I am currently quite occupied with other things. In the meantime, you can have a look at asynchronous gradient descent training for multiple GPUs, which is used in this paper, and try to implement it.

Mindbooom commented 5 years ago

Hi @vikrant7! I have written code for multi-GPU training by changing how h and c are initialized and updated. But training is extremely slow and I don't know why. I'll share the code here. '''mvod_bottleneck_lstm1_multigpu.py'''

#!/usr/bin/python3
"""Script for creating basenet with one Bottleneck LSTM layer after conv 13.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from typing import List, Tuple
from utils import box_utils
from collections import namedtuple
from collections import OrderedDict
from torch.autograd import Variable
import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import numpy as np
import logging

def SeperableConv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0):
    """Replace Conv2d with a depthwise Conv2d and Pointwise Conv2d.
    Arguments:
        in_channels : number of channels of input
        out_channels : number of channels of output
        kernel_size : kernel size for depthwise convolution
        stride : stride for depthwise convolution
        padding : padding for depthwise convolution
    Returns:
        object of class torch.nn.Sequential
    """
    return nn.Sequential(
        nn.Conv2d(in_channels=int(in_channels), out_channels=int(in_channels), kernel_size=kernel_size,
                  groups=int(in_channels), stride=stride, padding=padding),
        nn.ReLU6(),
        nn.Conv2d(in_channels=int(in_channels), out_channels=int(out_channels), kernel_size=1),
    )

def conv_bn(inp, oup, stride):
    """3x3 conv with batchnorm and relu
    Arguments:
        inp : number of channels of input
        oup : number of channels of output
        stride : stride for depthwise convolution
    Returns:
        object of class torch.nn.Sequential
    """
    return nn.Sequential(
        nn.Conv2d(int(inp), int(oup), 3, stride, 1, bias=False),
        nn.BatchNorm2d(int(oup)),
        nn.ReLU6(inplace=True)
    )

def conv_dw(inp, oup, stride):
    """Replace Conv2d with a depthwise Conv2d and Pointwise Conv2d having batchnorm and relu layers in between.
    Here kernel size is fixed at 3.
    Arguments:
        inp : number of channels of input
        oup : number of channels of output
        stride : stride for depthwise convolution
    Returns:
        object of class torch.nn.Sequential
    """
    return nn.Sequential(
        nn.Conv2d(int(inp), int(inp), 3, stride, 1, groups=int(inp), bias=False),
        nn.BatchNorm2d(int(inp)),
        nn.ReLU6(inplace=True),

        nn.Conv2d(int(inp), int(oup), 1, 1, 0, bias=False),
        nn.BatchNorm2d(int(oup)),
        nn.ReLU6(inplace=True),
    )

class MatchPrior(object):
    """Matches priors based on the SSD prior config
    Arguments:
        center_form_priors : priors generated based on specs and image size in config file
        center_variance : a float used to change the scale of center
        size_variance : a float used to change the scale of size
        iou_threshold : a float threshold on IoU
    """

    def __init__(self, center_form_priors, center_variance, size_variance, iou_threshold):
        self.center_form_priors = center_form_priors
        self.corner_form_priors = box_utils.center_form_to_corner_form(center_form_priors)
        self.center_variance = center_variance
        self.size_variance = size_variance
        self.iou_threshold = iou_threshold

    def __call__(self, gt_boxes, gt_labels):
        """
        Arguments:
            gt_boxes : ground truth boxes
            gt_labels : ground truth labels
        Returns:
            locations of form (batch_size, num_priors, 4) and labels
        """
        if type(gt_boxes) is np.ndarray:
            gt_boxes = torch.from_numpy(gt_boxes)
        if type(gt_labels) is np.ndarray:
            gt_labels = torch.from_numpy(gt_labels)
        boxes, labels = box_utils.assign_priors(gt_boxes, gt_labels,
                                                self.corner_form_priors, self.iou_threshold)
        boxes = box_utils.corner_form_to_center_form(boxes)
        locations = box_utils.convert_boxes_to_locations(boxes, self.center_form_priors, self.center_variance,
                                                         self.size_variance)
        return locations, labels

'''
class BottleneckLSTMCell(nn.Module):
    """ Creates a LSTM layer cell
    Arguments:
        input_channels : variable used to contain value of number of channels in input
        hidden_channels : variable used to contain value of number of channels in the hidden state of LSTM cell
    """

    def __init__(self, input_channels, hidden_channels):
        super(BottleneckLSTMCell, self).__init__()

        assert hidden_channels % 2 == 0

        self.input_channels = int(input_channels)
        self.hidden_channels = int(hidden_channels)
        self.num_features = 4
        self.W = nn.Conv2d(in_channels=self.input_channels, out_channels=self.input_channels, kernel_size=3,
                           groups=self.input_channels, stride=1, padding=1)
        self.Wy = nn.Conv2d(int(self.input_channels + self.hidden_channels), self.hidden_channels, kernel_size=1)
        self.Wi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 3, 1, 1, groups=self.hidden_channels,
                            bias=False)
        self.Wbi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbf = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbc = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbo = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.relu = nn.ReLU6()
        logging.info("Initializing weights of lstm")
        self._initialize_weights()

    def _initialize_weights(self):
        """
        Returns:
            initialized weights of the model
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def forward(self, x, h, c):
        # Implemented as described in the paper; the only difference is that
        # Wbi, Wbf, Wbc and Wbo are computed all together in the paper.
        """
        Arguments:
            x : input tensor
            h : hidden state tensor
            c : cell state tensor
        Returns:
            output tensor after LSTM cell
        """
        print('The size of x is', x.size())
        print('The size of h is', h.size())
        print('The size of c is', c.size())
        x = self.W(x)
        y = torch.cat((x, h), 1)  # concatenate input and hidden layers
        i = self.Wy(y)  # reduce to hidden layer size,the bottleneck
        b = self.Wi(i)  # depth wise 3*3 need a pointwise
        ci = torch.sigmoid(self.Wbi(b))
        cf = torch.sigmoid(self.Wbf(b))
        print('The device of cf is',cf.device)
        #print('The device of c is',cc.device)
        print('The device of ci is',ci.device)
        print('The device of b is',b.device)
        print('The device of x is',x.device)
        print('The device of y is',y.device)
        print('The device of i is',i.device)
        print('The device of h is',h.device)
        print('The device of c is',c.device)
        cc = cf * c + ci * self.relu(self.Wbc(b))
        co = torch.sigmoid(self.Wbo(b))
        ch = co * self.relu(cc)
        # print('Wci is ',self.Wci)
        # print('Wcf is ', self.Wcf)
        # print('Wco is ', self.Wco)
        return ch, cc

    def init_hidden(self, batch_size, hidden, shape):
        """
        Arguments:
            batch_size : an int variable having value of batch size while training
            hidden : an int variable having value of number of channels in hidden state
            shape : an array containing shape of the hidden and cell state
        Returns:
            cell state and hidden state
        """
        return (Variable(torch.zeros(batch_size, hidden, shape[0], shape[1])).cuda(),
                Variable(torch.zeros(batch_size, hidden, shape[0], shape[1])).cuda()
                )

class BottleneckLSTM(nn.Module):
    def __init__(self, input_channels, hidden_channels, height, width, batch_size):
        """ Creates Bottleneck LSTM layer
        Arguments:
            input_channels : variable having value of number of channels of input to this layer
            hidden_channels : variable having value of number of channels of hidden state of this layer
            height : an int variable having value of height of the input
            width : an int variable having value of width of the input
            batch_size : an int variable having value of batch_size of the input
        Returns:
            Output tensor of LSTM layer
        """
        super(BottleneckLSTM, self).__init__()
        self.input_channels = int(input_channels)
        self.hidden_channels = int(hidden_channels)
        self.cell = BottleneckLSTMCell(self.input_channels, self.hidden_channels)
        (h, c) = self.cell.init_hidden(batch_size, hidden=self.hidden_channels, shape=(height, width))
        self.hidden_state = h
        self.cell_state = c

    def forward(self, input):
        new_h, new_c = self.cell(input, self.hidden_state, self.cell_state)
        self.hidden_state = new_h
        self.cell_state = new_c
        return self.hidden_state
'''

class BottleneckLSTM(nn.Module):
    def __init__(self, input_channels, hidden_channels, height, width, batch_size):
        """ Creates Bottleneck LSTM layer
        Arguments:
            input_channels : variable having value of number of channels of input to this layer
            hidden_channels : variable having value of number of channels of hidden state of this layer
            height : an int variable having value of height of the input
            width : an int variable having value of width of the input
            batch_size : an int variable having value of batch_size of the input
        Returns:
            Output tensor of LSTM layer
        """
        super(BottleneckLSTM, self).__init__()
        self.input_channels = int(input_channels)
        self.hidden_channels = int(hidden_channels)
        self.batch_size = batch_size
        self.shape = (height,width)
        self.num_features = 4
        self.W = nn.Conv2d(in_channels=self.input_channels, out_channels=self.input_channels, kernel_size=3,
                           groups=self.input_channels, stride=1, padding=1)
        self.Wy = nn.Conv2d(int(self.input_channels + self.hidden_channels), self.hidden_channels, kernel_size=1)
        self.Wi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 3, 1, 1, groups=self.hidden_channels,
                            bias=False)
        self.Wbi = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbf = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbc = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.Wbo = nn.Conv2d(self.hidden_channels, self.hidden_channels, 1, 1, 0, bias=False)
        self.relu = nn.ReLU6()
        logging.info("Initializing weights of lstm")
        self._initialize_weights()
        #self.cell = self.int_BottleneckLSTMCell(self.input_channels, self.hidden_channels)
        #(h, c) = self.cell.init_hidden(batch_size, hidden=self.hidden_channels, shape=(height, width))
        #self.hidden_state = h
        #self.cell_state = c

    def forward(self, x, h, c):
        # Implemented as described in the paper; the only difference is that
        # Wbi, Wbf, Wbc and Wbo are computed all together in the paper.
        """
        Arguments:
            x : input tensor
            h : hidden state tensor
            c : cell state tensor
        Returns:
            output tensor after LSTM cell
        """
        # print('The size of x is', x.size())
        # print('The size of h is', h.size())
        # print('The size of c is', c.size())
        x = self.W(x)
        y = torch.cat((x, h), 1)  # concatenate input and hidden layers
        i = self.Wy(y)  # reduce to hidden layer size,the bottleneck
        b = self.Wi(i)  # depth wise 3*3 need a pointwise
        ci = torch.sigmoid(self.Wbi(b))
        cf = torch.sigmoid(self.Wbf(b))
        cc = cf * c + ci * self.relu(self.Wbc(b))
        co = torch.sigmoid(self.Wbo(b))
        ch = co * self.relu(cc)
        # print('Wci is ',self.Wci)
        # print('Wcf is ', self.Wcf)
        # print('Wco is ', self.Wco)
        return ch, ch, cc  # output, new hidden state, new cell state

    def _initialize_weights(self):
        """
        Returns:
            initialized weights of the model
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def init_hidden(self):
        """
        Returns:
            zero-initialized hidden state and cell state, each of shape
            (batch_size, hidden_channels, height, width), on the same device as the layer weights
        """
        device = self.W.weight.device  # follow the module instead of hard-coding .cuda()
        size = (self.batch_size, self.hidden_channels, self.shape[0], self.shape[1])
        return torch.zeros(size, device=device), torch.zeros(size, device=device)

def crop_like(x, target):
    """
    Arguments:
        x : a tensor whose spatial size (height, width) has to be cropped
        target : a tensor whose spatial size x should match
    Returns:
        x having the same height and width as target
    """
    if x.size()[2:] == target.size()[2:]:
        return x
    height = target.size()[2]
    width = target.size()[3]
    # Half of the negated size difference; negative padding in F.pad crops the tensor.
    crop_h = torch.FloatTensor([x.size()[2]]).sub(height).div(-2)
    crop_w = torch.FloatTensor([x.size()[3]]).sub(width).div(-2)
    # fixed indexing for PyTorch 0.4
    return F.pad(x, [int(crop_w.ceil()[0]), int(crop_w.floor()[0]), int(crop_h.ceil()[0]), int(crop_h.floor()[0])])

class MobileNetV1(nn.Module):
    def __init__(self, num_classes=1024, alpha=1):
        """torch.nn.Module for MobileNetV1 up to conv12
        Arguments:
            num_classes : an int variable having value of total number of classes
            alpha : a float used as width multiplier for channels of model
        """
        super(MobileNetV1, self).__init__()
        # up to conv 12
        self.model = nn.Sequential(
            conv_bn(3, 32 * alpha, 2),
            conv_dw(32 * alpha, 64 * alpha, 1),
            conv_dw(64 * alpha, 128 * alpha, 2),
            conv_dw(128 * alpha, 128 * alpha, 1),
            conv_dw(128 * alpha, 256 * alpha, 2),
            conv_dw(256 * alpha, 256 * alpha, 1),
            conv_dw(256 * alpha, 512 * alpha, 2),
            conv_dw(512 * alpha, 512 * alpha, 1),
            conv_dw(512 * alpha, 512 * alpha, 1),
            conv_dw(512 * alpha, 512 * alpha, 1),
            conv_dw(512 * alpha, 512 * alpha, 1),
            conv_dw(512 * alpha, 512 * alpha, 1),
        )
        logging.info("Initializing weights of base net")
        self._initialize_weights()

    # self.fc = nn.Linear(1024, num_classes)
    def _initialize_weights(self):
        """
        Returns:
            initialized weights of the model
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def forward(self, x):
        """
        Arguments:
            x : a tensor which is used as input for the model
        Returns:
            a tensor which is output of the model
        """
        x = self.model(x)
        return x

class SSD(nn.Module):
    def __init__(self, num_classes, batch_size, alpha=1, is_test=False, config=None, device=None):
        """
        Arguments:
            num_classes : an int variable having value of total number of classes
            batch_size : an int variable having value of batch size
            alpha : a float used as width multiplier for channels of model
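            is_test : a bool used to make model ready for testing
            config : an object holding the configuration parameters (priors, center_variance, size_variance)
            device : torch.device on which the priors are placed; defaults to cuda:0 if available, else cpu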
        """
        super(SSD, self).__init__()
        # Decoder
        self.is_test = is_test
        self.config = config
        self.num_classes = num_classes
        if device:
            self.device = device
        else:
            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        if is_test:
            self.config = config
            self.priors = config.priors.to(self.device)
        self.conv13 = conv_dw(512 * alpha, 1024 * alpha, 2)  # not using conv14 as mentioned in paper
        self.bottleneck_lstm1 = BottleneckLSTM(input_channels=1024 * alpha, hidden_channels=256 * alpha, height=10,
                                               width=10, batch_size=batch_size)

        self.fmaps_1 = nn.Sequential(
            nn.Conv2d(in_channels=int(256 * alpha), out_channels=int(128 * alpha), kernel_size=1),
            nn.ReLU6(inplace=True),
            SeperableConv2d(in_channels=128 * alpha, out_channels=256 * alpha, kernel_size=3, stride=2, padding=1),
        )
        self.fmaps_2 = nn.Sequential(
            nn.Conv2d(in_channels=int(256 * alpha), out_channels=int(64 * alpha), kernel_size=1),
            nn.ReLU6(inplace=True),
            SeperableConv2d(in_channels=64 * alpha, out_channels=128 * alpha, kernel_size=3, stride=2, padding=1),
        )
        self.fmaps_3 = nn.Sequential(
            nn.Conv2d(in_channels=int(128 * alpha), out_channels=int(64 * alpha), kernel_size=1),
            nn.ReLU6(inplace=True),
            SeperableConv2d(in_channels=64 * alpha, out_channels=128 * alpha, kernel_size=3, stride=2, padding=1),
        )
        self.fmaps_4 = nn.Sequential(
            nn.Conv2d(in_channels=int(128 * alpha), out_channels=int(32 * alpha), kernel_size=1),
            nn.ReLU6(inplace=True),
            SeperableConv2d(in_channels=32 * alpha, out_channels=64 * alpha, kernel_size=3, stride=2, padding=1),
        )
        self.regression_headers = nn.ModuleList([
            SeperableConv2d(in_channels=512 * alpha, out_channels=6 * 4, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=256 * alpha, out_channels=6 * 4, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=256 * alpha, out_channels=6 * 4, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=128 * alpha, out_channels=6 * 4, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=128 * alpha, out_channels=6 * 4, kernel_size=3, padding=1),
            nn.Conv2d(in_channels=int(64 * alpha), out_channels=6 * 4, kernel_size=1),
        ])

        self.classification_headers = nn.ModuleList([
            SeperableConv2d(in_channels=512 * alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=256 * alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=256 * alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=128 * alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
            SeperableConv2d(in_channels=128 * alpha, out_channels=6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(in_channels=int(64 * alpha), out_channels=6 * num_classes, kernel_size=1),
        ])

        logging.info("Initializing weights of SSD")
        self._initialize_weights()

    def _initialize_weights(self):
        """
        Returns:
            initialized weights of the model
        """
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_uniform_(m.weight)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def compute_header(self, i, x):  # ssd method to calculate headers
        """
        Arguments:
            i : an int used to use particular classification and regression layer
            x : a tensor used as input to layers
        Returns:
            locations and confidences of the predictions
        """
        confidence = self.classification_headers[i](x)
        confidence = confidence.permute(0, 2, 3, 1).contiguous()
        confidence = confidence.view(confidence.size(0), -1, self.num_classes)

        location = self.regression_headers[i](x)
        location = location.permute(0, 2, 3, 1).contiguous()
        location = location.view(location.size(0), -1, 4)

        return confidence, location

    def forward(self, x, h, c):
        """
        Arguments:
            x : a tensor which is used as input for the model
            h : hidden state tensor of the Bottleneck LSTM layer
            c : cell state tensor of the Bottleneck LSTM layer
        Returns:
            confidences, locations and the updated h and c during training
            or
            confidences, boxes and the updated h and c during testing
        """
        confidences = []
        locations = []
        header_index = 0
        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        x = self.conv13(x)
        #x = self.bottleneck_lstm1(x)
        #h, c = self.bottleneck_lstm1.init_hidden()

        x,h,c = self.bottleneck_lstm1(x,h,c)

        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        x = self.fmaps_1(x)
        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        x = self.fmaps_2(x)
        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        x = self.fmaps_3(x)
        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        x = self.fmaps_4(x)
        confidence, location = self.compute_header(header_index, x)
        header_index += 1
        confidences.append(confidence)
        locations.append(location)
        confidences = torch.cat(confidences, 1)
        locations = torch.cat(locations, 1)

        if self.is_test:  # while testing convert locations to boxes
            confidences = F.softmax(confidences, dim=2)
            boxes = box_utils.convert_locations_to_boxes(
                locations, self.priors, self.config.center_variance, self.config.size_variance
            )
            boxes = box_utils.center_form_to_corner_form(boxes)
            return confidences, boxes, h, c
        else:
            return confidences, locations, h, c

class MobileVOD(nn.Module):
    """
        Module to join encoder and decoder of predictor model
    """

    def __init__(self, pred_enc, pred_dec):
        """
        Arguments:
            pred_enc : an object of MobilenetV1 class
            pred_dec : an object of SSD class
        """
        super(MobileVOD, self).__init__()
        self.pred_encoder = pred_enc
        self.pred_decoder = pred_dec

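    def forward(self, seq, h, c):
        """
        Arguments:
            seq : a tensor used as input to the model
            h : hidden state tensor passed to the decoder's LSTM layer
            c : cell state tensor passed to the decoder's LSTM layer
        Returns:
            confidences and locations of predictions made by model, plus the updated h and c
        """
        x = self.pred_encoder(seq)
        confidences, locations, h, c = self.pred_decoder(x, h, c)
        return confidences, locations, h, c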

    def detach_hidden(self, h, c):
        """
        Detaches the given hidden state and cell state from the graph in place
        """
        h.detach_()
        c.detach_()
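
The `detach_hidden` method above cuts the recurrent state out of the autograd graph. Below is a minimal sketch of how that can fit into a training loop, assuming a hypothetical `ToyCell` in place of the real model and a placeholder loss. Summing the loss over a sequence and calling `backward` once is an illustrative choice here, not the method used by the training script in this thread. The state is detached before the next sequence so the graph does not keep growing.

```python
import torch
import torch.nn as nn


class ToyCell(nn.Module):
    """Hypothetical recurrent cell; stands in for the detector for illustration only."""

    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x, h, c):
        gate = torch.sigmoid(self.mix(torch.cat((x, h), dim=1)))
        c = gate * c + (1 - gate) * x
        h = torch.tanh(c)
        return h, c


torch.manual_seed(0)
cell = ToyCell(channels=4)
optimizer = torch.optim.SGD(cell.parameters(), lr=0.01)

h = torch.zeros(2, 4, 5, 5)
c = torch.zeros(2, 4, 5, 5)

for sequence in range(3):                 # three short sequences
    frames = torch.randn(10, 2, 4, 5, 5)  # 10 frames, batch of 2
    optimizer.zero_grad()
    loss = 0.0
    for frame in frames:
        h, c = cell(frame, h, c)
        loss = loss + h.pow(2).mean()     # placeholder loss on the output
    loss.backward()                       # one backward pass per sequence
    optimizer.step()
    # Detach so the next sequence does not backpropagate into this one.
    h = h.detach()
    c = c.detach()
```

With the state detached at each sequence boundary, the cost of every backward pass is bounded by the unroll length rather than by the number of frames seen so far.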

Mindbooom commented 5 years ago

'''train_mvod_lstm1_multigpu.py'''

#!/usr/bin/python3
"""Script for training the MobileVOD with one Bottleneck LSTM layer. As in MobileNet, here we use depthwise separable convolutions
for reducing the computation without affecting accuracy much. Model is trained on Imagenet VID 2015 dataset.
Here we unroll the LSTM for 10 steps and give 10 consecutive frames of video as input.
Few global variables defined here are explained:
Global Variables
----------------
args : dict
    Has all the options for changing various variables of the model as well as hyper-parameters for training.
dataset : VIDDataset (torch.utils.data.Dataset, For more info see datasets/vid_dataset.py)
optimizer : optim.RMSprop
scheduler : CosineAnnealingLR, MultiStepLR (torch.optim.lr_scheduler)
config : mobilenetv1_ssd_config (See config/mobilenetv1_ssd_config.py for more info, where you can change input size and ssd priors)
loss : MultiboxLoss (See network/multibox_loss.py for more info)
how to run: python train_mvod_lstm1_multigpu.py --datasets /home/ILSVRC2015 --cache_path=../cache --batch_size 10 --num_epochs 30 --pretrained ./models/basenet/WM-1.0-Epoch-3-Loss-5.234554548599229.pth --width_mult 1 --freeze_net

"""
import argparse
import os
import logging
import sys
import itertools

import torch
from torch.utils.data import DataLoader, ConcatDataset
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

from utils.misc import str2bool, Timer, store_labels
from network.mvod_bottleneck_lstm1_multigpu import MobileVOD, SSD, MobileNetV1, MatchPrior
from datasets.vid_dataset_new import VIDDataset
from network.multibox_loss import MultiboxLoss
from config import mobilenetv1_ssd_config
from dataloaders.data_preprocessing import TrainAugmentation, TestTransform

parser = argparse.ArgumentParser(
    description='Mobile Video Object Detection (Bottleneck LSTM) Training With Pytorch')

parser.add_argument('--datasets', help='Dataset directory path')
parser.add_argument('--cache_path', help='Cache directory path')
parser.add_argument('--freeze_net', action='store_true',
                    help="Freeze all the layers except the prediction head.")
parser.add_argument('--width_mult', default=1.0, type=float,
                    help='Width multiplier')

# Params for SGD
parser.add_argument('--lr', '--learning-rate', default=0.0003, type=float,
                    help='initial learning rate')
parser.add_argument('--momentum', default=0.9, type=float,
                    help='Momentum value for optim')
parser.add_argument('--weight_decay', default=5e-4, type=float,
                    help='Weight decay for SGD')
parser.add_argument('--gamma', default=0.1, type=float,
                    help='Gamma update for SGD')
parser.add_argument('--base_net_lr', default=None, type=float,
                    help='initial learning rate for base net.')
parser.add_argument('--ssd_lr', default=None, type=float,
                    help='initial learning rate for the layers not in base net and prediction heads.')

# Params for loading pretrained basenet or checkpoints.
parser.add_argument('--pretrained', help='Pre-trained model')
parser.add_argument('--resume', default=None, type=str,
                    help='Checkpoint state_dict file to resume training from')

# Scheduler
parser.add_argument('--scheduler', default="multi-step", type=str,
                    help="Scheduler for SGD. It can be one of multi-step and cosine")

# Params for Multi-step Scheduler
parser.add_argument('--milestones', default="80,100", type=str,
                    help="milestones for MultiStepLR")

# Params for Cosine Annealing
parser.add_argument('--t_max', default=120, type=float,
                    help='T_max value for Cosine Annealing Scheduler.')

# Train params
parser.add_argument('--batch_size', default=1, type=int,
                    help='Batch size for training')
parser.add_argument('--num_epochs', default=200, type=int,
                    help='Number of training epochs')
parser.add_argument('--num_workers', default=4, type=int,
                    help='Number of workers used in dataloading')
parser.add_argument('--validation_epochs', default=1, type=int,
                    help='the number epochs')
parser.add_argument('--debug_steps', default=100, type=int,
                    help='Set the debug log output frequency.')
parser.add_argument('--sequence_length', default=10, type=int,
                    help='sequence_length of video to unfold')
parser.add_argument('--use_cuda', default=True, type=str2bool,
                    help='Use CUDA to train model')

parser.add_argument('--checkpoint_folder', default='models/',
                    help='Directory for saving checkpoint models')

logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
args = parser.parse_args()
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() and args.use_cuda else "cpu")

if args.use_cuda and torch.cuda.is_available():
    torch.backends.cudnn.benchmark = True
    logging.info("Use Cuda.")

def train(loader, net, criterion, optimizer, device, hidden_state,cell_state,
          debug_steps=100, epoch=-1, sequence_length=10,
          ):
    """ Train model
    Arguments:
        loader : training data loader object
        net : object of MobileVOD class
        criterion : Loss function to use
        optimizer : optimizer to optimize model
        device : device on which computation is done
        hidden_state : initial hidden state tensor fed to the LSTM layer
        cell_state : initial cell state tensor fed to the LSTM layer
        debug_steps : number of steps between two log messages
        epoch : current epoch number
        sequence_length : unroll length of model
    """
    net.train(True)
    running_loss = 0.0
    running_regression_loss = 0.0
    running_classification_loss = 0.0
    for i, data in enumerate(loader):
        images, boxes, labels = data
        for image, box, label in zip(images, boxes, labels):
            image = image.to(device)
            box = box.to(device)
            label = label.to(device)

            optimizer.zero_grad()
            confidence, locations, h, c = net(image, hidden_state, cell_state)
            regression_loss, classification_loss = criterion(confidence, locations, label, box)  # TODO CHANGE BOXES
            loss = regression_loss + classification_loss
            loss.backward(retain_graph=True)
            optimizer.step()

            running_loss += loss.item()
            running_regression_loss += regression_loss.item()
            running_classification_loss += classification_loss.item()
            hidden_state = h
            cell_state = c
        net.detach_hidden(hidden_state, cell_state)
        if i and i % debug_steps == 0:
            avg_loss = running_loss / (debug_steps * sequence_length)
            avg_reg_loss = running_regression_loss / (debug_steps * sequence_length)
            avg_clf_loss = running_classification_loss / (debug_steps * sequence_length)
            logging.info(
                f"Epoch: {epoch}, Step: {i}, " +
                f"Average Loss: {avg_loss:.4f}, " +
                f"Average Regression Loss {avg_reg_loss:.4f}, " +
                f"Average Classification Loss: {avg_clf_loss:.4f}"
            )
            running_loss = 0.0
            running_regression_loss = 0.0
            running_classification_loss = 0.0
    net.detach_hidden()

def val(loader, net, criterion, device):
    """ Validate model
    Arguments:
        net : object of MobileVOD class
        loader : validation data loader object
        criterion : Loss function to use
        device : device on which computation is done
    Returns:
        loss, regression loss, classification loss
    """
    net.eval()
    running_loss = 0.0
    running_regression_loss = 0.0
    running_classification_loss = 0.0
    num = 0
    for _, data in enumerate(loader):
        images, boxes, labels = data
        for image, box, label in zip(images, boxes, labels):
            image = image.to(device)
            box = box.to(device)
            label = label.to(device)
            num += 1

            with torch.no_grad():
                confidence, locations = net(image)
                regression_loss, classification_loss = criterion(confidence, locations, label, box)
                loss = regression_loss + classification_loss

            running_loss += loss.item()
            running_regression_loss += regression_loss.item()
            running_classification_loss += classification_loss.item()
        net.detach_hidden()
    return running_loss / num, running_regression_loss / num, running_classification_loss / num

def initialize_model(net):
    """ Loads learned weights from pretrained checkpoint model
    Arguments:
        net : object of MobileVOD
    """
    if args.pretrained:
        logging.info("Loading weights from pretrained network")
        pretrained_net_dict = torch.load(args.pretrained)
        model_dict = net.state_dict()
        # 1. filter out unnecessary keys
        pretrained_dict = {k: v for k, v in pretrained_net_dict.items() if
                           k in model_dict and model_dict[k].shape == pretrained_net_dict[k].shape}
        # 2. overwrite entries in the existing state dict
        model_dict.update(pretrained_dict)
        net.load_state_dict(model_dict)

if __name__ == '__main__':
    timer = Timer()

    logging.info(args)
    config = mobilenetv1_ssd_config  # config file for priors etc.
    train_transform = TrainAugmentation(config.image_size, config.image_mean, config.image_std)
    target_transform = MatchPrior(config.priors, config.center_variance,
                                  config.size_variance, 0.5)

    test_transform = TestTransform(config.image_size, config.image_mean, config.image_std)

    logging.info("Prepare training datasets.")
    train_dataset = VIDDataset(args.datasets, args.cache_path, transform=train_transform,
                               target_transform=target_transform, batch_size=args.batch_size)
    label_file = os.path.join("models/", "vid-model-labels.txt")
    store_labels(label_file, train_dataset._classes_names)
    num_classes = len(train_dataset._classes_names)
    logging.info(f"Stored labels into file {label_file}.")
    logging.info("Train dataset size: {}".format(len(train_dataset)))
    train_loader = DataLoader(train_dataset, args.batch_size,
                              num_workers=args.num_workers,
                              shuffle=True)
    # logging.info("Prepare Validation datasets.")
    # val_dataset = VIDDataset(args.datasets, args.cache_path, transform=test_transform,
    #                            target_transform=target_transform, is_val=True)
    # logging.info(val_dataset)
    # logging.info("validation dataset size: {}".format(len(val_dataset)))

    # val_loader = DataLoader(val_dataset, args.batch_size,
    #                       num_workers=args.num_workers,
    #                       shuffle=False)
    # num_classes = 30
    logging.info("Build network.")
    pred_enc = MobileNetV1(num_classes=num_classes, alpha=args.width_mult)
    pred_dec = SSD(num_classes=num_classes, batch_size=args.batch_size, alpha=args.width_mult, is_test=False)
    if args.resume is None:
        net = MobileVOD(pred_enc, pred_dec)
        initialize_model(net)
    else:
        net = MobileVOD(pred_enc, pred_dec)
        print("Updating weights from resume model")
        net.load_state_dict(
            torch.load(args.resume,
                       map_location=lambda storage, loc: storage))

    min_loss = -10000.0
    last_epoch = -1

    base_net_lr = args.base_net_lr if args.base_net_lr is not None else args.lr
    ssd_lr = args.ssd_lr if args.ssd_lr is not None else args.lr
    # multi-GPU

    if args.freeze_net:
        logging.info("Freeze net.")
        for param in pred_enc.parameters():
            param.requires_grad = False
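        # Setting requires_grad on a Module has no effect; freeze its parameters instead.
        for param in net.pred_decoder.conv13.parameters():
            param.requires_grad = False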

    criterion = MultiboxLoss(config.priors, iou_threshold=0.5, neg_pos_ratio=10,
                             center_variance=0.1, size_variance=0.2, device=DEVICE)
    optimizer = torch.optim.RMSprop(
        [{'params': [param for name, param in net.pred_encoder.named_parameters()], 'lr': base_net_lr},
         {'params': [param for name, param in net.pred_decoder.named_parameters()], 'lr': ssd_lr}, ], lr=args.lr,
        weight_decay=args.weight_decay, momentum=args.momentum)
    logging.info(f"Learning rate: {args.lr}, Base net learning rate: {base_net_lr}, "
                 + f"Extra Layers learning rate: {ssd_lr}.")

    # if args.scheduler == 'multi-step':
    #   logging.info("Uses MultiStepLR scheduler.")
    #   milestones = [int(v.strip()) for v in args.milestones.split(",")]
    #   scheduler = MultiStepLR(optimizer, milestones=milestones,
    #                                                gamma=0.1, last_epoch=last_epoch)
    # elif args.scheduler == 'cosine':
    #   logging.info("Uses CosineAnnealingLR scheduler.")
    #   scheduler = CosineAnnealingLR(optimizer, args.t_max, last_epoch=last_epoch)
    # else:
    #   logging.fatal(f"Unsupported Scheduler: {args.scheduler}.")
    #   parser.print_help(sys.stderr)
    #   sys.exit(1)
    #net = torch.nn.DataParallel(net, device_ids=[0, 1, 2, 3]).cuda()
    net.to(DEVICE)
    output_path = os.path.join(args.checkpoint_folder, "lstm1_multigpu")
    os.makedirs(output_path, exist_ok=True)
    logging.info(f"Start training from epoch {last_epoch + 1}.")
    for epoch in range(last_epoch + 1, args.num_epochs):
        # scheduler.step()
        h, c = net.pred_decoder.bottleneck_lstm1.init_hidden()
        train(train_loader, net, criterion, optimizer,
              device=DEVICE, debug_steps=args.debug_steps, epoch=epoch, sequence_length=args.sequence_length,
              hidden_state=h, cell_state=c)

        if epoch % args.validation_epochs == 0 or epoch == args.num_epochs - 1:
            # val_loss, val_regression_loss, val_classification_loss = val(val_loader, net, criterion, DEVICE)
            # logging.info(
            #   f"Epoch: {epoch}, " +
            #   f"Validation Loss: {val_loss:.4f}, " +
            #   f"Validation Regression Loss {val_regression_loss:.4f}, " +
            #   f"Validation Classification Loss: {val_classification_loss:.4f}"
            # )
            model_path = os.path.join(output_path, f"WM-{args.width_mult}-Epoch-{epoch}.pth")
            torch.save(net.state_dict(), model_path)
            logging.info(f"Saved model {model_path}")

samanthawyf commented 5 years ago

Hi @Mindbooom, @vikrant7, have you trained the basenet? I am wondering whether the models provided in the repo were fully trained. Also, does your multi-GPU code work now?

Mindbooom commented 5 years ago

@samanthawyf I haven't trained the basenet; I just used the epoch-3 basenet provided by @vikrant7. The mAP I get from evaluate.py is 43%, which is 10% lower than in the paper. I may try to train the basenet once I know how to use multiple GPUs efficiently. The multi-GPU code posted above works now, but at about 1/8 of the training speed of a single GPU. I don't know why, because on one GPU it runs at the same speed as the original code. If you are interested, please try this code on your own multi-GPU machine and help us improve the speed.
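
One way to narrow down a slowdown like this is to measure, per step, how long the script waits for the DataLoader versus how long the forward/backward pass takes. The sketch below is a hypothetical helper: `profile_steps`, `step_fn`, and `sync` are names made up for illustration and are not part of this repo. It assumes the training step can be wrapped in a callable. On CUDA, kernels run asynchronously, so `sync` would typically be `torch.cuda.synchronize` to make the timings meaningful.

```python
import time


def profile_steps(loader, step_fn, max_steps=50, sync=None):
    """Split wall time into 'waiting for data' and 'running the step'.

    loader  : any iterable yielding batches (e.g. a DataLoader)
    step_fn : hypothetical callable running forward/backward on one batch
    sync    : optional callable that blocks until device work is done
              (assumption: torch.cuda.synchronize when training on GPU)
    """
    data_time = 0.0
    compute_time = 0.0
    steps = 0
    iterator = iter(loader)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        if sync is not None:
            sync()  # wait for queued device work before reading the clock
        t2 = time.perf_counter()
        data_time += t1 - t0
        compute_time += t2 - t1
        steps += 1
    return {"steps": steps, "data_s": data_time, "compute_s": compute_time}


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a GPU or dataset.
    fake_loader = range(5)
    stats = profile_steps(fake_loader, step_fn=lambda batch: sum(range(1000)))
    print(stats)
```

If `data_s` dominates, the bottleneck is in loading rather than in the model. If `compute_s` grows only in the multi-GPU run, one thing to check is whether the batch dimension is large enough to be split, since `DataParallel` splits its inputs along dimension 0.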

vikrant7 commented 5 years ago

Hi @Mindbooom, thanks for sharing the multi-GPU script. I am now free to work actively on this project.

Mindbooom commented 5 years ago

@vikrant7 Bro, I have a question about evaluation. Running evaluate.py with the epoch-2 basenet, I got an mAP of 43%. [image] But there are two strange results. First, when I evaluate the uploaded epoch-2 lstm1, the mAP is almost 0. [image]

Second, after that poor result, I trained an lstm1 (epoch 0) using the uploaded basenet, and the mAP is 21%. It seems that adding lstm1 weakens the training result of the basenet. [image] Can you help me resolve these questions? Thank you very much. P.S. The multi-GPU script above leads to a DataLoader error, and I'm trying to fix it. [image]

petinhoss7 commented 4 years ago

@Mindbooom I got almost the same results after training lstm1 with the provided basenet. Do you know why we got those results?
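
One thing that may be worth ruling out (an assumption, not a confirmed diagnosis): `initialize_model` in the script above silently skips every checkpoint entry whose key or shape does not match the model, so a naming mismatch would leave layers at their random initialization without any error. For instance, a state dict saved from a model wrapped in `torch.nn.DataParallel` has its keys prefixed with `module.`. The sketch below uses a hypothetical helper, `match_checkpoint_keys`, that works on plain `{name: shape}` dicts so it runs without PyTorch. With a real model, the shapes could be gathered with `{k: tuple(v.shape) for k, v in net.state_dict().items()}`.

```python
def match_checkpoint_keys(model_shapes, ckpt_shapes, prefix="module."):
    """Report which checkpoint entries would be loaded and which skipped.

    model_shapes, ckpt_shapes : dicts mapping parameter name -> shape tuple
    prefix : key prefix stripped from checkpoint names before matching
    """
    loaded, skipped = [], []
    for name, shape in ckpt_shapes.items():
        key = name[len(prefix):] if name.startswith(prefix) else name
        if key in model_shapes and model_shapes[key] == shape:
            loaded.append(key)
        else:
            skipped.append(name)
    loaded_set = set(loaded)
    # Model parameters that no checkpoint entry would fill.
    missing = [k for k in model_shapes if k not in loaded_set]
    return loaded, skipped, missing


if __name__ == "__main__":
    # Made-up names and shapes, for illustration only.
    model = {"conv.weight": (32, 3, 3, 3), "lstm.W.weight": (256, 1024, 3, 3)}
    ckpt = {"module.conv.weight": (32, 3, 3, 3),
            "module.lstm.W.weight": (256, 512, 3, 3)}
    loaded, skipped, missing = match_checkpoint_keys(model, ckpt)
    print("loaded :", loaded)
    print("skipped:", skipped)
    print("missing:", missing)
```

A non-empty `missing` list that includes the LSTM layers would mean those layers were trained or evaluated starting from untrained weights.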