jeffreyyihuang / two-stream-action-recognition

Using two stream architecture to implement a classic action recognition method on UCF101 dataset
MIT License
856 stars 250 forks

Improvement on motion-cnn result: 84.1% on split-1, with VGG-16 #39

Open gaosh opened 6 years ago

gaosh commented 6 years ago

Hi, all

I did some investigation into why the motion-cnn result is much lower than in the original paper. After a simple modification, I am able to achieve 84.1% top-1 accuracy. The modification is adding transforms.FiveCrop() to the transformation; before it, the result is only 80.5%. I use the pretrained model from https://github.com/feichtenhofer/twostreamfusion. I think further improvement can be achieved with transforms.TenCrop().

I think this modification can bridge the performance gap between two-stream models trained in PyTorch and those trained in other frameworks.
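For reference, a minimal sketch of what a ten-crop test-time transform could look like with torchvision (the ImageNet normalization values are the ones used elsewhere in this repo; treat this as an illustration rather than the exact code I used):

import torch
import torchvision.transforms as transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
# TenCrop returns the five crops plus their horizontal flips
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack(
        [normalize(transforms.ToTensor()(crop)) for crop in crops])),
])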

imnotk commented 6 years ago

I have a problem with the accuracy. When I only use a center crop (224, 224) with 25 sampled frames, I can get about 80% on the RGB modality, but when I use five-crop or ten-crop, my accuracy decreases a lot regardless of the CNN (ResNet, Inception-v1, Inception-v2). Can you explain why?

gaosh commented 6 years ago

You should use this data augmentation during training to get the desired results.


gaosh commented 6 years ago

You can refer to the related papers: during training they use extensive data augmentation such as multi-scale and corner cropping. The author of this project only used very simple data augmentation.
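The details differ from paper to paper, but a rough sketch of multi-scale corner cropping as a custom transform might look like this (the class name, scale list, and composition below are illustrative, not taken from this repo or the papers):

import random
from PIL import Image
import torchvision.transforms as transforms

class MultiScaleCornerCrop(object):
    """Illustrative multi-scale corner crop: pick a crop size from a few
    scales and a position from the four corners or the center, then resize
    to the network input size."""

    def __init__(self, scales=(256, 224, 192, 168), out_size=224):
        self.scales = scales
        self.out_size = out_size

    def __call__(self, img):
        w, h = img.size
        crop = min(random.choice(self.scales), w, h)
        # candidate top-left corners: four image corners plus the center
        positions = [(0, 0), (w - crop, 0), (0, h - crop), (w - crop, h - crop),
                     ((w - crop) // 2, (h - crop) // 2)]
        x, y = random.choice(positions)
        img = img.crop((x, y, x + crop, y + crop))
        return img.resize((self.out_size, self.out_size), Image.BILINEAR)

# example: drop it in before the usual tensor conversion and normalization
train_transform = transforms.Compose([
    MultiScaleCornerCrop(),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])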

imnotk commented 6 years ago

I use the augmentation on the train split, but when I use it on the test split the accuracy is far below 80%, while using only a center crop on the test split gives almost 80%. Maybe there is some difference between TF and PyTorch.

gaosh commented 6 years ago

That's weird. You can try the two models I converted from the project of their paper, https://github.com/feichtenhofer/twostreamfusion; the link for the models is https://drive.google.com/file/d/1JydxdPMEHU7uJnRyi8A8uF82jSgE9FGe/view?usp=sharing. They are VGG-16 models.

sxzy commented 6 years ago

I have chosen the "Only Testing" option, but the output shows that it still trains on the data. It is so weird. Can anyone give me some tips?

gaosh commented 6 years ago

What results did you get? You'd better open a new issue to discuss this.

sxzy commented 5 years ago


@gaosh Hello, I am ready to add the FiveCrop trick you mentioned, but I am confused. This is the official docs' way to use FiveCrop:

transform = Compose([
    FiveCrop(size),  # this is a list of PIL Images
    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops]))  # returns a 4D tensor
])

and I am confused because in the code we already do some augmentation like

      training_set = spatial_dataset(dic=self.dic_training, root_dir=self.data_path, mode='train', transform = transforms.Compose([
                transforms.RandomCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
                ]))

and I wonder how to add FiveCrop into it?

gaosh commented 5 years ago

I think you need to use a lambda expression, for example:

transforms.Compose([
    transforms.Resize(256),
    transforms.FiveCrop([224, 224]),
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))  # ToTensor first: Normalize expects a tensor, not a PIL crop
        for crop in crops]))
])
sxzy commented 5 years ago


Thank you, you are really great.

sxzy commented 5 years ago


I also wonder how high your accuracy is on the spatial part. I have tried training the spatial part with a pretrained VGG-16, but the result is not very satisfying; I only get around 72%.

gaosh commented 5 years ago

I used trained models from other projects; I provided the link to the converted PyTorch models in earlier comments in this issue. When testing with five crops / a center crop, I can achieve around 82%/78% accuracy with the spatial part.

sxzy commented 5 years ago


As you mentioned, you used models trained in another project, two-stream fusion. Can you share your PyTorch code for it? I noticed that you shared the pretrained model, which is appreciated, and I wonder if you can also share the code for the two-stream fusion project. I have read the paper, but the implementation is complicated for me right now, so I would appreciate it if you could share your PyTorch code.

gaosh commented 5 years ago

Right now, I may not have time to share my code, but after the CVPR deadline I will refine the code related to this project and make it publicly available. Regarding two-stream fusion, I didn't implement their code in PyTorch; I just converted their pretrained model to PyTorch.

sxzy commented 5 years ago


OK, looking forward to the new post, and good luck with CVPR.

duygusar commented 5 years ago

How do you achieve accuracy around 80%? When I train the network, the validation loss oscillates and never really improves. What is the measure of accuracy here? Like @sxzy, I can't even run test-only mode. We cannot use the validation set as-is to pass as a test set, because the parameters are updated on it (so it is part of training). I get these problems (both during training and when trying to test only) even when I use the pretrained model and resume from it.

duygusar commented 5 years ago

@gaosh Have you also augmented the motion data? The authors do not, and I would assume it is unwise to do so because we would lose motion information; alas, I need to reduce overfitting.

gaosh commented 5 years ago

@duygusar The motion data is also augmented. The authors of several early action recognition papers suggest augmenting motion data, since the models tend to overfit if no augmentation is applied. I am quite certain that corner cropping will improve the results: I use corner cropping and achieve 59.9% accuracy on HMDB-51, while without it it's around 57.3%. The results are based on a model pretrained on ImageNet.

duygusar commented 5 years ago

@gaosh Thanks. I used RandomCrop for training the motion data (and CenterCrop for the evaluation data) and then normalized the data to [0, 1]. Now I don't get wild swings in my validation loss, but the precision I get is 60-70% (ResNet, on the first 6 classes of UCF101, which should be much higher than UCF101 overall; the subset is small but balanced enough to train without overfitting). Isn't your UCF101 accuracy (around 80%) overfitting? When I run the code as-is, I do get 80% and above (for 6 classes), but the network does not really converge, and it would be a misleading measure without handling the cyclical swings of the validation loss and the overfitting, no?

gaosh commented 5 years ago

@duygusar You don't have to worry too much about overfitting at the beginning. 60-70% accuracy is lower than expected, and I think it's unrelated to overfitting; just train longer and track the changes in the training loss. Also, if you use small models like ResNet-18, the final performance will be lower than the results reported in this repo.

duygusar commented 5 years ago

@gaosh When I shuffle the evaluation set, I get low accuracy; it is around 80% but overfitting when I don't shuffle (in the repository it is not shuffled). I can tell that it overfits because the validation loss just won't go down after a while, and it definitely does not converge even with smaller learning rates. By the way, in the repository I think "test" refers to the evaluation set, is that correct? The evaluation set is not partitioned from the training set, right? Skimming through the code, I think "test" actually refers to the evaluation set, and if you needed an actual test you would have to replace the test split with a new one (with unseen examples). I found it peculiar and wanted to make sure I am correct about this. So I am confused about the reported accuracy, because there is no real test split. Is the accuracy in the README the validation accuracy?

gaosh commented 5 years ago

@duygusar The validation set in this code is different from the training set. I am not sure why you need to shuffle the validation set, but shuffling should not affect performance.

duygusar commented 5 years ago

@gaosh You are right, I don't need to shuffle since it is irrelevant, but it does change the performance and I don't know why. The overfitting remains either way (validation accuracy might be high, but the validation loss does not converge), and I think the reported performance might be on the validation set.

gaosh commented 5 years ago

@duygusar If the validation loss first goes down and then goes up, it may be related to overfitting. However, if the validation loss goes down and stays at a certain value, that is common, even if the value is higher than the training loss.

DoubleYing commented 5 years ago

I train the model with a pretrained ResNet-152, but I get only 30+% accuracy. I think that's too low, but I don't know how to improve it. I use OpenCV's open-source functions to get my flow images; could that be the cause of the low accuracy?

duygusar commented 5 years ago

@DoubleYing Have you changed the number of classes accordingly? UCF101 has 101 classes; how many classes does your dataset have? OpenCV's flow is not great, but I don't think it should make a huge difference.
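For reference, a minimal sketch of the usual way to adapt the class count with torchvision models (the 51-class value is just an example):

import torch.nn as nn
import torchvision.models as models

num_classes = 51  # e.g. HMDB-51; set this to your dataset's class count
model = models.resnet152(pretrained=True)
# replace the final fully connected layer so it predicts num_classes scores
model.fc = nn.Linear(model.fc.in_features, num_classes)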

DoubleYing commented 5 years ago

Yes, I have changed the number of classes, and now I'm considering changing the way I extract flow. If I get a good result later, I will note it here. Thanks for your answer.
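For reference, a rough sketch of Farneback flow extraction with OpenCV, writing u/v JPEGs in a layout similar to what this repo's motion dataloader reads (the function name, the clipping bound of 20, and the exact layout are assumptions, not taken from this repo):

import os
import cv2
import numpy as np

def extract_flow(video_path, out_dir, bound=20):
    # illustrative only: compute Farneback flow and save the horizontal (u)
    # and vertical (v) components as JPEGs named frame%06d.jpg
    name = os.path.splitext(os.path.basename(video_path))[0]
    u_dir = os.path.join(out_dir, 'u', name)
    v_dir = os.path.join(out_dir, 'v', name)
    for d in (u_dir, v_dir):
        if not os.path.isdir(d):
            os.makedirs(d)

    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()                      # assumes the video opens and has frames
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # clip to [-bound, bound] and rescale to 0-255 before saving as images
        u = np.clip(flow[..., 0], -bound, bound)
        v = np.clip(flow[..., 1], -bound, bound)
        u = ((u + bound) * 255.0 / (2 * bound)).astype(np.uint8)
        v = ((v + bound) * 255.0 / (2 * bound)).astype(np.uint8)
        cv2.imwrite(os.path.join(u_dir, 'frame%06d.jpg' % idx), u)
        cv2.imwrite(os.path.join(v_dir, 'frame%06d.jpg' % idx), v)
        prev = gray
        idx += 1
    cap.release()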

duygusar commented 5 years ago

@DoubleYing On my dataset, which should be fairly easy and balanced, I also get lower accuracies for motion. I also use cv2's Farneback flow (because it is easy and fast; I could switch to a coarse-to-fine method, though I prefer a faster algorithm, and I will skip the deep-learning one they used because I have limited time before a deadline :( ). Did you manage to improve your results? @gaosh Do you have any references for your changes to the motion-cnn part (especially the motion dataloader, but if possible the VGG modifications in the network part too)? I would really appreciate it if you could point me to your changes. With five random crops I have to handle a tuple of images instead of a PIL image (TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>), and I am confused about how to work around that in train/test; there are also the channels, and how to stack the five crops...

duygusar commented 5 years ago

@gaosh Using Lambda, I get this error at line 55, in stackopf, at flow[2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[224, 224]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (4)

When I try to set flow = torch.FloatTensor(5, 2*self.in_channel, self.img_rows, self.img_cols), I get, at motion_dataloader.py", line 55, in stackopf, at flow[:,2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[5, 224, 224]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)

When I multiply the training batch size by 5 in the returned value, I also get the same error.

gaosh commented 5 years ago

You also need to modify the code within motion_dataloader.py.

def stackopf(self, video_name, clip_idx, nb_clips=None):
        name = 'v_' + video_name
        u = self.flow_root_dir + 'u/' + name
        v = self.flow_root_dir + 'v/' + name

        if self.fiveCrops:
            self.ncrops = 5
        else:
            self.ncrops = 1

        flow = torch.FloatTensor(self.ncrops, 2 * self.in_channel, self.img_rows, self.img_cols)
        #i = int(self.clips_idx)
        i = clip_idx

        for j in range(self.in_channel):
            idx = i + j
            if self.mode == 'train':
                if idx >= nb_clips+1:
                    idx = nb_clips+1
            idx = str(idx)

            frame_idx = 'frame' + idx.zfill(6)
            h_image = u + '/' + frame_idx + '.jpg'
            v_image = v + '/' + frame_idx + '.jpg'

            imgH = (Image.open(h_image))
            imgV = (Image.open(v_image))

            H = self.flow_transform(imgH)
            V = self.flow_transform(imgV)

            if self.fiveCrops:
                flow[:, 2 * (j - 1), :, :] = H.squeeze()
                flow[:, 2 * (j - 1) + 1, :, :] = V.squeeze()
            else:
                flow[:, 2 * (j - 1), :, :] = H
                flow[:, 2 * (j - 1) + 1, :, :] = V

            imgH.close()
            imgV.close()

        return flow.squeeze()

Please also notice that the batch returned by the dataloader will have size (batch_size, n_crops, n_channels, height, width). You need to reshape it to (batch_size*n_crops, n_channels, height, width). You can check the official reference too.
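For example, the adjustment in the train/validation step could look roughly like this sketch (model, criterion, and the tensor shapes are assumptions based on the discussion above, not code from this repo):

def step_with_crops(model, criterion, data, label):
    """data: (batch_size, n_crops, 2*in_channel, 224, 224); label: (batch_size,)."""
    bs, ncrops, c, h, w = data.size()
    output = model(data.view(-1, c, h, w))        # fuse batch and crop dimensions
    output = output.view(bs, ncrops, -1).mean(1)  # average class scores over the crops
    return criterion(output, label)               # labels keep their original batch size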

duygusar commented 5 years ago

@gaosh Thank you so much for your time!! Really appreciated. How do you take care of the labels, though; shouldn't they also be replicated for training? I have made the edits, but I get the error below, which I believe is a mismatch with the labels (I use batch_size 6 now that I have 5 crops, so batch size * crops = 30).

duygusar commented 5 years ago

@gaosh This is how my data loader looks, by the way; I can't find what I am doing wrong.

import numpy as np
import pickle
from PIL import Image
import time
import shutil
import random
import argparse

from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as transforms
import torchvision.models as models
import torch.nn as nn
import torch
import torch.backends.cudnn as cudnn
from torch.autograd import Variable
from torch.optim.lr_scheduler import ReduceLROnPlateau

from split_train_test_video import *

class motion_dataset(Dataset):  
    def __init__(self, dic, in_channel, root_dir, mode, transform=None):
        #Generate a 16 Frame clip
        self.keys=dic.keys()
        self.values=dic.values()
        self.root_dir = root_dir
        self.transform = transform
        self.mode=mode
        self.in_channel = in_channel
        self.img_rows=224
        self.img_cols=224
        self.fiveCrops = True

    def stackopf(self,video_name, clip_idx, nb_clips=None):
        name = self.video
        u = self.root_dir+ 'u/' + name
        v = self.root_dir+ 'v/'+ name

        if self.fiveCrops:
            self.ncrops = 5
        else:
            self.ncrops = 1

        flow = torch.FloatTensor(self.ncrops,2*self.in_channel,self.img_rows,self.img_cols)
        #i = int(self.clips_idx)
        i = int(clip_idx)

        for j in range(self.in_channel):
            idx = i + j
            if self.mode == 'train':
                if idx >= nb_clips+1:
                    idx = nb_clips+1
            idx = str(idx)
            frame_idx = 'frame'+ idx.zfill(6) #6zeros for frame name
            h_image = u +'/' + frame_idx +'.jpg'
            v_image = v +'/' + frame_idx +'.jpg'

            imgH=(Image.open(h_image))
            imgV=(Image.open(v_image))

            H = self.transform(imgH)
            V = self.transform(imgV)

            if self.fiveCrops:
                flow[:, 2 * (j - 1), :, :] = H.squeeze()
                flow[:, 2 * (j - 1) + 1, :, :] = V.squeeze()
            else:
                flow[:, 2 * (j - 1), :, :] = H
                flow[:, 2 * (j - 1) + 1, :, :] = V

            imgH.close()
            imgV.close()  

        return flow.squeeze()

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        #print ('mode:',self.mode,'calling Dataset:__getitem__ @ idx=%d'%idx)
        nb_clips=0

        if self.mode == 'train':
            self.video, nb_clips = self.keys[idx].split('-')
            self.clips_idx = random.randint(1,int(nb_clips))
        elif self.mode == 'val':
            self.video,self.clips_idx = self.keys[idx].split('-')
        else:
            raise ValueError('There are only train and val mode')

        label = self.values[idx]
        label = int(label)-1 
        data = self.stackopf(self.video, self.clips_idx, int(nb_clips))
        #len(data)

        if self.mode == 'train':
            sample = (data,label)
        elif self.mode == 'val':
            sample = (self.video,data,label)
        else:
            raise ValueError('There are only train and val mode')
        return sample

class Motion_DataLoader():
    def __init__(self, BATCH_SIZE, num_workers, in_channel,  path, ucf_list, ucf_split):

        self.BATCH_SIZE=BATCH_SIZE
        self.num_workers = num_workers
        self.frame_count={}
        self.in_channel = in_channel
        self.data_path=path
        # split the training and testing videos
        splitter = UCF101_splitter(path=ucf_list,split=ucf_split)
        self.train_video, self.test_video = splitter.split_video()

    def load_frame_count(self):
        #print '==> Loading frame number of each video'
        with open('/media/d/DATA_2/two-stream-action-recognition-master/dataloader/dic/frame_count_j.pickle','rb') as file:
            dic_frame = pickle.load(file)
        file.close()

        for line in dic_frame :  #'v_Lunges_g07_c01.avi'
            #videoname = line.split('_',1)[1].split('.',1)[0]  #Lunges_g07_c01
            #n,g = videoname.split('_',1)
            #if n == 'HandStandPushups':
            #    videoname = 'HandstandPushups_'+ g
            self.frame_count[line]=dic_frame[line] 

    def run(self):
        self.load_frame_count()
        self.get_training_dic()
        self.val_sample19()
        train_loader = self.train()
        print len(train_loader.dataset)
        val_loader = self.val()

        return train_loader, val_loader, self.test_video

    def val_sample19(self):
        self.dic_test_idx = {}
        #print len(self.test_video)
        for video in self.test_video:   #Knot_Tying_D001_000041_000170     #ApplyEyeMakeup_g01_c01
            if self.frame_count[video]>27 and self.frame_count[video]<1200: # CHANGE
                #n,g = video.split('_',1)    #v_ApplyEyeMakeup_g01_c01.avi

                sampling_interval = int((self.frame_count[video]-10+1)/19)
                for index in range(19):
                    clip_idx = index*sampling_interval
                    key = video + '-' + str(clip_idx+1)
                    self.dic_test_idx[key] = self.test_video[video]

    def get_training_dic(self):
        self.dic_video_train={}
        for video in self.train_video:
            if self.frame_count[video]>27 and self.frame_count[video]<1200: # CHANGE!
                nb_clips = self.frame_count[video]-10+1
                key = video +'-' + str(nb_clips)
                self.dic_video_train[key] = self.train_video[video] 

    def train(self):
        training_set = motion_dataset(dic=self.dic_video_train, in_channel=self.in_channel, root_dir=self.data_path,
            mode='train',
            transform = transforms.Compose([
            transforms.Resize([256,256]),
            transforms.FiveCrop([224, 224]),
            transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(crop) for crop in crops]))
            #transforms.RandomCrop([224, 224]),
            #transforms.ToTensor(),
            #transforms.Normalize([0.5], [0.5])
            ]))
        print '==> Training data :',len(training_set),' videos',training_set[1][0].size()

        train_loader = DataLoader(
            dataset=training_set, 
            batch_size=self.BATCH_SIZE,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=True
            )
        return train_loader

    def val(self):
        validation_set = motion_dataset(dic= self.dic_test_idx, in_channel=self.in_channel, root_dir=self.data_path ,
            mode ='val',
            transform = transforms.Compose([
            transforms.Resize([256,256]),
        #transforms.CenterCrop([224, 224]),
            #transforms.ToTensor(),
            transforms.FiveCrop([224, 224]),
            transforms.Lambda(lambda crops: torch.stack([transforms.ToTensor()(crop) for crop in crops]))
            #transforms.Normalize([0.5], [0.5])
            ]))
        print '==> Validation data :',len(validation_set),' frames',validation_set[1][1].size()
        #print validation_set[1]

        val_loader = DataLoader(
            dataset=validation_set, 
            batch_size=self.BATCH_SIZE, 
            shuffle=True,
            num_workers=self.num_workers)

        return val_loader

if __name__ == '__main__':
    data_loader =Motion_DataLoader(BATCH_SIZE=1,num_workers=1,in_channel=10,
                                        path='/media/d/DATA_2/two-stream-action-recognition-master/J/flow/',
                                        ucf_list='/media/d/DATA_2/two-stream-action-recognition-master/J_list',
                                        ucf_split='01'
                                        )
    train_loader,val_loader,test_video = data_loader.run()
    #print train_loader,val_loader
gaosh commented 5 years ago

You can refer directly to the official reference; I will just paste it here:

>>> transform = Compose([
>>>    FiveCrop(size), # this is a list of PIL Images
>>>    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops])) # returns a 4D tensor
>>> ])
>>> #In your test loop you can do the following:
>>> input, target = batch # input is a 5d tensor, target is 2d
>>> bs, ncrops, c, h, w = input.size()
>>> result = model(input.view(-1, c, h, w)) # fuse batch size and ncrops
>>> result_avg = result.view(bs, ncrops, -1).mean(1) # avg over crops

Basically, you just average across ncrops, and you will have the same batch size as the labels.

duygusar commented 5 years ago

@gaosh Thanks, I did the averaging for the test part using the document you sent, but do you also average for training? I thought that for training, using FiveCrop (or TenCrop) we create more data and use all of it to train the model, and that for testing we take five or ten crops and average their predictions for the final decision (so we average only for testing). I have worked with Caffe before, and this is how a lot of models do the five/ten crops (with mirroring).

gaosh commented 5 years ago

@duygusar Yeah, you also need to average for training. Alternatively, you can expand the labels to match the shape of the model output. But I don't think there will be much difference; you can write down the cross-entropy loss for these two cases and compare them.
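A minimal sketch of the two options (the shapes and the cross-entropy call are assumptions for illustration; repeat_interleave needs a reasonably recent PyTorch):

import torch.nn.functional as F

def crop_losses(output, label, bs, ncrops):
    """output: (bs*ncrops, n_classes) scores; label: (bs,) class indices."""
    # option 1: average the scores over crops and keep the original labels
    loss_avg = F.cross_entropy(output.view(bs, ncrops, -1).mean(1), label)
    # option 2: repeat each label ncrops times to match the fused batch
    loss_rep = F.cross_entropy(output, label.repeat_interleave(ncrops))
    return loss_avg, loss_rep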

duygusar commented 5 years ago

@gaosh Thanks a lot! I really appreciate your help and comments :) It has helped a lot

duygusar commented 5 years ago

@gaosh I didn't find it very clear in the original paper: this repository does not really implement TSN for the motion CNN, only for the spatial network, correct? It just takes 10 random frames per video and stacks them, without a consensus between the stacks. In the original paper, they use TSN (one frame from each of 3 segments, then sending the learned scores to a consensus, so in a way the stacks are jointly learned), correct? I needed a sanity check here. Might this be why they don't achieve high accuracy on motion?

gaosh commented 5 years ago

@duygusar I think the author didn't fully implement TSN; he didn't use partial BN or dropout. He just combined some parts from the three papers he listed. When training the flow network, I think he sampled a start point and used the following 16 frames as the input to the flow net.
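For reference, a minimal sketch of TSN-style segmental consensus, i.e. averaging class scores over the sampled segments (this is an illustration, not code from this repo or the TSN authors):

import torch

def segmental_consensus(model, snippets):
    """snippets: (batch, k_segments, channels, 224, 224); one snippet per segment."""
    bs, k, c, h, w = snippets.size()
    scores = model(snippets.view(-1, c, h, w))   # score every snippet independently
    return scores.view(bs, k, -1).mean(dim=1)    # consensus = average over segments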

dangao250 commented 5 years ago

Brothers and sisters, could anyone send me a copy of the pretrained model package? The author's link doesn't work for me right now. Thank you all. 1337932153@qq.com

lalala666-creator commented 4 years ago

Excuse me, is there a TensorFlow version of the two-stream network?

honest2017 commented 4 years ago

Hi all, please tell me how the frame_count.pickle file is generated. I want to use HMDB-51. Thank you.
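I'm not sure how the author generated it, but a rough sketch of building such a pickle, assuming one folder of extracted JPEG frames per video (the path is a placeholder, and the dictionary keys must match whatever names the dataloader looks up):

import os
import pickle

frames_root = '/path/to/hmdb51_frames'  # one sub-folder of extracted JPEGs per video
frame_count = {}
for video in os.listdir(frames_root):
    folder = os.path.join(frames_root, video)
    # count the extracted frames for this video
    frame_count[video] = len([f for f in os.listdir(folder) if f.endswith('.jpg')])

with open('frame_count.pickle', 'wb') as f:
    pickle.dump(frame_count, f)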

wangaoqi-waq commented 4 years ago

Hey, can you send me the converted VGG-16 models? I cannot download the link https://drive.google.com/file/d/1JydxdPMEHU7uJnRyi8A8uF82jSgE9FGe/view?usp=sharing even with a VPN. Thank you, my e-mail is 1045251489@qq.com