open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Training r2plus1d from scratch, severe overfitting #311

Closed longweiwei closed 3 years ago

longweiwei commented 4 years ago

Notice

There are several common situations in the reimplementation issues as below

  1. Reimplement a model in the model zoo using the provided configs
  2. Reimplement a model in the model zoo on other dataset (e.g., custom datasets)
  3. Reimplement a custom model but all the components are implemented in MMAction2
  4. Reimplement a custom model with new modules implemented by yourself

There are several things to do for different cases as below.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The issue has not been fixed in the latest version.

Describe the issue

According to the project's configuration file, training r2plus1d from scratch results in severe overfitting.

Reproduction

  1. What command or script did you run?
    
    ./tools/dist_train.sh  configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py  4
2. What config did you run?

A placeholder for the config.

3. Did you make any modifications on the code or config? Did you understand what you have modified?

The only change is that the data format was switched from frames (pictures) to videos; see the config sketch below.

4. What dataset did you use?

Kinetics-400
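
For context, a minimal sketch of what the frames-to-videos switch looks like in a config (based on the standard MMAction2 video pipeline for this model; the data paths are placeholders):

dataset_type = 'VideoDataset'                 # instead of 'RawframeDataset'
data_root = 'data/kinetics400/videos_train'   # placeholder path
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'

train_pipeline = [
    dict(type='DecordInit'),                  # open the video with decord
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    dict(type='DecordDecode'),                # decode only the sampled frames
    # ...the remaining transforms (resize, crop, flip, normalize, format) stay unchanged
]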

**Environment**

1. Please run `PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py` to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: Tesla V100-DGXS-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.168
GCC: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
PyTorch: 1.5.1+cu101
PyTorch compiling details: PyTorch built with:
TorchVision: 0.6.1+cu101
OpenCV: 4.4.0
MMCV: 1.1.5
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMAction2: 0.7.0+94895ec


2. You may add additional information that may be helpful for locating the problem, such as
    - How you installed PyTorch [e.g., pip, conda, source]
    - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)

The installation process is based on [this](https://github.com/open-mmlab/mmaction2/blob/master/docs/install.md)

**Results**

If applicable, paste the related results here, e.g., what you expect and what you get.
[my config file](https://github.com/longweiwei/errorReproduce/blob/main/mmaction2/other_files/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py)
[my training log file](https://github.com/longweiwei/errorReproduce/blob/main/mmaction2/other_files/20201031_010757.log)

The expected log file should look like [this](https://download.openmmlab.com/mmaction/recognition/r2plus1d/r2plus1d_r34_256p_8x8x1_180e_kinetics400_rgb/20200728_021421.log.json).

**Issue fix**

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

innerlee commented 4 years ago

What's the decord version? From which epoch does the overfitting become obvious? Have you properly rescaled the lr? Did you use the full Kinetics-400?

longweiwei commented 4 years ago

@innerlee

innerlee commented 4 years ago

Could you randomly select 100 videos, and for each video randomly access three frames and compare the frames decoded by decord and OpenCV?

longweiwei commented 3 years ago

@innerlee I completed the experiment following your method. The value of every frame from decord and OpenCV is the same. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    for i in range(frames):  # note: this compares every frame, not three random ones
        de_im1 = de[i].asnumpy()
        cv.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        _, cv_im1 = cv.read()
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        re.append(np.all(cv_im1 == de_im1))

print(re)
for b in re:
    if not b:
        print("false")

Actually, when I load the trained r2plus1d weights released by MMAction2 and run the test, the result matches the reported one, so the decord tool should be OK.

innerlee commented 3 years ago

Could you please try

randomly access three frames

longweiwei commented 3 years ago

@innerlee Sorry, I didn't read carefully. I tried it; the value of every sampled frame from decord and OpenCV is the same. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    three_frame = np.random.choice(frames, size = (3,), replace = False)
    for i in three_frame:
        de_im1 = de[i].asnumpy()
        cv.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        _, cv_im1 = cv.read()
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        re.append(np.all(cv_im1 == de_im1))
print(re)
for b in re:
    if not b:
        print("fff")
innerlee commented 3 years ago

For OpenCV, use sequential reading instead of cv.set(cv2.CAP_PROP_POS_FRAMES, i), because CAP_PROP_POS_FRAMES is inexact. For decord, keep the current form.

innerlee commented 3 years ago

ref https://github.com/dmlc/decord/pull/77#issue-443862889

longweiwei commented 3 years ago

@innerlee I tried once again; the result is the same as before. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

count = 0
re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    three_frame = np.random.choice(frames, size = (3,), replace = False)
    for i in range(frames):
        _, cv_im1 = cv.read()  # read frames sequentially instead of seeking with CAP_PROP_POS_FRAMES
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        if i in three_frame:
            de_im1 = de[i].asnumpy()
            if np.sum(abs(de_im1 - cv_im1)) != 0:
                count += 1
                break
print(count)

Thank you for your kind help.

innerlee commented 3 years ago

@dreamerlin any insight?

SuX97 commented 3 years ago

@innerlee

  • The decord version is 0.4.0. I found the newer versions (e.g., 0.4.1) always give wrong results.
  • After 25 epochs, overfitting becomes obvious.
  • The total batch size is 80. I think the lr should not have such a big impact.
  • Yes, I used the full Kinetics-400.

I noticed that in your log videos_per_gpu is set to 36, but the lr is still 0.1. The learning rate does have a great impact on convergence. The standard usage is lr 0.1 : 64 samples (mini-batch of 8 * 8 GPUs).

innerlee commented 3 years ago

yeah that's one suspect

longweiwei commented 3 years ago

@innerlee

  • The decord version is 0.4.0. I found the newer versions (e.g., 0.4.1) always give wrong results.
  • After 25 epochs, overfitting becomes obvious.
  • The total batch size is 80. I think the lr should not have such a big impact.
  • Yes, I used the full Kinetics-400.

I noticed that in your log videos_per_gpu is set to 36, but the lr is still 0.1. The learning rate does have a great impact on convergence. The standard usage is lr 0.1 : 64 samples (mini-batch of 8 * 8 GPUs).

Thank you for the reminder. I have a question: based on the linear scaling principle, should my batch size to learning rate ratio be kept equal to 64 : 0.1?

SuX97 commented 3 years ago

Thank you for the reminder. I have a question: based on the linear scaling principle, should my batch size to learning rate ratio be kept equal to 64 : 0.1?

Hi @longweiwei, if your total batch size is 80, then the lr in the config should be set to 80 / 64 * 0.1 = 0.125. Try it out.
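
For reference, a minimal sketch of that linear scaling arithmetic (the per-GPU split is an assumption; only the total batch size matters):

# Linear scaling rule (sketch): scale the reference lr by total_batch / ref_batch.
base_lr = 0.1        # reference lr for 64 videos per iteration (8 videos_per_gpu * 8 GPUs)
ref_batch = 64
total_batch = 80     # e.g. 4 GPUs * 20 videos_per_gpu (assumed split)
scaled_lr = base_lr * total_batch / ref_batch
print(scaled_lr)     # 0.125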

longweiwei commented 3 years ago

@SuX97 Roger that, thanks.

longweiwei commented 3 years ago

I accordingly increased the lr based on your default lr config, i.e., 0.1 per 64 videos per iteration. The performance on the Kinetics val set has improved somewhat, but the overfitting still exists. The best result on the validation set is only about 25%. See my train config and train log.

Any suggestions? Thanks.

SuX97 commented 3 years ago

That's weird. Could you please print out the tensors of each stage and the major layers (stem, res-stages, pools, fc)? I suspect that some change in the base class may have caused this issue. Also, check whether the number of videos in your Kinetics dataset matches the official one.
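
A minimal sketch of one way to do that with PyTorch forward hooks (not MMAction2-specific; `model` and the module names below are assumptions and should be adapted to your recognizer):

import torch

def make_hook(name):
    def hook(module, inputs, output):
        out = output if isinstance(output, torch.Tensor) else output[0]
        # print the shape and basic statistics of the layer output
        print(f'{name}: shape={tuple(out.shape)} '
              f'mean={out.float().mean().item():.4f} std={out.float().std().item():.4f}')
    return hook

# `model` is assumed to be the recognizer you built for training
watched = ('backbone.conv1', 'backbone.layer1', 'backbone.layer2',
           'backbone.layer3', 'backbone.layer4', 'cls_head.fc_cls')
for name, module in model.named_modules():
    if name in watched:
        module.register_forward_hook(make_hook(name))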

longweiwei commented 3 years ago

@SuX97 Thanks for your reminder. It turns out that my local videos are inconsistent with the official ones. I finally know the reason after being confused for so long.

innerlee commented 3 years ago

Yeah, the data is to blame.