open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Training r2plus1d from scratch, severe overfitting #311

Closed longweiwei closed 3 years ago

longweiwei commented 4 years ago

Notice

There are several common situations in the reimplementation issues as below

  1. Reimplement a model in the model zoo using the provided configs
  2. Reimplement a model in the model zoo on other dataset (e.g., custom datasets)
  3. Reimplement a custom model but all the components are implemented in MMAction2
  4. Reimplement a custom model with new modules implemented by yourself

There are several things to do for different cases as below.

Checklist

  1. I have searched related issues but cannot get the expected help.
  2. The issue has not been fixed in the latest version.

Describe the issue

According to the project's configuration file, training r2plus1d from scratch results in severe overfitting.

Reproduction

  1. What command or script did you run?
    
    ./tools/dist_train.sh  configs/recognition/r2plus1d/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py  4
2. What config did you run?

A placeholder for the config.

3. Did you make any modifications on the code or config? Did you understand what you have modified?

The only change is that the data format was switched from frames (pictures) to videos; see the config sketch below.

4. What dataset did you use?

Kinetics-400
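
For context, a minimal sketch of what the frames-to-videos switch looks like in a config (based on the standard MMAction2 video pipeline for this model; the data paths are placeholders):

dataset_type = 'VideoDataset'                 # instead of 'RawframeDataset'
data_root = 'data/kinetics400/videos_train'   # placeholder path
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'

train_pipeline = [
    dict(type='DecordInit'),                  # open the video with decord
    dict(type='SampleFrames', clip_len=8, frame_interval=8, num_clips=1),
    dict(type='DecordDecode'),                # decode only the sampled frames
    # ...the remaining transforms (resize, crop, flip, normalize, format) stay unchanged
]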

**Environment**

1. Please run `PYTHONPATH=${PWD}:$PYTHONPATH python mmaction/utils/collect_env.py` to collect necessary environment information and paste it here.

sys.platform: linux
Python: 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
CUDA available: True
GPU 0,1,2,3: Tesla V100-DGXS-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.168
GCC: gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
PyTorch: 1.5.1+cu101
PyTorch compiling details: PyTorch built with:
TorchVision: 0.6.1+cu101
OpenCV: 4.4.0
MMCV: 1.1.5
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 10.1
MMAction2: 0.7.0+94895ec


2. You may add additional information that may be helpful for locating the problem, such as
    - How you installed PyTorch [e.g., pip, conda, source]
    - Other environment variables that may be related (such as `$PATH`, `$LD_LIBRARY_PATH`, `$PYTHONPATH`, etc.)

The installation process is based on [this](https://github.com/open-mmlab/mmaction2/blob/master/docs/install.md)

**Results**

If applicable, paste the related results here, e.g., what you expect and what you get.
[my config file](https://github.com/longweiwei/errorReproduce/blob/main/mmaction2/other_files/r2plus1d_r34_8x8x1_180e_kinetics400_rgb.py)
[my training log file](https://github.com/longweiwei/errorReproduce/blob/main/mmaction2/other_files/20201031_010757.log)

The expected log file should look like [this](https://download.openmmlab.com/mmaction/recognition/r2plus1d/r2plus1d_r34_256p_8x8x1_180e_kinetics400_rgb/20200728_021421.log.json).

**Issue fix**

If you have already identified the reason, you can provide the information here. If you are willing to create a PR to fix it, please also leave a comment here and that would be much appreciated!

innerlee commented 4 years ago

What's the decord version? From which epoch does the overfitting become obvious? Have you properly rescaled the lr? Did you use the full Kinetics-400?

longweiwei commented 4 years ago

@innerlee

innerlee commented 4 years ago

Could you randomly select 100 videos, and for each video randomly access three frames and compare the frames decoded by decord and OpenCV?

longweiwei commented 3 years ago

@innerlee I completed the experiment following your method. The value of every frame from decord and OpenCV is the same. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    for i in range(frames):  # note: this compares every frame, not three random ones
        de_im1 = de[i].asnumpy()
        cv.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        _, cv_im1 = cv.read()
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        re.append(np.all(cv_im1 == de_im1))

print(re)
for b in re:
    if not b:
        print("false")

Actually, when I load the trained r2plus1d weights released by MMAction2 and run the test, the result matches the reported one, so the decord tool should be OK.

innerlee commented 3 years ago

Could you please try

randomly access three frames

longweiwei commented 3 years ago

@innerlee Sorry, I didn't read carefully. I tried it; the value of every sampled frame from decord and OpenCV is the same. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    three_frame = np.random.choice(frames, size = (3,), replace = False)
    for i in three_frame:
        de_im1 = de[i].asnumpy()
        cv.set(cv2.CAP_PROP_POS_FRAMES, i)  # seek to frame i
        _, cv_im1 = cv.read()
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        re.append(np.all(cv_im1 == de_im1))
print(re)
for b in re:
    if not b:
        print("fff")
innerlee commented 3 years ago

For OpenCV, use sequential reading instead of cv.set(cv2.CAP_PROP_POS_FRAMES, i), because CAP_PROP_POS_FRAMES is inexact. For decord, keep the current form.

innerlee commented 3 years ago

ref https://github.com/dmlc/decord/pull/77#issue-443862889

longweiwei commented 3 years ago

@innerlee I tried once again; the result is the same as before. The test code is as follows:

import cv2
import decord
import numpy as np
ann_file_train = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/kinetics400_train_list_videos.txt'
data_root = '/raid/Research/workspace/lw/mmaction3/data/kinetics400/videos_train'

file = open(ann_file_train, 'r')
lines = file.readlines()

arr = np.random.choice(len(lines), size = (100), replace = False)
print(arr)

count = 0
re = []
for index in arr:
    line = lines[index].strip()
    path, _ = line.split()
    path = data_root + '/' + path
    de = decord.VideoReader(path)
    cv = cv2.VideoCapture(path)
    frames = len(de)
    print(path)
    three_frame = np.random.choice(frames, size = (3,), replace = False)
    for i in range(frames):
        _, cv_im1 = cv.read()  # read frames sequentially instead of seeking with CAP_PROP_POS_FRAMES
        cv_im1 = cv2.cvtColor(cv_im1, cv2.COLOR_BGR2RGB)
        if i in three_frame:
            de_im1 = de[i].asnumpy()
            if np.sum(abs(de_im1 - cv_im1)) != 0:
                count += 1
                break
print(count)

Thank you for your kind help.

innerlee commented 3 years ago

@dreamerlin any insight?

SuX97 commented 3 years ago

@innerlee

  • The decord version is 0.4.0. I found the newer versions (e.g., 0.4.1) always give wrong results.
  • After 25 epochs, overfitting becomes obvious.
  • The total batch size is 80. I think the lr should not have such a big impact.
  • Yes, I used the full Kinetics-400.

I noticed that in your log videos_per_gpu is set to 36, but the lr is still 0.1. The learning rate does have a great impact on convergence. The standard usage is lr 0.1 : 64 samples (mini-batch of 8 * 8 GPUs).

innerlee commented 3 years ago

yeah that's one suspect

longweiwei commented 3 years ago

@innerlee

  • The decord version is 0.4.0. I found the newer versions (e.g., 0.4.1) always give wrong results.
  • After 25 epochs, overfitting becomes obvious.
  • The total batch size is 80. I think the lr should not have such a big impact.
  • Yes, I used the full Kinetics-400.

I noticed that in your log videos_per_gpu is set to 36, but the lr is still 0.1. The learning rate does have a great impact on convergence. The standard usage is lr 0.1 : 64 samples (mini-batch of 8 * 8 GPUs).

Thank you for the reminder. I have a question: based on the linear scaling principle, should my batch size to learning rate ratio be kept equal to 64 : 0.1?

SuX97 commented 3 years ago

Thank you for the reminder. I have a question: based on the linear scaling principle, should my batch size to learning rate ratio be kept equal to 64 : 0.1?

Hi @longweiwei, if your total batch size is 80, then the lr in the config should be set to 80 / 64 * 0.1 = 0.125. Try it out.
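
For reference, a minimal sketch of that linear scaling arithmetic (the per-GPU split is an assumption; only the total batch size matters):

# Linear scaling rule (sketch): scale the reference lr by total_batch / ref_batch.
base_lr = 0.1        # reference lr for 64 videos per iteration (8 videos_per_gpu * 8 GPUs)
ref_batch = 64
total_batch = 80     # e.g. 4 GPUs * 20 videos_per_gpu (assumed split)
scaled_lr = base_lr * total_batch / ref_batch
print(scaled_lr)     # 0.125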

longweiwei commented 3 years ago

@SuX97 Roger that, thanks.

longweiwei commented 3 years ago

I accordingly increased the lr based on your default lr config, i.e., 0.1 per 64 videos per iteration. The performance on the Kinetics val set has improved somewhat, but the overfitting still exists. The best result on the validation set is only about 25%. See my train config and train log.

Any suggestions? Thanks.

SuX97 commented 3 years ago

That's weird. Could you please print out the tensors of each stage and the major layers (stem, res-stages, pools, fc)? I suspect that some change in the base class may have caused this issue. Also, check whether the number of videos in your Kinetics dataset matches the official one.
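
A minimal sketch of one way to do that with PyTorch forward hooks (not MMAction2-specific; `model` and the module names below are assumptions and should be adapted to your recognizer):

import torch

def make_hook(name):
    def hook(module, inputs, output):
        out = output if isinstance(output, torch.Tensor) else output[0]
        # print the shape and basic statistics of the layer output
        print(f'{name}: shape={tuple(out.shape)} '
              f'mean={out.float().mean().item():.4f} std={out.float().std().item():.4f}')
    return hook

# `model` is assumed to be the recognizer you built for training
watched = ('backbone.conv1', 'backbone.layer1', 'backbone.layer2',
           'backbone.layer3', 'backbone.layer4', 'cls_head.fc_cls')
for name, module in model.named_modules():
    if name in watched:
        module.register_forward_hook(make_hook(name))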

longweiwei commented 3 years ago

@SuX97 Thanks for your reminder. It turns out that my local videos are inconsistent with the official ones. I finally know the reason after being confused for so long.

innerlee commented 3 years ago

Yeah, the data is to blame.