Bugs for processing MMPD dataset when the `DO_PREPROCESS` is set to False

Hi,

I found that the program throws `ValueError: ('train', 'File list empty. Check preprocessed data folder exists and is not empty.')` when the config file sets `DO_PREPROCESS: False` for MMPD, but not for UBFC-rPPG.

I traced the bug to https://github.com/ubicomplab/rPPG-Toolbox/blob/53b84584c2501f40ac925e141e7b908d1013d002/dataset/data_loader/BaseLoader.py#L487

The cause seems to be that in https://github.com/ubicomplab/rPPG-Toolbox/blob/53b84584c2501f40ac925e141e7b908d1013d002/dataset/data_loader/MMPDLoader.py#L56 the dict key `index` holds the video index of each subject rather than the subject index. In the UBFC-rPPG loader, the index is defined by `dirs = [{"index": re.search('subject(\d+)', data_dir).group(0), "path": data_dir} for data_dir in data_dirs]`, so it includes the string `subject`. That is not the case for MMPDLoader, which I think causes this bug.
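To illustrate the mismatch, here is a minimal sketch that uses `fnmatch` in place of `glob` so it runs without any files on disk. The cached filenames and the MMPD `index`/`subject` values are assumptions inferred from the patterns quoted in this issue, not taken from the loaders themselves.

```python
import fnmatch
import re

# Pattern template used in build_file_list_retroactive(), as quoted in this issue.
BASE_PATTERN = "{0}_input*.npy"

# UBFC-rPPG: group(0) returns the whole match, so the index keeps the "subject" prefix.
ubfc_index = re.search(r"subject(\d+)", "/data/UBFC-rPPG/subject12").group(0)

# Hypothetical cached filenames (assumed naming, for illustration only).
ubfc_cached = "subject12_input0.npy"
mmpd_cached = "subject1_0_input0.npy"

# Hypothetical MMPD entry: 'index' is the per-subject video index,
# 'subject' is the subject number.
mmpd_entry = {"index": 0, "subject": 1}

# Original pattern with a UBFC-rPPG index: matches.
ubfc_match = fnmatch.fnmatch(ubfc_cached, BASE_PATTERN.format(ubfc_index))
# Original pattern with an MMPD video index: no "subject" prefix, so no match.
mmpd_match = fnmatch.fnmatch(mmpd_cached, BASE_PATTERN.format(mmpd_entry["index"]))
# Pattern from the workaround in this issue, built from the subject number.
patched_match = fnmatch.fnmatch(
    mmpd_cached, "subject{0}_*_input*.npy".format(mmpd_entry["subject"])
)

print(ubfc_match, mmpd_match, patched_match)  # True False True
```

If the real MMPD cache names differ from the assumed `subject<N>_<video>_input<k>.npy` layout, the patterns would need to be adjusted accordingly.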

I worked around the issue by hard-coding two lines in this block:

    def build_file_list_retroactive(self, data_dirs, begin, end):
        """ If a file list has not already been generated for a specific data split build a list of files 
        used by the dataloader for the data split. Eg. list of files used for 
        train / val / test. Also saves the list to a .csv file.

        Args:
            data_dirs(List[str]): a list of video_files.
            begin(float): index of beginning during train/val split.
            end(float): index of ending during train/val split.
        Returns:
            None (this function does save a file-list .csv file to self.file_list_path)
        """

        # get data split based on begin and end indices.
        data_dirs_subset = self.split_raw_data(data_dirs, begin, end)

        # generate a list of unique raw-data file names
        filename_list = []
        for i in range(len(data_dirs_subset)):
            # filename_list.append(data_dirs_subset[i]['index'])
            filename_list.append(data_dirs_subset[i]['subject'])
        filename_list = list(set(filename_list))  # ensure all indexes are unique

        # generate a list of all preprocessed / chunked data files
        file_list = []
        for fname in filename_list:
            # processed_file_data = list(glob.glob(self.cached_path + os.sep + "{0}_input*.npy".format(fname)))
            processed_file_data = list(glob.glob(self.cached_path + os.sep + "subject{0}_*_input*.npy".format(fname)))
            file_list += processed_file_data

        if not file_list:
            raise ValueError(self.dataset_name,
                             'File list empty. Check preprocessed data folder exists and is not empty.')

        file_list_df = pd.DataFrame(file_list, columns=['input_files'])
        os.makedirs(os.path.dirname(self.file_list_path), exist_ok=True)
        file_list_df.to_csv(self.file_list_path)  # save file list to .csv

The commented-out lines are the original code.

Hi @Dylan-H-Wang,

I'm unable to test this myself at the moment. Just to make sure I understand the bug you noticed, since the issue title and your post seem a bit confusing together: are you saying that when pre-processing a dataset for the first time, if someone forgets to set DO_PREPROCESS to True, the appropriate error is thrown for the MMPD dataset but not for the UBFC-rPPG dataset? Or that the error is still shown even if you have the MMPD dataset preprocessed?

Once you clarify, I can look into this a bit further, though I'm not sure the fix you applied would work, since it may cause a similar issue for other datasets that rely on that index key.
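One way to address this concern, sketched here under assumptions, is to choose the glob pattern per entry: entries that carry a `subject` key use the subject-based pattern, while all others keep the original `index`-based pattern. The function name `cached_glob_pattern` is hypothetical and not part of the toolbox, and the sketch assumes no other loader uses a `subject` key with a different filename convention.

```python
def cached_glob_pattern(entry):
    """Return the glob pattern for one data_dirs entry (hypothetical helper).

    Entries with a 'subject' key (as in the MMPD workaround above) use the
    subject-based pattern; all others fall back to the original pattern.
    """
    if "subject" in entry:
        return "subject{0}_*_input*.npy".format(entry["subject"])
    return "{0}_input*.npy".format(entry["index"])


# UBFC-rPPG-style entry: the index already contains the "subject" prefix.
print(cached_glob_pattern({"index": "subject12"}))
# MMPD-style entry (values are illustrative).
print(cached_glob_pattern({"index": 0, "subject": 1}))
```

A cleaner alternative might be to let each loader override the pattern itself, so the base class never has to guess from the entry's keys.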

Here is my MMPD data preprocessing YAML file:

# use default train/val split with ratio 8/2
BASE: ['']
TRAIN:
  DATA:
    INFO:
      LIGHT: [1, 2, 3]  # 1 - LED-Low, 2 - LED-high, 3 - Incandescent, 4 - Nature
      MOTION: [1] # 1 - Stationary, 2 - Rotation, 3 - Talking, 4 - Walking
      EXERCISE: [2] # 1 - True, 2 - False
      SKIN_COLOR: [3] # Fitzpatrick Scale Skin Types - 3, 4, 5, 6
      GENDER: [1, 2]  # 1 - Male, 2 - Female
      GLASSER: [1, 2] # 1 - True, 2 - False
      HAIR_COVER: [1, 2] # 1 - True, 2 - False
      MAKEUP: [1, 2] # 1 - True, 2 - False
    FS: 30
    DATASET: MMPD
    DO_PREPROCESS: False                   # if first time, should be true
    DATA_FORMAT: NDCHW
    DATA_PATH:   "./data/MMPD/raw_data"          # Raw dataset path, need to be updated
    CACHED_PATH: "./data/MMPD/preprocessed_data"      # Processed dataset save path, need to be updated
    FILE_LIST_PATH: "./data/MMPD/preprocessed_data/data_file_list"    # Path to store file lists, needs to be updated
    EXP_DATA_NAME: ""
    BEGIN: 0.0
    END: 0.8
    PREPROCESS:
      DATA_TYPE: ['Standardized' ]
      LABEL_TYPE: Standardized
      DO_CHUNK: True
      CHUNK_LENGTH: 180
      CROP_FACE:
        DO_CROP_FACE: True
        USE_LARGE_FACE_BOX: True
        LARGE_BOX_COEF: 1.5
        DETECTION:
          DO_DYNAMIC_DETECTION: False
          DYNAMIC_DETECTION_FREQUENCY : 30
          USE_MEDIAN_FACE_BOX: False    # This should be used ONLY if dynamic detection is used
      RESIZE:
        H: 72
        W: 72
VALID:
  DATA:
    INFO:
      LIGHT: [1, 2, 3]  # 1 - LED-Low, 2 - LED-high, 3 - Incandescent, 4 - Nature
      MOTION: [1] # 1 - Stationary, 2 - Rotation, 3 - Talking, 4 - Walking
      EXERCISE: [2] # 1 - True, 2 - False
      SKIN_COLOR: [3] # Fitzpatrick Scale Skin Types - 3, 4, 5, 6
      GENDER: [1, 2]  # 1 - Male, 2 - Female
      GLASSER: [1, 2] # 1 - True, 2 - False
      HAIR_COVER: [1, 2] # 1 - True, 2 - False
      MAKEUP: [1, 2] # 1 - True, 2 - False
    FS: 30
    DATASET: MMPD
    DO_PREPROCESS: False                # if first time, should be true
    DATA_FORMAT: NDCHW
    DATA_PATH:   "./data/MMPD/raw_data"          # Raw dataset path, need to be updated
    CACHED_PATH: "./data/MMPD/preprocessed_data"      # Processed dataset save path, need to be updated
    FILE_LIST_PATH: "./data/MMPD/preprocessed_data/data_file_list"    # Path to store file lists, needs to be updated
    EXP_DATA_NAME: ""
    BEGIN: 0.8
    END: 1.0
    PREPROCESS:
      DATA_TYPE: [ 'Standardized' ]
      LABEL_TYPE: Standardized
      DO_CHUNK: True
      CHUNK_LENGTH: 180
      CROP_FACE:
        DO_CROP_FACE: True
        USE_LARGE_FACE_BOX: True
        LARGE_BOX_COEF: 1.5
        DETECTION:
          DO_DYNAMIC_DETECTION: False
          DYNAMIC_DETECTION_FREQUENCY : 30
          USE_MEDIAN_FACE_BOX: False    # This should be used ONLY if dynamic detection is used
      RESIZE:
        H: 72
        W: 72
TEST:
  USE_LAST_EPOCH: False
  DATA:
    INFO:
      LIGHT: [1, 2, 3]  # 1 - LED-Low, 2 - LED-high, 3 - Incandescent, 4 - Nature
      MOTION: [1] # 1 - Stationary, 2 - Rotation, 3 - Talking, 4 - Walking
      EXERCISE: [2] # 1 - True, 2 - False
      SKIN_COLOR: [3] # Fitzpatrick Scale Skin Types - 3, 4, 5, 6
      GENDER: [1, 2]  # 1 - Male, 2 - Female
      GLASSER: [1, 2] # 1 - True, 2 - False
      HAIR_COVER: [1, 2] # 1 - True, 2 - False
      MAKEUP: [1, 2] # 1 - True, 2 - False
    FS: 30
    DATASET: MMPD
    DO_PREPROCESS: True                   # if first time, should be true
    DATA_FORMAT: NDCHW
    DATA_PATH:   "./data/MMPD/raw_data"          # Raw dataset path, need to be updated
    CACHED_PATH: "./data/MMPD/preprocessed_data"      # Processed dataset save path, need to be updated
    FILE_LIST_PATH: "./data/MMPD/preprocessed_data/data_file_list"    # Path to store file lists, needs to be updated
    EXP_DATA_NAME: ""
    BEGIN: 0.0
    END: 1.0
    PREPROCESS:
      DATA_TYPE: ['Standardized' ]
      LABEL_TYPE: Standardized
      DO_CHUNK: True
      CHUNK_LENGTH: 180
      CROP_FACE:
        DO_CROP_FACE: True
        USE_LARGE_FACE_BOX: True
        LARGE_BOX_COEF: 1.5
        DETECTION:
          DO_DYNAMIC_DETECTION: False
          DYNAMIC_DETECTION_FREQUENCY : 30
          USE_MEDIAN_FACE_BOX: False    # This should be used ONLY if dynamic detection is used
      RESIZE:
        H: 72
        W: 72

I also modified the code a bit so that test_loader runs first and generates test_file_list, and then train_loader and val_loader run sequentially. That way the whole dataset is preprocessed first, and train_file_list and val_file_list are generated afterwards without preprocessing the data again.

The bug happens after the whole dataset has been preprocessed and test_file_list has been generated. Since I specified that the train and val data do not need DO_PREPROCESS, build_file_list_retroactive() is called (the preprocessed files were generated in the previous step), and then the bug described above occurs.

Can you try reproducing this bug without the changes you mentioned you made to test_loader, or elsewhere? I can't seem to reproduce this bug on my end, and I'm curious if something you might've changed locally might be causing it or if maybe there is some other small difference I'm not noticing.

Also, I'm a bit confused on two more things:

1) What is the config file you mentioned in your last reply supposed to be for? It doesn't seem like it's for proper intra-dataset training since it wouldn't make sense to use 80% for training, 20% for validation of the same data, and then just test on 100% of the same data. If that's supposed to be an intra-dataset config file, you should train on 60%, validate on 20%, and test on 20%, or just train on 80% and test on 20% without any kind of validation split if you end up not using validation.

2) Are you sure you have the latest version of the repo and still had to make the changes you mentioned you made? I'm especially confused since my understanding was that, if you did a proper intra-dataset split like 60%, 20%, 20%, no data would have to be preprocessed for a second time when preprocessing either of the latter two splits. If you wanted, you could even preprocess 100% of the MMPD data in the train split with certain filtering parameters (e.g., skin tones), and then use 100% for the subsequent splits with different filtering parameters (if for some reason you were alright with having the same subjects between splits, but different lighting conditions).

Generally I recommend using splits where subjects are unique to each split and, for MMPD in particular, you can filter so that you only sub-select certain videos (e.g., with certain lighting, skin tone corresponding to the subject, etc.).
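As a rough illustration of subject-unique splits, the sketch below slices a sorted list of subject IDs by `BEGIN`/`END` fractions and keeps every video with its subject. It is not the toolbox's `split_raw_data` implementation; the helper name and the entry layout are assumptions.

```python
def split_by_subject(data_dirs, begin, end):
    """Keep entries whose subject falls in the [begin, end) fraction of subjects."""
    subjects = sorted({entry["subject"] for entry in data_dirs})
    # round() avoids off-by-one errors from float products such as 0.29 * 100.
    lo = round(begin * len(subjects))
    hi = round(end * len(subjects))
    chosen = set(subjects[lo:hi])
    return [entry for entry in data_dirs if entry["subject"] in chosen]


# Illustrative entries: 5 subjects with 2 videos each.
data_dirs = [{"subject": s, "index": v} for s in range(1, 6) for v in range(2)]

train = split_by_subject(data_dirs, 0.0, 0.6)
valid = split_by_subject(data_dirs, 0.6, 0.8)
test = split_by_subject(data_dirs, 0.8, 1.0)

for name, split in (("train", train), ("valid", valid), ("test", test)):
    print(name, sorted({e["subject"] for e in split}))
```

Because the fractions are applied to subjects rather than to individual videos, no subject can end up in more than one split.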

I know my config looks strange for intra-dataset training, but that is not its aim. In short, my understanding is that I first preprocess 100% of the data, and then the train/val split preprocessing only involves file_list generation, without processing the video data again.
I am sure I am using the latest version.

It is strange that you cannot reproduce the bug, because I thought it should happen whenever build_file_list_retroactive() is called for MMPD. Anyway, I've pasted my data preprocessing code here. It is based on main.py, since I want to separate data preprocessing from the training pipeline. You can run it with python preprocess_dataset.py --config_file CONFIG_NAME.yaml. The config file is the one I pasted in my previous reply.

""" 
preprocess_dataset.py
The main function of rPPG dataset preprocessing.
"""
import os
import sys
import random
import argparse

import torch
import numpy as np

from config import get_config
from dataset import data_loader

RANDOM_SEED = 100
torch.manual_seed(RANDOM_SEED)
torch.cuda.manual_seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
# Create a general generator for use with the validation dataloader,
# the test dataloader, and the unsupervised dataloader
general_generator = torch.Generator()
general_generator.manual_seed(RANDOM_SEED)
# Create a training generator to isolate the train dataloader from
# other dataloaders and better control non-deterministic behavior
train_generator = torch.Generator()
train_generator.manual_seed(RANDOM_SEED)

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

def add_args(parser):
    """Adds arguments for parser."""
    parser.add_argument(
        "--config_file",
        required=False,
        default="configs/train_configs/PURE_PURE_UBFC-rPPG_TSCAN_BASIC.yaml",
        type=str,
        help="Path to the YAML config file.",
    )
    """Neural Method Sample YAML LIST:
      SCAMPS_SCAMPS_UBFC-rPPG_TSCAN_BASIC.yaml
      SCAMPS_SCAMPS_UBFC-rPPG_DEEPPHYS_BASIC.yaml
      SCAMPS_SCAMPS_UBFC-rPPG_PHYSNET_BASIC.yaml
      SCAMPS_SCAMPS_PURE_DEEPPHYS_BASIC.yaml
      SCAMPS_SCAMPS_PURE_TSCAN_BASIC.yaml
      SCAMPS_SCAMPS_PURE_PHYSNET_BASIC.yaml
      PURE_PURE_UBFC-rPPG_TSCAN_BASIC.yaml
      PURE_PURE_UBFC-rPPG_DEEPPHYS_BASIC.yaml
      PURE_PURE_UBFC-rPPG_PHYSNET_BASIC.yaml
      PURE_PURE_MMPD_TSCAN_BASIC.yaml
      UBFC-rPPG_UBFC-rPPG_PURE_TSCAN_BASIC.yaml
      UBFC-rPPG_UBFC-rPPG_PURE_DEEPPHYS_BASIC.yaml
      UBFC-rPPG_UBFC-rPPG_PURE_PHYSNET_BASIC.yaml
      MMPD_MMPD_UBFC-rPPG_TSCAN_BASIC.yaml
    Unsupervised Method Sample YAML LIST:
      PURE_UNSUPERVISED.yaml
      UBFC-rPPG_UNSUPERVISED.yaml
    """
    return parser

if __name__ == "__main__":
    # parse arguments.
    parser = argparse.ArgumentParser()
    parser = add_args(parser)
    parser = data_loader.BaseLoader.BaseLoader.add_data_loader_args(parser)
    args = parser.parse_args()

    # configurations.
    config = get_config(args)
    print("Configuration:")
    print(config, end="\n\n")

    # test_loader
    if config.TEST.DATA.DATASET == "UBFC-rPPG":
        test_loader = data_loader.UBFCrPPGLoader.UBFCrPPGLoader
    elif config.TEST.DATA.DATASET == "PURE":
        test_loader = data_loader.PURELoader.PURELoader
    elif config.TEST.DATA.DATASET == "SCAMPS":
        test_loader = data_loader.SCAMPSLoader.SCAMPSLoader
    elif config.TEST.DATA.DATASET == "MMPD":
        test_loader = data_loader.MMPDLoader.MMPDLoader
    elif config.TEST.DATA.DATASET == "BP4DPlus":
        test_loader = data_loader.BP4DPlusLoader.BP4DPlusLoader
    elif config.TEST.DATA.DATASET == "BP4DPlusBigSmall":
        test_loader = data_loader.BP4DPlusBigSmallLoader.BP4DPlusBigSmallLoader
    elif config.TEST.DATA.DATASET == "UBFC-PHYS":
        test_loader = data_loader.UBFCPHYSLoader.UBFCPHYSLoader
    elif config.TEST.DATA.DATASET is None:
        print("Test dataset preprocessing is skipped.", end="\n\n")
    else:
        raise ValueError(
            "Unsupported dataset! Currently supporting UBFC-rPPG, PURE, MMPD, \
                            SCAMPS, BP4D+ (Normal and BigSmall preprocessing), and UBFC-PHYS."
        )

    # Create and initialize the test dataloader.
    test_data = test_loader(
        name="test", data_path=config.TEST.DATA.DATA_PATH, config_data=config.TEST.DATA
    )

    # train_loader
    if config.TRAIN.DATA.DATASET == "UBFC-rPPG":
        train_loader = data_loader.UBFCrPPGLoader.UBFCrPPGLoader
    elif config.TRAIN.DATA.DATASET == "PURE":
        train_loader = data_loader.PURELoader.PURELoader
    elif config.TRAIN.DATA.DATASET == "SCAMPS":
        train_loader = data_loader.SCAMPSLoader.SCAMPSLoader
    elif config.TRAIN.DATA.DATASET == "MMPD":
        train_loader = data_loader.MMPDLoader.MMPDLoader
    elif config.TRAIN.DATA.DATASET == "BP4DPlus":
        train_loader = data_loader.BP4DPlusLoader.BP4DPlusLoader
    elif config.TRAIN.DATA.DATASET == "BP4DPlusBigSmall":
        train_loader = data_loader.BP4DPlusBigSmallLoader.BP4DPlusBigSmallLoader
    elif config.TRAIN.DATA.DATASET == "UBFC-PHYS":
        train_loader = data_loader.UBFCPHYSLoader.UBFCPHYSLoader
    elif config.TRAIN.DATA.DATASET is None:
        print("Train dataset preprocessing is skipped.", end="\n\n")
    else:
        raise ValueError(
            "Unsupported dataset! Currently supporting UBFC-rPPG, PURE, MMPD, \
                            SCAMPS, BP4D+ (Normal and BigSmall preprocessing), and UBFC-PHYS."
        )

    # Create and initialize the train dataloader
    train_data_loader = train_loader(
        name="train", data_path=config.TRAIN.DATA.DATA_PATH, config_data=config.TRAIN.DATA
    )

    # valid_loader
    if config.VALID.DATA.DATASET == "UBFC-rPPG":
        valid_loader = data_loader.UBFCrPPGLoader.UBFCrPPGLoader
    elif config.VALID.DATA.DATASET == "PURE":
        valid_loader = data_loader.PURELoader.PURELoader
    elif config.VALID.DATA.DATASET == "SCAMPS":
        valid_loader = data_loader.SCAMPSLoader.SCAMPSLoader
    elif config.VALID.DATA.DATASET == "MMPD":
        valid_loader = data_loader.MMPDLoader.MMPDLoader
    elif config.VALID.DATA.DATASET == "BP4DPlus":
        valid_loader = data_loader.BP4DPlusLoader.BP4DPlusLoader
    elif config.VALID.DATA.DATASET == "BP4DPlusBigSmall":
        valid_loader = data_loader.BP4DPlusBigSmallLoader.BP4DPlusBigSmallLoader
    elif config.VALID.DATA.DATASET == "UBFC-PHYS":
        valid_loader = data_loader.UBFCPHYSLoader.UBFCPHYSLoader
    elif config.VALID.DATA.DATASET is None:
        print("Validation dataset preprocessing is skipped.", end="\n\n")
    else:
        raise ValueError(
            "Unsupported dataset! Currently supporting UBFC-rPPG, PURE, MMPD, \
                            SCAMPS, BP4D+ (Normal and BigSmall preprocessing), and UBFC-PHYS."
        )

    # Create and initialize the valid dataloader.
    valid_data = valid_loader(
        name="valid", data_path=config.VALID.DATA.DATA_PATH, config_data=config.VALID.DATA
    )

Hi @Dylan-H-Wang,

I think I understand what you're saying now, but I should stress this seems more like a feature request than a bug per se. The base functionality you're describing as a bug actually works as intended: in any preprocessing run, existing preprocessed data is not reused, and the data is always preprocessed (as if for the first time) alongside a new data file list being generated. If there isn't an explicit data file list generated from such a preprocessing run, you have to either re-preprocess the data or create and/or modify a data file list yourself.

For your purposes, if you wanted, you could preprocess 100% of the dataset and then create the data file lists yourself using the instructions in this section of the README. We can of course make this easier to do in the future by automating the actual data file list generation process while noting data that has already been preprocessed, if the user specifies that.
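A minimal sketch of generating such a file list by hand, assuming the cached chunks follow the `subject<N>_*_input*.npy` naming seen earlier in this thread and that the loader reads an `input_files` column, as in `build_file_list_retroactive()`. The function name is hypothetical, and the standard `csv` module stands in for `pandas` while mimicking the default layout of `DataFrame.to_csv` (an unnamed index column followed by `input_files`).

```python
import csv
import glob
import os


def write_file_list(cached_path, subjects, file_list_path):
    """Write a file-list CSV for the given subjects (hypothetical helper)."""
    files = []
    for subject in subjects:
        # Assumed cache naming; adjust the pattern if your files differ.
        pattern = os.path.join(cached_path, "subject{0}_*_input*.npy".format(subject))
        files.extend(sorted(glob.glob(pattern)))
    if not files:
        raise ValueError("No preprocessed files found in " + cached_path)

    os.makedirs(os.path.dirname(file_list_path) or ".", exist_ok=True)
    with open(file_list_path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        # Header row: empty index column name, then the data column.
        writer.writerow(["", "input_files"])
        for i, path in enumerate(files):
            writer.writerow([i, path])
    return files
```

Calling it once per split with the subject IDs you want in that split would produce one CSV per split. The output path has to match whatever file-list name the toolbox expects for your config, which this sketch does not try to derive.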

I'm going to go ahead and close this issue while noting this as a possible feature in the future, but let me know if I somehow still misunderstood the problem at hand.

Bugs for processing MMPD dataset when the `DO_PREPROCESS` is set to False #210