Automatic train-val split

czarmanu commented 2 years ago

Is there an option to play around with the size of the train and val sets. For example, use x% of the train set as val set, instead of predefining both train and val sets manually at the beginning?

samleoqh commented 2 years ago

You can change kf, k_folder,in train_args for cross-validation training e.g., you can set kf=0, k_folder=5, means you use 5-folder for validation, the current training loop using folder0, Note kf < k_folder, if kf=k_folder=0as defaut, means no CV training.

To train with cross-validation, you need to manually set kf from 0, 1, ..., til k_folder-1 (e.g., 4)

train_args = agriculture_configs(net_name='MSCG-Rx50',
                                 data='Agriculture',
                                 bands_list=['NIR', 'RGB'],
                                 kf=0,  k_folder=5,                      # change kf to 0,1,2,3,4 for CV training 
                                 note='reproduce_ACW_loss2_adax'
                                 )

czarmanu commented 2 years ago

kf =0 and k_folder=5 did not automatically create 20% of val set from the train set and threw the following error:

----------creating groundtruth data for training./.val--------------- Traceback (most recent call last): File "/scratch/manu/MSCG-Net-master_selftrained/./tools/train_ethz.py", line 29, in prepare_gt(VAL_ROOT) File "/scratch/manu/MSCG-Net-master_selftrained/data/AgricultureVision/pre_process.py", line 82, in prepare_gt check_mkdir(os.path.join(root_folder, out_path)) File "/scratch/manu/MSCG-Net-master_selftrained/data/AgricultureVision/pre_process.py", line 68, in check_mkdir os.mkdir(dir_name) FileNotFoundError: [Errno 2] No such file or directory: '/home/pf/pfstaff/projects/mtom/mTom_Sat_Data_Fusion/Comparison/supervised/ethz_dataset/val/gt'

Does that mean a dummy val set need to be still provided?

czarmanu commented 2 years ago

reminder :)

samleoqh commented 2 years ago

This split is specially designed for agriculture-vision dataset, that only split the offcial val-set by kfolders, not split the training set. If you want only split train set into train and val set, you need to modify the function split_train_val_test_sets like;

def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR','RGB'], KF=3, k=1, seeds=69278):

    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)
    VAL_ROOT = TRAIN_ROOT
    # val_id, v_list = get_training_list(root_folder=VAL_ROOT, count_label=False) # if you dont have val folder

    if KF >=2:
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        val_ids = np.array(v_list)
        idx = list(kf.split(np.array(val_ids)))
        if k >= KF:  # k should not be out of KF range, otherwise set k = 0
            k = 0
        t2_list, v_list = list(val_ids[idx[k][0]]), list(val_ids[idx[k][1]])
   else:
        print("k folders shoule be larger than k")
        return -1

    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])

    val_folders = [os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name][band]) for band in bands]
    val_gt_folder = os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name]['GT'])

    train_dict = {
        IDS: train_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in t2_list],
        GT: [gt_folder.format(id) for id in t_list] + [val_gt_folder.format(id) for id in t2_list],
        'all_files': t2_list
    }

    val_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }

    # here set test_dict = val_dict, not real test-set
    test_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }

    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict

czarmanu commented 2 years ago

Is there a way to deactivate the val set and use only train and test sets?

samleoqh commented 2 years ago

I'm not sure what your point is. If you intended to use test-set servering as the val-set, simply change the val-folder to the test-folder.

czarmanu commented 2 years ago

If I use test set as val set in the iterative training process, the model will overfit on this val set. What I want is a model trained agnostic of the test set using either a setting with no val set (during training) or a val set which is a subset of the train set itself (created randomly and automaticall and not manually before teh training starts). I hope I didn't confuse you.

samleoqh commented 2 years ago

Now I see, lets saying you have a training set with 100 images, and a test-set with 50 images, you can split the training set into e.g., train/val 80/20 randomly with 5 folders, then you can train your model with these 5-folder to get 5 best ckpt weights, then you can test your trained 5 ckpt weights in the test-set (50 images) seperately or ensemblling some or all of them as you like. This is the most common way.

Another way as you said, 'no-val-set', you train you model with all 100 images without validation, however, if you don't validate the model during trainiing, you still need to save the best ckpt weighting every a certain number of epoch (e.g., 200 epoches, etc) either based on the best loss or best metric (e.g., F1) evaluted on all 100 images (using whole train-set itself as val-set), or evaluted on a randomly seleted part of the 100 images. If so, you need to modify your training pipeline accordingly. I think it's possible and not hard to implement.

czarmanu commented 2 years ago

e.g., train/val 80/20 randomly with 5 folders, then you can train your model with these 5-folder to get 5 best ckpt weights, then you can test your trained 5 ckpt weights in the test-set (50 images) seperately or ensemblling some or all of them as you like. This is the most common way.

Yes,I want this option but the 80/20 split must be done automatically and not manually. Is that possible with the current code?

samleoqh commented 2 years ago

Yes, it's possible, just slightly need to change the code split_train_val_test_sets as following ( might be some bugs, you can further modify it).

# change DATASET ROOT to your dataset path
DATASET_ROOT = '/media/liu/diskb/data/Agriculture-Vision'

TRAIN_ROOT = os.path.join(DATASET_ROOT, 'train')

def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR','RGB'], KF=5, k=0, seeds=69278):

    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)
    VAL_ROOT = TRAIN_ROOT

    if KF >2:  # must to be larger than 1
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        val_ids = np.array(t_list)
        idx = list(kf.split(val_ids))
        if k >= KF:  # k should not be out of KF range, otherwise set k = 0
            k = 0
        tr_list, v_list = list(val_ids[idx[k][0]]), list(val_ids[idx[k][1]])
   else:
        print("k folders shoule be larger than k")
        return -1

    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])

    val_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    val_gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])

    train_dict = {
        IDS: train_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in tr_list],
        GT: [val_gt_folder.format(id) for id in tr_list],
        'all_files': tr_list
    }

    val_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }

    # here set test_dict = val_dict, not real test-set
    test_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }

    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict

czarmanu commented 2 years ago

I will try with the above. But, I did not understand why you declared the val and test dicts similarly?

samleoqh commented 2 years ago

test_dict is not used at all during training, you can safely deleted it if you want, and just return train_dict, and val_dict. I leave test_dict here just for furture modification and debug on real test set etc.

samleoqh commented 2 years ago

the code was not written well and contains some confused names / redundant stuff / bugs without refactoring after completing the agriculture workshop. You need to choose the useful part and rewrite them as you want.

samleoqh / MSCG-Net

Automatic train-val split #23