You can change kf and k_folder in train_args for cross-validation training. For example, kf=0, k_folder=5 means 5-fold cross-validation, with the current training run using fold 0 for validation. Note that kf < k_folder must hold; the default kf=k_folder=0 means no CV training. To train with cross-validation, you need to manually set kf to each of 0, 1, ..., k_folder-1 (e.g., 4) in turn:
train_args = agriculture_configs(net_name='MSCG-Rx50',
                                 data='Agriculture',
                                 bands_list=['NIR', 'RGB'],
                                 kf=0, k_folder=5,  # change kf to 0,1,2,3,4 for CV training
                                 note='reproduce_ACW_loss2_adax')
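For reference, a minimal sketch (dummy ids, not repo code) of how kf selects one of the k_folder splits, mirroring the split-then-index pattern used in split_train_val_test_sets:

import numpy as np
from sklearn.model_selection import KFold

ids = np.array(['img_%d' % i for i in range(10)])  # hypothetical image ids
k_folder, kf = 5, 0                                # kf must be in {0, ..., k_folder-1}

splits = list(KFold(n_splits=k_folder, shuffle=True, random_state=69278).split(ids))
train_idx, val_idx = splits[kf]                    # kf picks which fold is held out
print('train ids:', ids[train_idx])
print('val ids  :', ids[val_idx])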
Setting kf=0 and k_folder=5 did not automatically create a 20% val set from the train set, and training threw the following error:
----------creating groundtruth data for training./.val---------------
Traceback (most recent call last):
File "/scratch/manu/MSCG-Net-master_selftrained/./tools/train_ethz.py", line 29, in
Does that mean a dummy val set still needs to be provided?
reminder :)
This split is specially designed for the Agriculture-Vision dataset: it only splits the official val set into k folds; it does not split the training set. If you want to split only the train set into train and val sets, you need to modify the function split_train_val_test_sets, e.g. along these lines:
import os
import numpy as np
from sklearn.model_selection import KFold

def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR', 'RGB'], KF=3, k=1, seeds=69278):
    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)
    # VAL_ROOT = TRAIN_ROOT  # if you don't have a val folder, reuse the train folder here
    val_id, v_list = get_training_list(root_folder=VAL_ROOT, count_label=False)

    if KF >= 2:
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        val_ids = np.array(v_list)
        idx = list(kf.split(val_ids))
        if k >= KF:  # k should not be out of KF range, otherwise set k = 0
            k = 0
        t2_list, v_list = list(val_ids[idx[k][0]]), list(val_ids[idx[k][1]])
    else:
        print("KF should be >= 2")
        return -1

    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])
    val_folders = [os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name][band]) for band in bands]
    val_gt_folder = os.path.join(data_folder[name]['ROOT'], 'val', data_folder[name]['GT'])

    # train set = whole official train folder + the k-th fold's training part of the val folder
    train_dict = {
        IDS: train_id,
        IMG: [[img_folder.format(id) for img_folder in img_folders] for id in t_list]
             + [[val_folder.format(id) for val_folder in val_folders] for id in t2_list],
        GT: [gt_folder.format(id) for id in t_list] + [val_gt_folder.format(id) for id in t2_list],
        'all_files': t_list + t2_list
    }
    val_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }
    # here test_dict = val_dict, not a real test set
    test_dict = {
        IDS: val_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }
    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict
Is there a way to deactivate the val set and use only the train and test sets?
I'm not sure what your point is. If you intend to use the test set as the val set, simply change the val folder to the test folder.
If I use the test set as the val set in the iterative training process, the model will overfit on this val set. What I want is a model trained agnostic of the test set, using either a setting with no val set (during training), or a val set that is a subset of the train set itself (created randomly and automatically, not manually before training starts). I hope I didn't confuse you.
Now I see. Let's say you have a training set with 100 images and a test set with 50 images. You can split the training set into, e.g., train/val 80/20 randomly with 5 folds, then train your model on these 5 folds to get 5 best checkpoint weights, and then test the 5 trained checkpoints on the test set (50 images) separately, or ensemble some or all of them as you like. This is the most common way.
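A rough, self-contained sketch of that workflow; train_one_fold and predict_proba are made-up placeholders, not the repo's API:

import numpy as np
from sklearn.model_selection import KFold

train_ids = ['train_%d' % i for i in range(100)]   # hypothetical 100 training images
test_ids = ['test_%d' % i for i in range(50)]      # hypothetical 50 test images
num_classes = 7                                    # assumed class count; adjust to your data

def train_one_fold(fold, tr_ids, va_ids):          # placeholder: train, return a "model"
    return {'fold': fold}

def predict_proba(model, ids):                     # placeholder: per-image class scores
    rng = np.random.default_rng(model['fold'])
    return rng.random((len(ids), num_classes))

fold_probs = []
for fold, (tr_idx, va_idx) in enumerate(KFold(n_splits=5, shuffle=True, random_state=0).split(train_ids)):
    model = train_one_fold(fold, [train_ids[i] for i in tr_idx],
                           [train_ids[i] for i in va_idx])    # 80/20 split per fold
    fold_probs.append(predict_proba(model, test_ids))         # one prediction per checkpoint

ensemble = np.mean(fold_probs, axis=0)             # average the 5 checkpoints' predictions
print(ensemble.shape)                              # (50, 7)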
Another way, as you said, is 'no val set': you train your model on all 100 images without validation. However, if you don't validate the model during training, you still need to save the best checkpoint weights every certain number of epochs (e.g., every 200 epochs), based either on the best loss or on the best metric (e.g., F1), evaluated either on all 100 images (using the whole train set itself as the val set) or on a randomly selected part of the 100 images. If so, you need to modify your training pipeline accordingly. I think it's possible and not hard to implement.
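A hedged sketch of that 'no val set' option with a toy PyTorch loop; the tiny model and random tensors are placeholders for the real pipeline:

import torch
import torch.nn as nn

model = nn.Linear(16, 2)                           # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))  # dummy "whole train set"

best_loss, eval_every = float('inf'), 5
for epoch in range(1, 51):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
    if epoch % eval_every == 0:                    # periodic check on the train set itself
        with torch.no_grad():
            train_loss = loss_fn(model(x), y).item()
        if train_loss < best_loss:                 # keep only the best checkpoint
            best_loss = train_loss
            torch.save(model.state_dict(), 'best_ckpt.pth')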
> e.g., train/val 80/20 randomly with 5 folds, then train your model on these 5 folds to get 5 best checkpoint weights, and then test the 5 trained checkpoints on the test set (50 images) separately, or ensemble some or all of them as you like. This is the most common way.

Yes, I want this option, but the 80/20 split must be done automatically and not manually. Is that possible with the current code?
Yes, it's possible; you just need to slightly change the code of split_train_val_test_sets as follows (there might be some bugs, you can further modify it):
# change DATASET_ROOT to your dataset path
DATASET_ROOT = '/media/liu/diskb/data/Agriculture-Vision'
TRAIN_ROOT = os.path.join(DATASET_ROOT, 'train')

def split_train_val_test_sets(data_folder=Data_Folder, name='Agriculture', bands=['NIR', 'RGB'], KF=5, k=0, seeds=69278):
    train_id, t_list = get_training_list(root_folder=TRAIN_ROOT, count_label=False)

    if KF >= 2:  # KF must be larger than 1
        kf = KFold(n_splits=KF, shuffle=True, random_state=seeds)
        all_ids = np.array(t_list)
        idx = list(kf.split(all_ids))
        if k >= KF:  # k should not be out of KF range, otherwise set k = 0
            k = 0
        tr_list, v_list = list(all_ids[idx[k][0]]), list(all_ids[idx[k][1]])
    else:
        print("KF should be >= 2")
        return -1

    # both splits now come from the 'train' folder, so the "val" folders are the
    # train folders under another name
    img_folders = [os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name][band]) for band in bands]
    gt_folder = os.path.join(data_folder[name]['ROOT'], 'train', data_folder[name]['GT'])
    val_folders = img_folders
    val_gt_folder = gt_folder

    train_dict = {
        IDS: train_id,
        IMG: [[img_folder.format(id) for img_folder in img_folders] for id in tr_list],
        GT: [gt_folder.format(id) for id in tr_list],
        'all_files': tr_list
    }
    val_dict = {
        IDS: train_id,  # val ids are a subset of the train ids in this variant
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
        'all_files': v_list
    }
    # here test_dict = val_dict, not a real test set
    test_dict = {
        IDS: train_id,
        IMG: [[val_folder.format(id) for val_folder in val_folders] for id in v_list],
        GT: [val_gt_folder.format(id) for id in v_list],
    }
    print('train set -------', len(train_dict[GT]))
    print('val set ---------', len(val_dict[GT]))
    return train_dict, val_dict, test_dict
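With this version, the 80/20 split happens automatically on every call; a hypothetical launcher loop (not existing repo code) could then train one model per fold:

for k in range(5):
    train_dict, val_dict, test_dict = split_train_val_test_sets(KF=5, k=k)
    # ... build data loaders from train_dict / val_dict and train fold k ...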
I will try with the above. But I did not understand why you declared the val and test dicts identically?
test_dict is not used at all during training; you can safely delete it if you want and just return train_dict and val_dict. I left test_dict here just for future modification and debugging on a real test set, etc.
The code was not written well; it contains some confusing names, redundant stuff, and bugs, and was never refactored after the Agriculture-Vision workshop. You need to pick out the useful parts and rewrite them as you want.
Is there an option to play around with the sizes of the train and val sets? For example, use x% of the train set as the val set, instead of predefining both train and val sets manually at the beginning?