Multiple train jobs at the same time might affect each other's cached data

valentinitnelav commented 2 years ago

I need to be careful when I run multiple models at the same time on the same dataset, because each YOLO version generates its own .cache file and each job deletes existing ones to force a clean start.

I am also not sure, for example, which YOLO version generated labels.cache3, or whether this file appears because multiple train jobs run at the same time, and I am not sure how this renaming can affect each job.

This can be solved by copying the data into different folders, so that each type of detector has its own data folder. But does this mean that each train job should have its own data folder? This quickly becomes impractical with the many different YOLO models that we want to test.

valentinitnelav commented 2 years ago

I just discovered that the labels.cache3 files in the train and val folders are produced during the training of PyTorch_YOLOv4. YOLOv7 & v5 produce labels.cache files. So, running v5 & v7 jobs in parallel on the exact same data path might be problematic, because both write a cache file with the same name.
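To check which cache files currently sit in a dataset folder before starting parallel jobs, something like the following could be used. This is only a minimal sketch: the function name find_label_caches is hypothetical, and it assumes the cache files are named labels.cache or labels.cache3 as observed above.

```python
from pathlib import Path


def find_label_caches(dataset_root):
    """Return all label cache files under dataset_root, sorted by path.

    Assumption: cache files are named labels.cache or labels.cache3,
    so the glob pattern "labels.cache*" matches both.
    """
    root = Path(dataset_root)
    return sorted(p for p in root.rglob("labels.cache*") if p.is_file())


# Example usage (the path is a placeholder):
# for cache_file in find_label_caches("/path/to/P1_Data_sampled"):
#     print(cache_file)
```

Running it before and after a train job shows which cache files that job created or replaced.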

valentinitnelav commented 2 years ago

I decided it is safer to have an extra copy of the dataset ready for training:

one for YOLOv7 & v4 (because they produce different cache files: labels.cache & labels.cache3 respectively);
another one for YOLOv5 train jobs

In this way, I hope I can run train jobs at the same time without one YOLO version affecting the cache files of the other.

This implies, however, having a data.yaml file for each YOLO version.
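The extra dataset copy could be made with a short script like the one below. It is a sketch under assumptions, not what was actually run on the cluster: the function name copy_dataset_without_caches is hypothetical, and skipping existing cache files during the copy is my own choice so that the new folder starts without caches from other jobs.

```python
import shutil
from pathlib import Path


def copy_dataset_without_caches(src, dst):
    """Copy the dataset tree from src to dst, leaving cache files behind.

    shutil.copytree raises FileExistsError if dst already exists,
    so an existing copy is never silently overwritten.
    """
    src, dst = Path(src), Path(dst)
    # ignore_patterns does glob-style matching on file and folder names;
    # the two patterns cover labels.cache and labels.cache3.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("*.cache", "*.cache3"))
    return dst


# Example usage (paths are placeholders):
# copy_dataset_without_caches("datasets/P1_Data_sampled",
#                             "datasets/P1_Data_sampled_yolov5")
```

A full copy costs disk space for every extra folder, which is the practical limit mentioned earlier in this thread.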

FYI: See #28 where I documented the --cache option behavior for YOLOv5

valentinitnelav commented 2 years ago

A first attempt to avoid this was to create the file /scripts/data_yolov5.yaml, which points to a copy of the data folder P1_Data_sampled named P1_Data_sampled_yolov5.

All the YOLOv5 jobs will use this yaml file for --data. YOLOv4 & v7 will use the existing one, scripts/config_yolov5.yaml, which should most probably be renamed to data_config.yaml for clarity.

The yaml file exists only locally on the cluster for now, and its content is:

# Path to sampled dataset
path: /home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled_yolov5
train: /home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled_yolov5/train/images
val: /home/sc.uni-leipzig.de/sv127qyji/datasets/P1_Data_sampled_yolov5/val/images

# Classes
nc: 8  # number of classes
names: [ 'Araneae', 'Coleoptera', 'Diptera', 'Hemiptera', 'Hymenoptera_Formicidae', 'Hymenoptera', 'Lepidoptera', 'Orthoptera' ]  # class names. Must be title case and the order must respect the order in P1_Data img_* folders.
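If more dataset copies are needed later, the yaml content above could be generated instead of edited by hand. The following is a rough sketch: render_data_yaml is a hypothetical helper, and it assumes every copy keeps the same train/images and val/images layout and POSIX-style paths, as on the cluster.

```python
from pathlib import PurePosixPath


def render_data_yaml(dataset_root, class_names):
    """Build the text of a data yaml file for one dataset copy.

    Assumption: the copy contains train/images and val/images,
    like P1_Data_sampled_yolov5.
    """
    root = PurePosixPath(dataset_root)
    # Keep the class order exactly as given, since class ids depend on it.
    names = ", ".join(f"'{name}'" for name in class_names)
    return (
        "# Path to sampled dataset\n"
        f"path: {root}\n"
        f"train: {root / 'train' / 'images'}\n"
        f"val: {root / 'val' / 'images'}\n"
        "\n"
        "# Classes\n"
        f"nc: {len(class_names)}  # number of classes\n"
        f"names: [ {names} ]  # class names\n"
    )


# Example usage (output file name is a placeholder):
# text = render_data_yaml("/path/to/P1_Data_sampled_yolov5", ["Araneae", "Coleoptera"])
# with open("data_yolov5.yaml", "w") as handle:
#     handle.write(text)
```

Deriving nc from the length of the names list keeps the two values from drifting apart when classes are added.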

stark-t / PAI

Multiple train jobs at the same time might affect each other's cached data #46