jkbdnj / jakub-dunaj-bachelors-thesis

Bachelor's thesis project on Plant Disease Classification for the Vienna University of Technology. Includes the LaTeX document and an accompanying web application.

Create a script dividing the initial dataset into training and testing subsets #5

Closed jkbdnj closed 1 week ago

jkbdnj commented 2 weeks ago

Task description

It is necessary to create a script that divides the initial dataset into training and testing subsets before the training process. The current initial dataset has the following structure:

initial_dataset/
+---color/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
+---segmented/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
\---artificial_background/
        +---Apple___Apple_scab
        +---Apple___Black_rot
        +---Apple___Cedar_apple_rust
        \---...

The target structure of the final dataset is:

final_dataset/
+---train/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
\---test/
        +---Apple___Apple_scab
        +---Apple___Black_rot
        +---Apple___Cedar_apple_rust
        \---...

With such a preprocessed dataset, there is no need to divide and reorder the data before every training run. Keras does offer efficient dataset loading functionality, but working with an already divided dataset is much simpler. The most important requirement is that the ratio of color/segmented/artificial_background images is maintained in every class of the training and testing subsets. A sketch of such a script is shown below.
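
A minimal sketch of such a split script (Python standard library only; the PATH_TO placeholders, the 80/20 ratio, the fixed seed, and the copy-based approach are assumptions, not part of the task description):

import random
import shutil
from pathlib import Path

# Assumed paths and split ratio; adjust to the actual environment.
INITIAL_DATASET = Path("PATH_TO/initial_dataset")
FINAL_DATASET = Path("PATH_TO/final_dataset")
VARIANTS = ["color", "segmented", "artificial_background"]
TEST_RATIO = 0.2

random.seed(1234)

for variant in VARIANTS:
    for class_dir in sorted((INITIAL_DATASET / variant).iterdir()):
        if not class_dir.is_dir():
            continue
        # Split every variant of every class separately, so the
        # color/segmented/artificial_background ratio is preserved
        # in both the training and the testing subset of each class.
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        random.shuffle(images)
        split_index = int(len(images) * TEST_RATIO)
        subsets = {"test": images[:split_index], "train": images[split_index:]}
        for subset_name, subset_images in subsets.items():
            target_dir = FINAL_DATASET / subset_name / class_dir.name
            target_dir.mkdir(parents=True, exist_ok=True)
            for image in subset_images:
                # Prefix the file name with the variant to avoid
                # name clashes between the three image variants.
                shutil.copy2(image, target_dir / f"{variant}_{image.name}")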

jkbdnj commented 2 weeks ago

NOTE: The way to do this directly with Keras, in the case of the training subset, would be:

import tensorflow as tf

# returns tf.data.Dataset objects
training_dataset_color = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/color",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

training_dataset_segmented = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/segmented",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

training_dataset_artificial_background = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/artificial_background",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

# concatenates the batches of the three variants into one training dataset
training_dataset = training_dataset_color.concatenate(training_dataset_segmented).concatenate(training_dataset_artificial_background)
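
The corresponding test subsets could be obtained analogously with subset="validation"; with the same seed and validation_split, Keras returns the complementary 20% of the files. A sketch for the color variant only (the PATH_TO placeholder is an assumption):

# returns the complementary 20% as the test subset
test_dataset_color = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/color",
    seed=1234,
    validation_split=0.2,
    subset="validation",
    batch_size=32)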