jkbdnj / jakub-dunaj-bachelors-thesis

Bachelor's thesis project on Plant Disease Classification for the Vienna University of Technology. Includes the LaTeX document and an accompanying web application.

Create a script dividing the initial dataset into training and testing subsets #5

Closed jkbdnj closed 1 week ago

jkbdnj commented 2 weeks ago

Task description

It is necessary to create a script that divides the initial dataset into training and testing subsets before the training process. The current initial dataset has the following structure:

initial_dataset/
+---color/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
+---segmented/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
\---artificial_background/
        +---Apple___Apple_scab
        +---Apple___Black_rot
        +---Apple___Cedar_apple_rust
        \---...

The target structure of the final dataset is:

final_dataset/
+---train/
|       +---Apple___Apple_scab
|       +---Apple___Black_rot
|       +---Apple___Cedar_apple_rust
|       \---...
\---test/
        +---Apple___Apple_scab
        +---Apple___Black_rot
        +---Apple___Cedar_apple_rust
        \---...

With such a preprocessed dataset, there is no need to divide and reorder the data before every training run. Keras does offer efficient dataset loading functionality, but working with an already divided dataset is much simpler. The most important requirement is that the ratio of color/segmented/artificial_background images is maintained in every class of the training and testing subsets. A sketch of such a script is shown below.
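
A minimal sketch of such a split script (Python standard library only; the PATH_TO placeholders, the 80/20 ratio, the fixed seed, and the copy-based approach are assumptions, not part of the task description):

import random
import shutil
from pathlib import Path

# Assumed paths and split ratio; adjust to the actual environment.
INITIAL_DATASET = Path("PATH_TO/initial_dataset")
FINAL_DATASET = Path("PATH_TO/final_dataset")
VARIANTS = ["color", "segmented", "artificial_background"]
TEST_RATIO = 0.2

random.seed(1234)

for variant in VARIANTS:
    for class_dir in sorted((INITIAL_DATASET / variant).iterdir()):
        if not class_dir.is_dir():
            continue
        # Split every variant of every class separately, so the
        # color/segmented/artificial_background ratio is preserved
        # in both the training and the testing subset of each class.
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        random.shuffle(images)
        split_index = int(len(images) * TEST_RATIO)
        subsets = {"test": images[:split_index], "train": images[split_index:]}
        for subset_name, subset_images in subsets.items():
            target_dir = FINAL_DATASET / subset_name / class_dir.name
            target_dir.mkdir(parents=True, exist_ok=True)
            for image in subset_images:
                # Prefix the file name with the variant to avoid
                # name clashes between the three image variants.
                shutil.copy2(image, target_dir / f"{variant}_{image.name}")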

jkbdnj commented 2 weeks ago

NOTE: The way to do this directly with Keras, in the case of the training subset, would be:

import tensorflow as tf

# returns tf.data.Dataset objects
training_dataset_color = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/color",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

training_dataset_segmented = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/segmented",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

training_dataset_artificial_background = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/artificial_background",
    seed=1234,
    validation_split=0.2,
    subset="training",
    batch_size=32)

# concatenates the batches of the three variants into one training dataset
training_dataset = training_dataset_color.concatenate(training_dataset_segmented).concatenate(training_dataset_artificial_background)
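
The corresponding test subsets could be obtained analogously with subset="validation"; with the same seed and validation_split, Keras returns the complementary 20% of the files. A sketch for the color variant only (the PATH_TO placeholder is an assumption):

# returns the complementary 20% as the test subset
test_dataset_color = tf.keras.utils.image_dataset_from_directory(
    "PATH_TO/initial_dataset/color",
    seed=1234,
    validation_split=0.2,
    subset="validation",
    batch_size=32)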