I see three (similar) options:
1. Separate statements

```python
# The map function applies the normalize function to each element
# in the train and test datasets.
train_dataset_norm = train_dataset.map(normalize)
test_dataset_norm = test_dataset.map(normalize)

# The first time you use the dataset, the images will be loaded from disk.
# Caching will keep them in memory, making training faster.
train_dataset_cache = train_dataset_norm.cache()
test_dataset_cache = test_dataset_norm.cache()
```
2. A pipeline function

```python
def dataset_pipeline(train_ds, test_ds):
    # Normalize, then cache, both datasets.
    train_ds = train_ds.map(normalize)
    test_ds = test_ds.map(normalize)
    train_ds = train_ds.cache()
    test_ds = test_ds.cache()
    return train_ds, test_ds

train_dataset_processed, test_dataset_processed = dataset_pipeline(train_dataset, test_dataset)
```
3. Chaining

```python
train_dataset_processed = train_dataset.map(normalize).cache()
test_dataset_processed = test_dataset.map(normalize).cache()
```
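For context, all three options assume the `normalize` function defined earlier in the lab. Its exact body isn't shown here, but it is presumably the usual rescaling of uint8 pixels to [0, 1]; a minimal sketch under that assumption:

```python
import tensorflow as tf

def normalize(image, label):
    # Assumed implementation: cast uint8 pixels to float32 in [0, 1].
    return tf.cast(image, tf.float32) / 255.0, label
```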
I think chaining is the most elegant solution, but might it be slightly unclear to some beginners?
The example in question is Lab 02 - Preprocess the data. Rerunning that cell would throw no exceptions further down the line, but it would significantly damage performance as a result of double normalisation: each rerun maps `normalize` over the already-normalized data.
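A minimal sketch of that failure mode, assuming the `normalize` stand-in above and the kind of in-place reassignment that makes re-running a cell unsafe:

```python
import tensorflow as tf

def normalize(image, label):
    # Assumed implementation, as sketched above.
    return tf.cast(image, tf.float32) / 255.0, label

# Tiny stand-in dataset: a single all-white 28x28 image.
images = tf.constant(255, dtype=tf.uint8, shape=[1, 28, 28, 1])
labels = tf.constant(0, dtype=tf.int64, shape=[1])
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# In-place reassignment is what makes re-running the cell unsafe:
dataset = dataset.map(normalize)  # first run: max pixel value is 1.0
dataset = dataset.map(normalize)  # simulated rerun: max is now 1/255

for image, _ in dataset.take(1):
    print(float(tf.reduce_max(image)))  # ~0.0039 instead of 1.0
```

All three options above sidestep this by writing the result to a new name, so re-running the cell recomputes from the untouched `train_dataset` rather than from already-normalized data.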