rses-dl-course / rses-dl-course.github.io


Reassigning tf.dataset commands to a new argument name #17

Closed EdwinB12 closed 1 year ago

EdwinB12 commented 1 year ago

Example is Lab 02 - Preprocess the data

def normalize(images, labels):
  images = tf.cast(images, tf.float32)
  images /= 255
  return images, labels

# The map function applies the normalize function to each element in the train
# and test datasets
train_dataset =  train_dataset.map(normalize)
test_dataset  =  test_dataset.map(normalize)

# The first time you use the dataset, the images will be loaded from disk
# Caching will keep them in memory, making training faster
train_dataset =  train_dataset.cache()
test_dataset  =  test_dataset.cache()

Rerunning this cell would throw no exceptions further down the line, but it would significantly degrade performance as a result of double normalisation (the images get divided by 255 twice).
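The failure mode is easy to demonstrate without TensorFlow — a minimal plain-Python sketch, where a list stands in for the `tf.data.Dataset` and `normalize` is the same idea as above:

```python
def normalize(x):
    return x / 255

dataset = [255.0, 510.0]

# First execution of the cell: values scaled as intended
dataset = [normalize(x) for x in dataset]
# Rerunning the same cell: values are normalised a second time
dataset = [normalize(x) for x in dataset]

print(dataset)  # values are now 255x smaller than intended
```

Because the cell rebinds `dataset` to the result of mapping over itself, each rerun compounds the transformation.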

EdwinB12 commented 1 year ago

I see three (similar) options:

  1. Create a new variable name each time:
# The map function applies the normalize function to each element in the train
# and test datasets
train_dataset_norm =  train_dataset.map(normalize)
test_dataset_norm  =  test_dataset.map(normalize)

# The first time you use the dataset, the images will be loaded from disk
# Caching will keep them in memory, making training faster
train_dataset_cache =  train_dataset_norm.cache()
test_dataset_cache  =  test_dataset_norm.cache()
  2. Wrap in a function to reduce the number of variables being created:

def dataset_pipeline(train_ds, test_ds):
    # The map function applies the normalize function to each element in the
    # train and test datasets
    train_ds = train_ds.map(normalize)
    test_ds = test_ds.map(normalize)

    # The first time you use the dataset, the images will be loaded from disk
    # Caching will keep them in memory, making training faster
    train_ds = train_ds.cache()
    test_ds = test_ds.cache()

    return train_ds, test_ds

train_dataset_processed, test_dataset_processed = dataset_pipeline(train_dataset, test_dataset)
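The function version is rerun-safe for a subtle reason worth spelling out: the reassignment happens to local parameters, so the caller's variables are never touched. A plain-Python sketch (a list stands in for the dataset; the helper names are illustrative only):

```python
def normalize(x):
    return x / 255

def dataset_pipeline(ds):
    # `ds` is a local name; rebinding it does not affect the caller's variable
    ds = [normalize(x) for x in ds]
    return ds

raw = [255.0, 127.5]
processed = dataset_pipeline(raw)
processed = dataset_pipeline(raw)  # simulated cell rerun: same result

print(raw)        # unchanged: [255.0, 127.5]
print(processed)  # [1.0, 0.5]
```

As long as the output name differs from the input name (`train_dataset_processed` vs `train_dataset`), the cell can be rerun any number of times.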


3. Chaining

# The map function applies the normalize function to each element in the train
# and test datasets.
# The first time you use the dataset, the images will be loaded from disk.
# Caching will keep them in memory, making training faster.
train_dataset_processed = train_dataset.map(normalize).cache()
test_dataset_processed = test_dataset.map(normalize).cache()
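Chaining has the same idempotency property, because the raw dataset variable is never rebound — each rerun starts from the untouched original. A plain-Python sketch of the rerun behaviour (list stand-in, illustrative names):

```python
def normalize(x):
    return x / 255

raw = [255.0]

# Map and "cache" in one chained expression; `raw` is never reassigned,
# so rerunning this cell always produces the same result
processed = [normalize(x) for x in raw]
processed = [normalize(x) for x in raw]  # simulated rerun

print(processed)  # [1.0] no matter how many times the cell runs
```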



I think chaining is the most elegant solution, but might it be slightly unclear to some beginners?