uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Performance comparison of make_reader() & make_petastorm_dataset() vs make_spark_converter() & make_tf_dataset() #644

Open lndkcg opened 3 years ago

lndkcg commented 3 years ago

From the API user guide, it seems that there are two different ways of using Petastorm to train TensorFlow models:

  1. Using make_reader() or make_batch_reader() and then using make_petastorm_dataset() to create a tf.data.Dataset
  2. Using make_spark_converter() to materialize the dataset and then using converter.make_tf_dataset() to create a tf.data.Dataset

All things being equal, which of these would be expected to have faster performance? I know that option 1 reads from a file path while option 2 starts with a Spark DataFrame. Option 2 seems simpler, but is there a performance cost associated with it?
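
For reference, here is roughly what I mean by the two call patterns (a minimal sketch with placeholder paths and a placeholder DataFrame, not my actual code):

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Option 1: open an existing Parquet store directly
with make_batch_reader('file:///path/to/parquet') as reader:
  dataset_1 = make_petastorm_dataset(reader)
  # ... build and fit the model on dataset_1 ...

# Option 2: materialize a Spark DataFrame to a cache dir, then read that copy
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///tmp/petastorm_cache')
converter = make_spark_converter(spark_dataframe)
with converter.make_tf_dataset(batch_size=32) as dataset_2:
  pass  # ... build and fit the model on dataset_2 ...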

Thanks

selitvin commented 3 years ago

Sorry for the delayed response. These options are all different:

You might find additional information here: https://petastorm.readthedocs.io/en/latest

lndkcg commented 3 years ago

No problem, I appreciate you taking the time to answer.

I understand those differences, and was wondering whether there would be any performance difference between those methods in terms of reading alone. Judging by your last line, it appears that there is not, i.e. other than initialization time, the methods are equivalent in terms of speed.

I am asking because I am trying to use Petastorm via the make_spark_converter -> make_tf_dataset() paradigm, and I am seeing extremely slow training speeds (50x+ slower than when training on a local tf.data dataset) and was hoping to learn how to improve that performance.

I can post my code and questions in a new issue and we can close this one (unless I should simply post them here).

Thank you!

selitvin commented 3 years ago

make_spark_converter -> make_tf_dataset uses make_batch_reader + make_petastorm_dataset underneath (to read from the temporary Parquet store it creates).

Can you please provide more information on the slowdown?

It would be best if you could distill a small example I could actually run and profile. It might be hard to see the issue just from the code as it's likely about the combination of the code and the data structure underneath.
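
Something self-contained along these lines would be ideal, a tiny synthetic DataFrame pushed through the same converter path (this is only a sketch: the column names, sizes and cache path are placeholders, not anything taken from your setup):

import numpy as np
import pandas as pd
import tensorflow as tf
from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.master('local[2]').getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:///tmp/petastorm_cache')

# Tiny synthetic frame: two numeric features and a numeric target
pdf = pd.DataFrame({'x1': np.random.rand(10000),
                    'x2': np.random.rand(10000),
                    'Sales': np.random.rand(10000)})
converter = make_spark_converter(spark.createDataFrame(pdf))

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation='relu'),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mae')

with converter.make_tf_dataset(batch_size=512) as dataset:
  # Each element is a namedtuple of batched columns; restructure it for Keras
  dataset = dataset.map(lambda x: (tf.stack([tf.cast(x.x1, tf.float32),
                                             tf.cast(x.x2, tf.float32)], axis=1),
                                   x.Sales))
  model.fit(dataset, steps_per_epoch=len(converter) // 512, epochs=1, verbose=2)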

lndkcg commented 3 years ago

Below is a simplified version of my current approach, training from a local tf.data dataset and then using Petastorm. Hopefully it is helpful. I can also try running it with a simpler model architecture (using Spark to create a single feature column and simply training on that).

What would be the best way to provide a workable example? I could provide a sample of the data as a CSV (I would have to do some masking in order to share it).

Appreciate any help and insight.

import numpy as np
import tensorflow as tf
import pandas as pd
import pyspark.sql.functions as f
from petastorm.spark import SparkDatasetConverter, make_spark_converter

data = spark.read.format('delta').load("<path to data>")

rel_months = ['2020-01-01', '2020-02-01'] #One month for train, one for validation (would normally have much more for train)
rel_data = data.where(f.col('Date').isin(rel_months))

max_date = max(rel_months)
train = rel_data.where(f.col('Date') != max_date)
val = rel_data.where(f.col('Date') == max_date)

numeric_features = ['<list of numeric features>']
cat_features = ['<list of categorical string features>']
time_features = ['<list of time (month, quarter, etc.) features>']

MAX_EPOCHS = 3
dense_layer_sizes = [1024, 512, 256]

def get_category_encoding_layer(name):
  index = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary = unique_value_dictionary[name], name = name+'_lookup')
  #Unique Value Dictionary is an already created dictionary that has unique values for each string column in the dataset
  encoder = tf.keras.layers.experimental.preprocessing.CategoryEncoding(max_tokens=index.vocab_size(), name = name+'_enc')
  return lambda feature: encoder(index(feature))

def CreateModel():
  all_inputs=[]
  encoded_features = []

  for header in numeric_features:
    numeric_col = tf.keras.Input(shape=(1,), name = header)
    all_inputs.append(numeric_col)
    encoded_features.append(numeric_col)

  for header in cat_features + time_features:
    if len(unique_value_dictionary[header]) < 2:
      print('Skipping ' + header + ' because not enough unique values')
      continue
    cat_col = tf.keras.Input(shape=(1,), name=header, dtype = 'string')
    enc_layer = get_category_encoding_layer(header)
    encoded = enc_layer(cat_col)
    all_inputs.append(cat_col)
    encoded_features.append(encoded)

  # Model Building

  x = tf.concat(encoded_features, axis = 1, name = 'concat_inputs')
  for i, layer_size in enumerate(dense_layer_sizes):
    x = tf.keras.layers.Dense(layer_size, activation='relu', name = 'dense_'+str(i))(x)
    if i != len(dense_layer_sizes) - 1:  # no dropout after the final dense layer
      x = tf.keras.layers.Dropout(.25, name = 'dropout_'+str(i))(x)
  output = tf.keras.layers.Dense(1, name = 'output_layer')(x)

  model = tf.keras.Model(all_inputs, output)

  return model

purchaser = CreateModel()
optimizer = tf.keras.optimizers.Adam(learning_rate = .0001)
purchaser.compile(optimizer=optimizer, loss = tf.keras.losses.MeanAbsoluteError())

#### Training locally (pandas)

def df_to_dataset(spark_dataframe, shuffle = True, batch_size = 32, target_col = 'Sales'):

  ds = tf.data.Dataset.from_tensor_slices(
    ({feature: spark_dataframe.select(feature).toPandas()[feature].to_numpy() for feature in spark_dataframe.columns if feature!=target_col}, 
    spark_dataframe.select(target_col).toPandas()[target_col].to_numpy())
  )
  if shuffle:
    ds = ds.shuffle(buffer_size = spark_dataframe.count())
  ds = ds.batch(batch_size)
  ds = ds.cache()
  ds = ds.prefetch(batch_size)
  return ds

train_batch_size = 512
val_batch_size = train_batch_size * 2

train_ds = df_to_dataset(train, batch_size = train_batch_size)
val_ds = df_to_dataset(val, False, val_batch_size)

checkpoint_path = '<path for model checkpoints>'
tensorboard_path = '<path for tensorboard logs>'

checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_weights_only = True, monitor = 'val_loss', mode = 'min', save_best_only = True)
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=tensorboard_path, update_freq='epoch')
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(patience = 5, factor = .5)
early_stop = tf.keras.callbacks.EarlyStopping(patience = 50)

my_callbacks = [checkpoint, tensorboard_callback, reduce_lr, early_stop]

train_history = purchaser.fit(train_ds,
             epochs = MAX_EPOCHS, 
             validation_data = val_ds, validation_steps = np.max([1, len(val_ds) // val_batch_size]), 
             callbacks = my_callbacks,
             verbose = 2)

#### Training with Petastorm

spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, "file:///dbfs/ml/petastorm/cache")
converter_train = make_spark_converter(train, dtype = None)
converter_val = make_spark_converter(val, dtype = None)

purchaser = CreateModel()
optimizer = tf.keras.optimizers.Adam(learning_rate = .0001)
purchaser.compile(optimizer=optimizer, loss = tf.keras.losses.MeanAbsoluteError())

with converter_train.make_tf_dataset(batch_size = train_batch_size) as train_dataset, \
   converter_val.make_tf_dataset(batch_size = val_batch_size) as val_dataset:

  train_dataset = train_dataset.map(lambda x:
                                    ({train.columns[i]: x[i] for i in range(len(train.columns))},
                                    x.Sales))

  val_dataset = val_dataset.map(lambda x:
                                    ({val.columns[i]: x[i] for i in range(len(val.columns))},
                                    x.Sales))

  steps_per_epoch = len(converter_train) // train_batch_size
  val_steps = np.max([1, len(converter_val) // val_batch_size])

  purchaser.fit(train_dataset,
                steps_per_epoch = steps_per_epoch,
                epochs = MAX_EPOCHS,
                validation_data = val_dataset, 
                validation_steps = val_steps,
                callbacks = my_callbacks,
                verbose = 2)

lndkcg commented 3 years ago

Using the above code I get the following benchmark speeds:

So the 50x was definitely an exaggeration! Apologies for that, I swear I previously had results that were ~120 seconds and ~4,800 seconds per epoch for the two methods.

I know that petastorm has some overhead that will slow it down compared to in-memory training. Does this ~3x ratio seem right to you in terms of speed?

Any ideas on how else I could optimize the process and make training time faster? (I normally use multi-gpu training, and not sure how much more I can increase the batch size). All advice or recommendations for readings would be appreciated.

selitvin commented 3 years ago

Hi. Thank you for the detailed example (although, because of its length, it took me a while to gather up the courage to start reading it, hence the long response time :) ).

I know that petastorm has some overhead that will slow it down compared to in-memory training. Does this ~3x ratio seem right to you in terms of speed?

Obviously, it's hard to compete with reading data from memory. However, the competition here is with your model's forward/backward propagation time. As long as we can achieve a higher data loading rate than your model's processing rate, we are in a good spot (so it depends on the model). It's important to streamline the data flow and not have any heavy Python code run on your batches. That's why I suspect the per-sample map call in your code.

Looking forward to hearing from you. Hope we can nail this problem down.
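
One quick way to check which side is the bottleneck is to time the input pipeline on its own, with no model in the loop. A rough sketch, reusing converter_train and train_batch_size from your snippet (names taken from your code, so adjust as needed); compare the per-batch time it prints against the per-step time Keras reports with verbose=2:

import time

with converter_train.make_tf_dataset(batch_size = train_batch_size) as dataset:
  n_batches = len(converter_train) // train_batch_size
  start = time.perf_counter()
  for i, batch in enumerate(dataset):
    if i >= n_batches:
      break
  elapsed = time.perf_counter() - start
  print('%.1f rows/sec, %.3f sec/batch' % (n_batches * train_batch_size / elapsed,
                                           elapsed / n_batches))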

lndkcg commented 3 years ago

Apologies for the long example, I appreciate you taking the time to read it.

To try and estimate the impact of the map transform I've run the following code:

with converter_train.make_tf_dataset(batch_size = train_batch_size) as train_dataset:
  train_dataset = train_dataset.map(lambda x: ({train.columns[i]: x[i] for i in range(len(train.columns))}, x.Sales))
  i = 0
  for elem in train_dataset:
    i+=1
    if i > len(converter_train) // train_batch_size:
      break

I ran it both with the map statement and without it; in both cases it took the same amount of time. Does this seem like a valid way of determining the map's impact?

Do you happen to know of any examples of using Petastorm for structured data with TensorFlow that I could take a look at? Currently I'm basing everything off of this Databricks example (as I'm using the Databricks platform).

In the meantime I am going to try and create a simpler model pipeline, to hopefully make the problem easier to pin down.

Thanks, wishing I could give you some GitHub karma.

selitvin commented 3 years ago

No problem at all...

Can you please take a look at the Horovod example: https://github.com/horovod/horovod/blob/master/examples/spark/keras/keras_spark_rossmann_run.py? I know they were polishing their training pipeline performance and have a good batch-based implementation. Perhaps it will give you some clues.