uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.
Apache License 2.0

Regarding performance of make_petastorm_dataset #387

Open panfengfeng opened 5 years ago

panfengfeng commented 5 years ago

A 24 GB dataset (1.28 million images), stored as 31 Parquet files of roughly 800 MB each. I have a 56-core CPU machine with 180 GB of RAM. I use `make_petastorm_dataset` to scan the dataset twice (2 epochs); the sample code is as follows:

```python
import time

import tensorflow as tf

from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

# FLAGS (num_epochs, batch_size, record_num), dataset_url, and parse_record
# are defined elsewhere.
with make_batch_reader(dataset_url, num_epochs=FLAGS.num_epochs) as train_reader:
    train_dataset = make_petastorm_dataset(train_reader).apply(tf.data.experimental.unbatch())
    train_dataset = train_dataset.shuffle(buffer_size=3 * FLAGS.batch_size)
    train_dataset = train_dataset.apply(tf.data.experimental.map_and_batch(parse_record,
                                                                           num_parallel_calls=16,
                                                                           batch_size=FLAGS.batch_size))

    train_iterator = train_dataset.make_one_shot_iterator()
    train_tensor = train_iterator.get_next()

    i = 0
    with tf.Session() as sess:
        sess.run([
            tf.local_variables_initializer(),
            tf.global_variables_initializer(),
        ])

        start = time.time()
        try:
            while True:
                cur_image, cur_label = sess.run(train_tensor)
                for image in cur_image:
                    i += 1
                    if i % FLAGS.record_num == 0:
                        end = time.time()
                        print("time is " + str(end - start))
                        start = end
        except tf.errors.OutOfRangeError:
            print("Finish! the number is " + str(i))
```

I also scanned the dataset with a plain for-loop over the reader, as follows:

```python
with make_batch_reader(dataset_url, num_epochs=FLAGS.num_epochs) as train_reader:
    i = 0
    start = time.time()

    # Each item yielded by the reader is a batch of rows; iterate over the
    # images in each batch.
    for schema_view in train_reader:
        cur_image = schema_view.image
        for image in cur_image:
            i += 1
            if i % FLAGS.record_num == 0:
                end = time.time()
                print("time is " + str(end - start))
                start = end

    print("Finish! the number is " + str(i))
```

The results for two epochs:

| Approach | Epoch 1 | Epoch 2 |
| --- | --- | --- |
| for-loop over the reader | 36 s | 20 s |
| `make_petastorm_dataset` | 58 s | 57 s |
| TFRecords | 41 s | 19 s |

You can see that the second epoch of the dataset version takes 57 s, longer than both the for-loop and TFRecords, so there is some overhead when using `make_petastorm_dataset`.

xubo245 commented 5 years ago

@selitvin Can you help check this performance issue? It's very urgent for us.

selitvin commented 5 years ago

Ok, will be able to look at this today.

xubo245 commented 5 years ago

@selitvin Thanks.

selitvin commented 5 years ago

It's a bit hard to understand exactly what's going on without playing with the particular dataset. Here are some high level thoughts:

In your TF example, the data paths behind the two versions you are showing are quite different:

I think the cost of starting a TF session run, unbatching, shuffling, and map_and_batch (which, it seems, could be scheduling the work on additional threads) could be significant, so the pure-Python and tf.data code paths are not likely to have the same performance.
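One way to gauge how much of the gap comes from those tf.data stages (a diagnostic sketch only, not something measured in this thread; it reuses the imports and `dataset_url` from the snippets above) is to time the petastorm-backed dataset with the unbatch/shuffle/map_and_batch stages removed, so only the reader-to-tf.data path is measured:

```python
# Hypothetical diagnostic: iterate the petastorm-backed dataset with no
# unbatch/shuffle/map_and_batch, timing just the reader -> tf.data path.
with make_batch_reader(dataset_url, num_epochs=1) as reader:
    dataset = make_petastorm_dataset(reader)
    batch = dataset.make_one_shot_iterator().get_next()

    with tf.Session() as sess:
        start = time.time()
        try:
            while True:
                sess.run(batch)
        except tf.errors.OutOfRangeError:
            pass
        print("reader -> tf.data time is " + str(time.time() - start))
```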

As for comparing with TFRecord performance: if you always access all fields in your training dataset, it is expected to be faster, since it uses an all-C++ implementation underneath. I think the value of petastorm is that you can stream subsets of fields (Parquet columns) without incurring the cost of loading/parsing the entire record. Unfortunately, there is some performance cost attached to it.
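For example (a sketch reusing the imports and `dataset_url` from the snippets above; it assumes the Parquet schema has `image` and `label` columns, so adjust the names to your dataset), streaming only the columns you need looks like this with the `schema_fields` argument:

```python
# Sketch: read only a subset of Parquet columns instead of full records.
with make_batch_reader(dataset_url,
                       schema_fields=['image', 'label'],
                       num_epochs=FLAGS.num_epochs) as train_reader:
    train_dataset = make_petastorm_dataset(train_reader)
```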

Tweaking some of the make_batch_reader arguments might also have an effect on performance; see the sketch below.
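For instance, the reader pool settings (`reader_pool_type` and `workers_count` are real make_batch_reader arguments, but the values below are illustrative guesses rather than recommendations from this thread):

```python
# Illustrative values only; tune the worker pool for your machine and dataset.
with make_batch_reader(dataset_url,
                       num_epochs=FLAGS.num_epochs,
                       reader_pool_type='process',  # or the default 'thread'
                       workers_count=16) as train_reader:
    train_dataset = make_petastorm_dataset(train_reader)
```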

panfengfeng commented 5 years ago

@selitvin So there are some copy and transformation overheads with TF?

xubo245 commented 5 years ago

Have you compared Parquet with make_petastorm_dataset against TFRecords on big data? We tested petastorm and TFRecords for constructing a TensorFlow dataset for training. TFRecords is 2-3X faster than petastorm on the ImageNet dataset (about 145 GB, 1,280,000 images). Did you run a similar test? @selitvin

selitvin commented 5 years ago

No, I did not compare make_petastorm_dataset with TFRecords head-to-head on large image data. I can try doing that. I assume you keep the images compressed in the dataset (~115 KB per image on average) and decode them using TF facilities, similar to the TFRecords approach. I'll add this benchmark and try to see where the bottlenecks are. It's an interesting experiment to run. Indeed, a 2-3x difference seems a bit too high IMO. A quick question: how does the 2-3x relate to your previous comment,

> dataset takes 58s, 57s, tfrecords takes 41s, 19s

where it is about 40% more (which would be more in line with what I would have expected)?
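For reference, a hypothetical sketch of the kind of `parse_record` this assumes, with images stored as JPEG-encoded bytes in the Parquet column and decoded with TF ops (the actual `parse_record` used in the code above is not shown in this thread):

```python
def parse_record(record):
    # Hypothetical: assumes 'image' holds JPEG-encoded bytes and 'label' an integer.
    image = tf.image.decode_jpeg(record.image, channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, record.label
```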

panfengfeng commented 5 years ago

@selitvin We resized the raw ImageNet dataset; after resizing, the dataset size is 24 GB. We also tested the raw ImageNet; the results are as follows:

| Approach | Epoch 1 | Epoch 2 |
| --- | --- | --- |
| Parquet dataset API | 202 s | 199 s |
| TFRecord dataset API | 193 s | 88 s |

xubo245 commented 5 years ago

@selitvin So how do you use petastorm in your production environment? Via the dataset API or tensors for TensorFlow?

The 2-3X speedup is for the second epoch or later: petastorm 57 s vs. TFRecord 19 s.

selitvin commented 5 years ago

The setup we are most experienced with is: