panfengfeng opened this issue 5 years ago
@selitvin Can you help to check this performance issue? It's very urgent for us.
Ok, will be able to look at this today.
@selitvin Thanks.
It's a bit hard to understand exactly what's going on without playing with the particular dataset. Here are some high level thoughts:
In your TF example the data path behind the two versions you are showing is quite different:
I think the cost of starting a tf session run, unbatching, shuffling and map_and_batch (which may schedule the work on additional threads) could be significant, so the python and tf code are not likely to have the same performance.
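To make that concrete, here is a rough, hypothetical sketch of what such a tf.data path can look like; the actual pipeline from the example is not reproduced here, and the dataset URL, shuffle buffer, batch size, and the identity map function are placeholders (exact API names depend on the TF version):

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

dataset_url = 'file:///path/to/parquet'  # placeholder

with make_batch_reader(dataset_url) as reader:
    ds = (make_petastorm_dataset(reader)                    # one element per parquet row group
          .apply(tf.data.experimental.unbatch())            # split row-group batches into single rows
          .shuffle(1000)                                    # placeholder shuffle buffer
          .apply(tf.data.experimental.map_and_batch(        # runs the map on extra threads, then rebatches
              lambda *row: row, batch_size=128, num_parallel_calls=4)))
```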
As for comparing with TFRecord performance: if you always access all fields in your training dataset, it is expected to be faster since it uses an all-C++ implementation underneath. I think the value of petastorm is that you can stream subsets of fields (parquet columns) without incurring the cost of loading/parsing the entire record. Unfortunately, there is some performance cost attached to it.
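To illustrate the column-subset point, a minimal sketch, assuming a hypothetical dataset URL and hypothetical field names ('image', 'label'):

```python
from petastorm import make_batch_reader

# Only the requested parquet columns are loaded and parsed; the rest are skipped.
with make_batch_reader('hdfs://namenode/path/to/dataset',    # hypothetical URL
                       schema_fields=['image', 'label']) as reader:
    for batch in reader:
        pass  # each batch exposes only the selected columns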
Tweaking some make_batch_reader arguments might have an effect on the performance (a short usage sketch follows the list):
- workers_count
- hdfs_driver='libhdfs': it may be more efficient in some scenarios.
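A minimal sketch of passing those arguments (the URL and values are illustrative, not recommendations):

```python
from petastorm import make_batch_reader

with make_batch_reader('hdfs://namenode/path/to/dataset',  # hypothetical URL
                       workers_count=16,                   # number of loader workers; tune per machine
                       hdfs_driver='libhdfs') as reader:
    for batch in reader:
        pass
```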
@selitvin So there are some copy and transformation overheads with TF?
Have you compared parquet with make_petastorm_dataset against tfrecord on big data? We tested petastorm and TFRecords to construct a TensorFlow dataset for training. TFRecords is 2-3x faster than petastorm on the ImageNet dataset (about 145GB, 1,280,000 images). Did you do a similar test? @selitvin
No, I did not compare make_petastorm_dataset with tf_records head-to-head on large image data. I can try doing that. I assume you keep the images compressed in the dataset (~115KB per image on average) and decode them using TF facilities, similar to the TF approach. I'll add this benchmark and try to see what the bottlenecks are. It's an interesting experiment to run. Indeed, a 2-3x difference seems a bit too high IMO. A quick question: how does the 2-3x relate to your previous comment:
> dataset takes 58s, 57s, tfrecords takes 41s, 19s
where it is ~40% more (which would be more in line with what I would've expected)?
@selitvin We resized the raw ImageNet dataset; after resizing, the dataset size is 24GB. We also tested the raw ImageNet; the results are as follows:
parquet dataset API takes 202s, 199s
tfrecord dataset API takes 193s, 88s
@selitvin So how do you use petastorm in your production environment? Do you use the dataset API or tensors for TensorFlow?
The 2-3x (speedup) is for the second epoch or later: petastorm 57s vs TFRecord 19s.
The setup we are most experienced with is: datasets generated with petastorm (materialize_dataset). We do use petastorm codecs for images and do not decode them manually in TF.
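For reference, a minimal sketch of that kind of setup, modeled on petastorm's hello-world example; the schema fields, image shape, row count, and output path are all assumptions, not the actual production schema:

```python
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType
from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical schema: images are stored compressed (jpeg) via a petastorm codec.
ImageSchema = Unischema('ImageSchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('image', np.uint8, (224, 224, 3), CompressedImageCodec('jpeg'), False),
])

def row_generator(i):
    # Placeholder data; a real pipeline would read and resize actual images.
    return {'id': i, 'image': np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)}

spark = SparkSession.builder.master('local[2]').getOrCreate()
sc = spark.sparkContext
output_url = 'file:///tmp/example_petastorm_dataset'  # placeholder

# materialize_dataset adds the petastorm metadata needed by the readers.
with materialize_dataset(spark, output_url, ImageSchema, 256):
    rows = sc.parallelize(range(100)) \
             .map(row_generator) \
             .map(lambda d: dict_to_spark_row(ImageSchema, d))
    spark.createDataFrame(rows, ImageSchema.as_spark_schema()) \
         .write.mode('overwrite').parquet(output_url)
```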
24GB dataset (1.28 million images), stored as 31 parquet files of about 800MB each. I have a 56-core CPU machine with 180GB RAM. I use make_petastorm_dataset to scan the dataset twice (2 epochs); the sample code is as follows:
```python
with make_batch_reader(dataset_url, num_epochs=FLAGS.num_epochs) as train_reader:
    ...
```
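The snippet above is truncated in the comment; a hedged guess at how the scan loop might continue, assuming the reader is wrapped with make_petastorm_dataset as described (the URL, epoch count, and session handling are placeholders):

```python
import tensorflow as tf
from petastorm import make_batch_reader
from petastorm.tf_utils import make_petastorm_dataset

dataset_url = 'file:///path/to/parquet'  # placeholder

# 2 epochs, as in the original comment (FLAGS.num_epochs there).
with make_batch_reader(dataset_url, num_epochs=2) as train_reader:
    dataset = make_petastorm_dataset(train_reader)
    iterator = tf.compat.v1.data.make_one_shot_iterator(dataset)
    next_batch = iterator.get_next()
    with tf.compat.v1.Session() as sess:
        try:
            while True:
                sess.run(next_batch)  # just scan the data, no training step
        except tf.errors.OutOfRangeError:
            pass
```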
Also, I just use a for-loop to scan the dataset as follows:
```python
with make_batch_reader(dataset_url, num_epochs=FLAGS.num_epochs) as train_reader:
    ...
```
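For comparison, a minimal runnable sketch of that plain for-loop scan (assuming the same make_batch_reader as in the previous snippet; the URL and epoch count are placeholders):

```python
from petastorm import make_batch_reader

dataset_url = 'file:///path/to/parquet'  # placeholder

# Iterate the reader directly in Python: no tf.data pipeline or TF session involved.
with make_batch_reader(dataset_url, num_epochs=2) as train_reader:
    for batch in train_reader:  # each item is a namedtuple of column arrays (one per row group)
        pass
```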
- for-loop: 36s, 20s
- dataset: 58s, 57s
- tfrecords: 41s, 19s

(times are for the first and second epoch, respectively)
You can see that the second epoch of the dataset takes 57s, longer than the for-loop and tfrecords, so there is some overhead when using make_petastorm_dataset.