filipski opened this issue 4 years ago
These are interesting results - thank you for doing the benchmark!
There is a `numpy()` call that happens twice. Does it result in a memory copy? Also, a curve that saturates as a function of the number of workers means there is some synchronization between threads going on. The immediate suspect is the GIL, IMO. I would be curious to see a benchmark that uses a TF graph that is not Python-interpreter heavy. It would also be interesting to see whether the `tf_tensors` API performs better. I'd construct a graph that outputs a scalar, to make sure we are not measuring the time of tf.tensor->numpy memory copies.
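Something along these lines, for example (a rough sketch using petastorm's `tf_tensors` in TF 1.x graph mode / `tf.compat.v1`; the reduction op and the exception handling are only illustrative):

```python
# Sketch only: the graph output is a single scalar, so sess.run() never has to
# copy the full decoded frame back into a numpy array.
import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import tf_tensors

with make_reader(input_path) as reader:          # same input_path as in the benchmark
    row_tensors = tf_tensors(reader)             # build the read graph once
    scalar = tf.size(row_tensors.frame)          # scalar output instead of the full frame
    with tf.Session() as sess:
        try:
            while True:
                sess.run(scalar)                 # only a scalar crosses the TF -> Python boundary
        except (tf.errors.OutOfRangeError, StopIteration):
            pass                                 # reader exhausted (exact exception may vary)
```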
Thanks for the quick response!
It makes perfect sense. Unfortunately, I can't spend time on benchmarking data without decoding images, as that's not really my use case. It would be cool if you could trace whether there is actually some (unnecessary?) memory copy there.
Regarding tf.data.Dataset - well, the actual measurements were done with the code calculating the size of the data commented out (see the updated snippet below), so there should not be any cost from the numpy() operations, and I never defined or used any graph myself. You were right about the GIL, though: testing with process workers shows performance increasing with the number of workers. It is comparable to process workers with the pure Python reader, but significantly slower than thread workers with the pure Python reader.
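For reference, switching between the two pool types is just a matter of the `reader_pool_type` argument to `make_reader` (paths and worker counts below are placeholders):

```python
from petastorm import make_reader

# Thread pool: workers are cheap, but Python-side decoding contends on the GIL.
with make_reader('file:///path/to/dataset', reader_pool_type='thread', workers_count=10) as reader:
    for row in reader:
        pass  # consume rows

# Process pool: sidesteps the GIL at the cost of passing rows between processes.
with make_reader('file:///path/to/dataset', reader_pool_type='process', workers_count=10) as reader:
    for row in reader:
        pass  # consume rows
```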
If you think that tf.data.Dataset is bad, then tf_tensors is horrible :). It performs very poorly and stays dead flat regardless of the number or type of workers used. There is of course a chance that I made some silly mistake in the code below, too. Please check.
Here is the new chart and the new spreadsheet: Results.xlsx. Fewer than 5 data points were taken in some cases, as the runs are pretty time consuming and there was not much variance in the results anyway. BTW, it was run on a machine with an Intel Core i9-7920X CPU @ 2.90GHz (12 cores/24 threads) and 64GB of RAM, TF version 2.1.0.
The code:
```python
# Imports assumed by this snippet (the rest of the benchmark script is omitted):
import logging
import time

import tensorflow as tf
from petastorm import make_reader
from petastorm.tf_utils import make_petastorm_dataset, tf_tensors

with make_reader(input_path, workers_count=workers_count, reader_pool_type=workers_type,
                 hdfs_driver=hdfs_driver) as reader:
    logging.debug("Number of workers: {}".format(reader._workers_pool.workers_count))
    if mode == 'python':
        # Pure Python reader API
        tic2 = time.time()
        for row in reader:
            total_number_of_files += 1
            frame = row.frame
            # total_frames_size += frame.size
            annotations = row.annotations
            # total_annotations_size += len(annotations)
    if mode == 'tf_dataset':
        # TensorFlow tf.data.Dataset API
        dataset = make_petastorm_dataset(reader)
        tic2 = time.time()
        for tensors in dataset:
            total_number_of_files += 1
            # total_frames_size += tensors.frame.numpy().size
            # total_annotations_size += len(tensors.annotations.numpy())
    if mode == 'tf_tensors':
        # TensorFlow tf_tensors API (TF1-style graph and session)
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
            tic2 = time.time()
            try:
                while True:
                    row_tensors = tf_tensors(reader)
                    sample = sess.run(row_tensors)
                    # total_frames_size += sample.frame.size
                    # total_annotations_size += len(sample.annotations)
                    total_number_of_files += 1
            except Exception:
                logging.debug("Done. No more items in the data set")
```
Do you plan any work related to this?
Sorry, I lost track of this issue.
I reran your benchmark using this PR: #602
I do not see that significant a difference between pure Python and tf_tensors/tf_dataset (please note that I moved the tf_tensors(reader) call out of the while loop).
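In other words, the tf_tensors loop becomes roughly the following (a sketch, not the exact code that was run; the end-of-data exception handling here is an assumption):

```python
with tf.Session() as sess:
    row_tensors = tf_tensors(reader)        # build the read graph once, outside the loop
    try:
        while True:
            sample = sess.run(row_tensors)  # each run fetches the next row from the reader
            total_number_of_files += 1
    except (tf.errors.OutOfRangeError, StopIteration):
        logging.debug("Done. No more items in the data set")
```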
Sum of rate (higher is better), broken down by pool type and reader API:

| threads | process: python | process: tf_dataset | process: tf_tensors | thread: python | thread: tf_dataset | thread: tf_tensors |
|---:|---:|---:|---:|---:|---:|---:|
| 1 | 25.55 | 25.64 | 24.46 | 67.92 | 65.39 | 57.96 |
| 5 | 87.88 | 81.01 | 81.93 | 179.75 | 97.25 | 90.34 |
| 10 | 111.46 | 105.79 | 97.67 | 198.16 | 87.77 | 88.00 |
| 20 | 109.63 | 97.28 | 87.00 | 188.86 | 80.94 | 78.11 |
| 30 | 103.15 | 87.76 | 84.84 | 180.19 | 75.44 | 77.18 |
| 40 | 91.68 | 79.59 | 75.30 | 197.83 | 74.69 | 68.73 |
Looking at the thread pool version (for the process pool the graphs are almost exactly the same): indeed, pure Python is faster, but only by a factor of 2 and not 6 as in your experiment. Perhaps that can be attributed to moving data into TF (numpy -> tf tensor) and back (tf tensor -> numpy). That said, a factor of 2 is pretty significant, although not entirely out of the ballpark (I was expecting something closer to x1.2 - x1.4).
Also, in my experiments I observe the same performance for tf_tensors and tf.data.Dataset.
My raw results: Results.xlsx
Hello,
I made some benchmarks based on a data set of over 2,600 PNG images with JSON annotations, totaling 3.9GB. My baseline was simply reading all of them from folders on a local ext4 file system on an NVMe SSD drive. Images were loaded and decoded with OpenCV imread() and the JSON files with a plain file read(). I pushed the very same data into a Petastorm data set, stored on the same partition of that SSD, partitioned into 4 Parquet files of roughly 950MB each, with a row group size of 128MB and a very simple schema:
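Something along these lines (a sketch only; the two field names match the benchmark code, but the image shape and the codec for the annotations field are illustrative assumptions):

```python
import numpy as np
from pyspark.sql.types import StringType

from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField

# Sketch only: the image shape and the annotations codec are assumptions.
BenchmarkSchema = Unischema('BenchmarkSchema', [
    UnischemaField('frame', np.uint8, (1080, 1920, 3), CompressedImageCodec('png'), False),
    UnischemaField('annotations', np.str_, (), ScalarCodec(StringType()), False),
])
```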
Then I made read benchmarks with 1, 2, 5, 10 and 20 workers, using the following code:
And the results were surprising:
Here is the chart summarizing my measurements, and an Excel sheet with the details, if someone is interested: Results.xlsx