uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Performance benchmarks - issues with tf.data.Dataset API reader and question about the pure Python one #584

Open · filipski opened this issue 4 years ago

filipski commented 4 years ago

Hello,

I made some benchmarks based on a data set of over 2600 PNG images with JSON annotations, totaling 3.9 GB. My baseline was simply reading all of them from folders on a local ext4 file system on an NVMe SSD drive. Images were loaded and decoded with OpenCV imread() and the JSONs with a simple file read(). I pushed the very same data into a Petastorm data set stored on the same partition of that SSD drive, partitioned into 4 Parquet files of roughly 950 MB each, with a row group size of 128 MB and a very simple schema:

import numpy as np
from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField
from pyspark.sql.types import StringType

BenchmarkSchema = Unischema('BenchmarkSchema', [
    # Frames: 1080x1280x3 uint8 images, PNG-compressed on disk
    UnischemaField('frame', np.uint8, (1080, 1280, 3), CompressedImageCodec('png'), True),
    # Annotations: the raw JSON string for each frame
    UnischemaField('annotations', np.string_, (), ScalarCodec(StringType()), True)
])
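
For context, writing such a dataset typically goes through Petastorm's materialize_dataset inside a Spark job. Below is a minimal sketch, not taken from the issue: the output path, the list of file pairs and the build_row() helper are hypothetical, while the 4-way repartition and the 128 MB row group size match the setup described above.

    import cv2
    from pyspark.sql import SparkSession
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import dict_to_spark_row

    def build_row(png_path, json_path):
        # Load/decode as described above: OpenCV imread() plus a plain file read().
        frame = cv2.imread(png_path)          # (1080, 1280, 3) uint8
        with open(json_path) as f:
            annotations = f.read()
        return {'frame': frame, 'annotations': annotations}

    spark = SparkSession.builder.master('local[*]').getOrCreate()
    output_url = 'file:///path/to/benchmark_dataset'   # hypothetical path
    file_pairs = [...]                                 # hypothetical list of (png_path, json_path) tuples

    # Row group size of 128 MB, written as 4 Parquet files.
    with materialize_dataset(spark, output_url, BenchmarkSchema, 128):
        rows_rdd = spark.sparkContext.parallelize(file_pairs) \
            .map(lambda pair: dict_to_spark_row(BenchmarkSchema, build_row(*pair)))
        spark.createDataFrame(rows_rdd, BenchmarkSchema.as_spark_schema()) \
            .repartition(4) \
            .write.mode('overwrite').parquet(output_url)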

Then I made read benchmarks with 1, 2, 5, 10 and 20 workers, using the following code:

    with make_reader(input_path, workers_count=workers_count, hdfs_driver=hdfs_driver) as reader:
        logging.debug("Number of workers: {}".format(reader._workers_pool.workers_count))
        if mode == 'python':
            # Pure python
            tic2 = time.time()
            for row in reader:
                total_number_of_files += 1
                frame = row.frame
                total_frames_size += frame.size
                annotations = row.annotations
                total_annotations_size += len(annotations)

        if mode == 'tf_dataset':
            # Tensorflow tf.data.Dataset API
            dataset = make_petastorm_dataset(reader)
            tic2 = time.time()
            for tensors in dataset:
                total_number_of_files += 1
                total_frames_size += tensors.frame.numpy().size
                total_annotations_size += len(tensors.annotations.numpy())
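
The snippet starts the clock at tic2, but the end of the measurement is not shown; presumably the elapsed time and rate were computed after the loop, along these lines (variable names are assumed):

    toc = time.time()
    elapsed = toc - tic2
    logging.info("Read {} rows in {:.1f} s ({:.1f} rows/s)".format(
        total_number_of_files, elapsed, total_number_of_files / elapsed))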

And the results were surprising:

Here is the chart summarizing my measurements (SpeedupChart, attached) and an Excel sheet with the details, if anyone is interested: Results.xlsx

selitvin commented 4 years ago

These are interesting results - thank you for doing the benchmark!

filipski commented 4 years ago

Thanks for the quick response!

  1. It makes perfect sense. Unfortunately, I can't spend time on benchmarking the data without decoding the images, as it's not really my use case. It would be cool if you could trace whether there's actually some (unnecessary?) memory copy there.

  2. tf.data.Dataset - well, the actual measurements were done with the code that calculates the size of the data commented out (see the updated snippet below), so there should not be any cost from the numpy() calls. And I never defined or used any graph. You were right about the GIL, though: testing with process workers shows performance increasing with the number of workers. It's comparable to process workers with the pure Python reader, but significantly slower than thread workers with the pure Python reader.

  3. If you think tf.data.Dataset is bad, then tf_tensors is horrible :). Its performance is both very low and dead flat regardless of the number or type of workers used. There's of course a chance that I made some silly mistake in the code below, too. Please check.

Here's a new chart and a new spreadsheet: Results.xlsx. Fewer than 5 data points were taken in some cases, as it's pretty time consuming and there was not much variance in the results anyway. BTW, it was run on a machine with an Intel Core i9-7920X CPU @ 2.90GHz (12 cores / 24 threads) and 64GB of RAM. TF version 2.1.0.

(chart: throughput vs. number of workers, per reader mode and worker pool type)

The code:

with make_reader(input_path, workers_count=workers_count, reader_pool_type=workers_type, hdfs_driver=hdfs_driver) as reader:
    logging.debug("Number of workers: {}".format(reader._workers_pool.workers_count))
    if mode == 'python':
        # Pure python
        tic2 = time.time()
        for row in reader:
            total_number_of_files += 1
            frame = row.frame
            #total_frames_size += frame.size
            annotations = row.annotations
            #total_annotations_size += len(annotations)

    if mode == 'tf_dataset':
        # Tensorflow tf.data.Dataset API
        dataset = make_petastorm_dataset(reader)
        tic2 = time.time()
        for tensors in dataset:
            total_number_of_files += 1
            #total_frames_size += tensors.frame.numpy().size
            #total_annotations_size += len(tensors.annotations.numpy())

    if mode == 'tf_tensors':
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
            tic2 = time.time()
            try:
                while True:
                    row_tensors = tf_tensors(reader)
                    sample = sess.run(row_tensors)
                    #total_frames_size += sample.frame.size
                    #total_annotations_size += len(sample.annotations)
                    total_number_of_files += 1
            except:
                logging.debug("Done. No more items in the data set")

filipski commented 4 years ago

Do you plan any work related to this?

selitvin commented 4 years ago

Sorry, I lost track of this issue. I reran your benchmark using this PR: #602. I do not see that significant a difference between pure Python and tf_tensors/tf_dataset (please note that I moved the tf_tensors(reader) call out of the while loop).
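
For reference, the corrected tf_tensors loop would look roughly like this. It is a sketch that reuses the placeholders from the snippet above (input_path, workers_count, workers_type, hdfs_driver) and its TF1-style session API (under TF 2.x this lives in tf.compat.v1); the row tensors are built once, outside the loop:

    import logging
    import time

    import tensorflow as tf
    from petastorm import make_reader
    from petastorm.tf_utils import tf_tensors

    total_number_of_files = 0
    with make_reader(input_path, workers_count=workers_count,
                     reader_pool_type=workers_type, hdfs_driver=hdfs_driver) as reader:
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
            # Build the row tensors once; each sess.run() then pulls the next row.
            row_tensors = tf_tensors(reader)
            tic2 = time.time()
            try:
                while True:
                    sample = sess.run(row_tensors)
                    total_number_of_files += 1
            except tf.errors.OutOfRangeError:
                logging.debug("Done. No more items in the data set")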

Sum of rate, per worker pool type and reader mode:

threads | process: python | process: tf_dataset | process: tf_tensors | thread: python | thread: tf_dataset | thread: tf_tensors
1  | 25.5548183979054 | 25.643627818292  | 24.4595838463214 | 67.9241246647714 | 65.3858062489132 | 57.9563138896486
5  | 87.8800553939457 | 81.014358367927  | 81.9316672614454 | 179.749914165352 | 97.2471961061929 | 90.3378716746835
10 | 111.460690113826 | 105.78842708046  | 97.6678109551823 | 198.158347564204 | 87.7691973016213 | 87.9979016308476
20 | 109.634276342388 | 97.2842802351933 | 86.9972991619546 | 188.85918297127  | 80.9359250634738 | 78.1072371696909
30 | 103.148657539689 | 87.7607569099133 | 84.8377671658696 | 180.189079567539 | 75.4382755851483 | 77.1775098514257
40 | 91.6783230753876 | 79.5871777955041 | 75.299993024394  | 197.831277338971 | 74.6876305859258 | 68.7289296843975

Looking at the thread pool version (for the process pool the graphs look almost exactly the same):

(chart: thread pool throughput vs. number of workers, per reader mode)

Indeed, pure Python is faster, but only by a factor of 2, not 6 as in your experiment. Perhaps that can be explained by moving the data into TF (numpy -> tf tensor) and back (tf tensor -> numpy). That said, a factor of 2 is pretty significant, although not entirely out of the ballpark (I was expecting something closer to x1.2 - x1.4).
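
One way to sanity-check the numpy <-> TF tensor hypothesis is to time that round-trip in isolation for frames of this size. A standalone sketch (not part of the benchmark above; the iteration count is arbitrary):

    import time

    import numpy as np
    import tensorflow as tf

    # A frame shaped like the benchmark schema: 1080 x 1280 x 3 uint8, roughly 4 MB.
    frame = np.random.randint(0, 256, size=(1080, 1280, 3), dtype=np.uint8)

    n = 200
    start = time.time()
    for _ in range(n):
        t = tf.convert_to_tensor(frame)  # numpy -> tf.Tensor (typically copies)
        _ = t.numpy()                    # tf.Tensor -> numpy (may copy back)
    print('{:.0f} round-trips/s'.format(n / (time.time() - start)))

If that rate comes out well above the row rates in the table above, raw tensor conversion alone would not account for the remaining x2 gap.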

Also, in my experiments I observe the same performance for tf_tensors and tf.data.Dataset.

My raw results: Results.xlsx