uber / petastorm

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
Apache License 2.0

Performance benchmarks - issues with tf.data.Dataset API reader and question about the pure Python one #584

Open · filipski opened this issue 4 years ago

filipski commented 4 years ago

Hello,

I made some benchmarks based on a data set of over 2600 PNG images with JSON annotations, totaling 3.9 GB. My baseline was simply reading all of them from folders on a local ext4 file system on an NVMe SSD drive. Images were loaded and decoded with OpenCV imread() and the JSONs with a simple file read(). I pushed the very same data into a Petastorm data set stored on the same partition of that SSD drive, partitioned into 4 Parquet files of roughly 950 MB each, with a row group size of 128 MB and a very simple schema:

import numpy as np
from petastorm.codecs import CompressedImageCodec, ScalarCodec
from petastorm.unischema import Unischema, UnischemaField
from pyspark.sql.types import StringType

BenchmarkSchema = Unischema('BenchmarkSchema', [
    # Frames: 1080x1280x3 uint8 images, PNG-compressed on disk
    UnischemaField('frame', np.uint8, (1080, 1280, 3), CompressedImageCodec('png'), True),
    # Annotations: the raw JSON string for each frame
    UnischemaField('annotations', np.string_, (), ScalarCodec(StringType()), True)
])
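
For context, writing such a dataset typically goes through Petastorm's materialize_dataset inside a Spark job. Below is a minimal sketch, not taken from the issue: the output path, the list of file pairs and the build_row() helper are hypothetical, while the 4-way repartition and the 128 MB row group size match the setup described above.

    import cv2
    from pyspark.sql import SparkSession
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import dict_to_spark_row

    def build_row(png_path, json_path):
        # Load/decode as described above: OpenCV imread() plus a plain file read().
        frame = cv2.imread(png_path)          # (1080, 1280, 3) uint8
        with open(json_path) as f:
            annotations = f.read()
        return {'frame': frame, 'annotations': annotations}

    spark = SparkSession.builder.master('local[*]').getOrCreate()
    output_url = 'file:///path/to/benchmark_dataset'   # hypothetical path
    file_pairs = [...]                                 # hypothetical list of (png_path, json_path) tuples

    # Row group size of 128 MB, written as 4 Parquet files.
    with materialize_dataset(spark, output_url, BenchmarkSchema, 128):
        rows_rdd = spark.sparkContext.parallelize(file_pairs) \
            .map(lambda pair: dict_to_spark_row(BenchmarkSchema, build_row(*pair)))
        spark.createDataFrame(rows_rdd, BenchmarkSchema.as_spark_schema()) \
            .repartition(4) \
            .write.mode('overwrite').parquet(output_url)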

Then I made read benchmarks with 1, 2, 5, 10 and 20 workers, using the following code:

    with make_reader(input_path, workers_count=workers_count, hdfs_driver=hdfs_driver) as reader:
        logging.debug("Number of workers: {}".format(reader._workers_pool.workers_count))
        if mode == 'python':
            # Pure python
            tic2 = time.time()
            for row in reader:
                total_number_of_files += 1
                frame = row.frame
                total_frames_size += frame.size
                annotations = row.annotations
                total_annotations_size += len(annotations)

        if mode == 'tf_dataset':
            # Tensorflow tf.data.Dataset API
            dataset = make_petastorm_dataset(reader)
            tic2 = time.time()
            for tensors in dataset:
                total_number_of_files += 1
                total_frames_size += tensors.frame.numpy().size
                total_annotations_size += len(tensors.annotations.numpy())
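
The snippet starts the clock at tic2, but the end of the measurement is not shown; presumably the elapsed time and rate were computed after the loop, along these lines (variable names are assumed):

    toc = time.time()
    elapsed = toc - tic2
    logging.info("Read {} rows in {:.1f} s ({:.1f} rows/s)".format(
        total_number_of_files, elapsed, total_number_of_files / elapsed))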

And the results were surprising:

Here is the chart summarizing my measurements (SpeedupChart, attached) and an Excel sheet with the details, if anyone is interested: Results.xlsx

selitvin commented 4 years ago

These are interesting results - thank you for doing the benchmark!

filipski commented 4 years ago

Thanks for the quick response!

  1. It makes perfect sense. Unfortunately, I can't spend time on benchmarking the data without decoding the images, as it's not really my use case. It would be cool if you could trace whether there's actually some (unnecessary?) memory copy there.

  2. tf.data.Dataset - well, the actual measurements were done with the code that calculates the size of the data commented out (see the updated snippet below), so there should not be any cost from the numpy() calls. And I never defined or used any graph. You were right about the GIL, though: testing with process workers shows performance increasing with the number of workers. It's comparable to process workers with the pure Python reader, but significantly slower than thread workers with the pure Python reader.

  3. If you think tf.data.Dataset is bad, then tf_tensors is horrible :). Its performance is both very low and dead flat regardless of the number or type of workers used. There's of course a chance that I made some silly mistake in the code below, too. Please check.

Here's a new chart and a new spreadsheet: Results.xlsx. Fewer than 5 data points were taken in some cases, as it's pretty time consuming and there was not much variance in the results anyway. BTW, it was run on a machine with an Intel Core i9-7920X CPU @ 2.90GHz (12 cores / 24 threads) and 64GB of RAM. TF version 2.1.0.

(chart: throughput vs. number of workers, per reader mode and worker pool type)

The code:

with make_reader(input_path, workers_count=workers_count, reader_pool_type=workers_type, hdfs_driver=hdfs_driver) as reader:
    logging.debug("Number of workers: {}".format(reader._workers_pool.workers_count))
    if mode == 'python':
        # Pure python
        tic2 = time.time()
        for row in reader:
            total_number_of_files += 1
            frame = row.frame
            #total_frames_size += frame.size
            annotations = row.annotations
            #total_annotations_size += len(annotations)

    if mode == 'tf_dataset':
        # Tensorflow tf.data.Dataset API
        dataset = make_petastorm_dataset(reader)
        tic2 = time.time()
        for tensors in dataset:
            total_number_of_files += 1
            #total_frames_size += tensors.frame.numpy().size
            #total_annotations_size += len(tensors.annotations.numpy())

    if mode == 'tf_tensors':
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
            tic2 = time.time()
            try:
                while True:
                    row_tensors = tf_tensors(reader)
                    sample = sess.run(row_tensors)
                    #total_frames_size += sample.frame.size
                    #total_annotations_size += len(sample.annotations)
                    total_number_of_files += 1
            except:
                logging.debug("Done. No more items in the data set")

filipski commented 4 years ago

Do you plan any work related to this?

selitvin commented 4 years ago

Sorry, I lost track of this issue. I reran your benchmark using this PR: #602. I do not see that significant a difference between pure Python and tf_tensors/tf_dataset (please note that I moved the tf_tensors(reader) call out of the while loop).
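
For reference, the corrected tf_tensors loop would look roughly like this. It is a sketch that reuses the placeholders from the snippet above (input_path, workers_count, workers_type, hdfs_driver) and its TF1-style session API (under TF 2.x this lives in tf.compat.v1); the row tensors are built once, outside the loop:

    import logging
    import time

    import tensorflow as tf
    from petastorm import make_reader
    from petastorm.tf_utils import tf_tensors

    total_number_of_files = 0
    with make_reader(input_path, workers_count=workers_count,
                     reader_pool_type=workers_type, hdfs_driver=hdfs_driver) as reader:
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
            # Build the row tensors once; each sess.run() then pulls the next row.
            row_tensors = tf_tensors(reader)
            tic2 = time.time()
            try:
                while True:
                    sample = sess.run(row_tensors)
                    total_number_of_files += 1
            except tf.errors.OutOfRangeError:
                logging.debug("Done. No more items in the data set")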

Sum of rate, per worker pool type and reader mode:

threads | process: python | process: tf_dataset | process: tf_tensors | thread: python | thread: tf_dataset | thread: tf_tensors
1  | 25.5548183979054 | 25.643627818292  | 24.4595838463214 | 67.9241246647714 | 65.3858062489132 | 57.9563138896486
5  | 87.8800553939457 | 81.014358367927  | 81.9316672614454 | 179.749914165352 | 97.2471961061929 | 90.3378716746835
10 | 111.460690113826 | 105.78842708046  | 97.6678109551823 | 198.158347564204 | 87.7691973016213 | 87.9979016308476
20 | 109.634276342388 | 97.2842802351933 | 86.9972991619546 | 188.85918297127  | 80.9359250634738 | 78.1072371696909
30 | 103.148657539689 | 87.7607569099133 | 84.8377671658696 | 180.189079567539 | 75.4382755851483 | 77.1775098514257
40 | 91.6783230753876 | 79.5871777955041 | 75.299993024394  | 197.831277338971 | 74.6876305859258 | 68.7289296843975

Looking at the thread pool version (for the process pool the graphs look almost exactly the same):

(chart: thread pool throughput vs. number of workers, per reader mode)

Indeed, pure Python is faster, but only by a factor of 2, not 6 as in your experiment. Perhaps that can be explained by moving the data into TF (numpy -> tf tensor) and back (tf tensor -> numpy). That said, a factor of 2 is pretty significant, although not entirely out of the ballpark (I was expecting something closer to x1.2 - x1.4).
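
One way to sanity-check the numpy <-> TF tensor hypothesis is to time that round-trip in isolation for frames of this size. A standalone sketch (not part of the benchmark above; the iteration count is arbitrary):

    import time

    import numpy as np
    import tensorflow as tf

    # A frame shaped like the benchmark schema: 1080 x 1280 x 3 uint8, roughly 4 MB.
    frame = np.random.randint(0, 256, size=(1080, 1280, 3), dtype=np.uint8)

    n = 200
    start = time.time()
    for _ in range(n):
        t = tf.convert_to_tensor(frame)  # numpy -> tf.Tensor (typically copies)
        _ = t.numpy()                    # tf.Tensor -> numpy (may copy back)
    print('{:.0f} round-trips/s'.format(n / (time.time() - start)))

If that rate comes out well above the row rates in the table above, raw tensor conversion alone would not account for the remaining x2 gap.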

Also, in my experiments I observe the same performance for tf_tensors and tf.data.Dataset.

My raw results: Results.xlsx